Building Windsurf with Varun Mohan — The Pragmatic Engineer

Windsurf is an AI-native IDE built by a team with deep infrastructure and autonomous vehicle backgrounds, and this episode explores the engineering challenges behind making AI coding tools work at scale. Varun Mohan, co-founder and CEO, explains why the team built their own LLMs, how they handle retrieval and latency, and why he believes AI will increase—not decrease—the value of software engineers.

How Windsurf evaluates new models

The team treats model evaluation like autonomous vehicle testing: because models are non-deterministic, they built extensive simulation-style evaluation suites rather than relying on public benchmarks.
Their evaluation infrastructure tests multiple layers:
- Retrieval accuracy: did the model find the right files?
- High-level intent: did it correctly understand what the commit should do?
- Edit accuracy: did it make the correct changes?
- Redundant changes: did it avoid unnecessary steps that slow down the developer?
They use real open-source repositories with known commits, reconstruct high-level descriptions of what each commit intended, and then programmatically verify whether the model retrieves the right files, forms the right intent, and produces correct edits.
This approach lets them evaluate across tens of thousands of repositories in minutes, rather than relying on slow human A/B testing.
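A minimal sketch of how such a commit-based check could be scored, assuming each case stores the files the real commit touched and their final contents. The `CommitCase` and `ModelRun` schemas and the exact-match comparison are illustrative assumptions, not Windsurf's harness; scoring the intent layer would need a separate judge and is left out.

```python
from dataclasses import dataclass


@dataclass
class CommitCase:
    """One evaluation case reconstructed from a real commit (hypothetical schema)."""
    description: str                # high-level intent reconstructed from the commit
    changed_files: set[str]         # files the real commit touched
    expected_edits: dict[str, str]  # file path -> file content after the commit


@dataclass
class ModelRun:
    """What the system under test produced for one case (hypothetical schema)."""
    retrieved_files: set[str]
    edits: dict[str, str]           # file path -> proposed file content


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else 0.0


def score(case: CommitCase, run: ModelRun) -> dict[str, float]:
    """Score one run on retrieval, edit accuracy, and redundant changes."""
    # Retrieval: how many of the truly changed files did the system find?
    found = case.changed_files & run.retrieved_files

    # Edit accuracy: exact match against the post-commit content (an assumption;
    # a real harness might run tests or compare syntax trees instead).
    correct = sum(
        1 for path, content in case.expected_edits.items()
        if run.edits.get(path) == content
    )

    # Redundant changes: edits to files the real commit never touched.
    redundant = [path for path in run.edits if path not in case.expected_edits]

    return {
        "retrieval_recall": _ratio(len(found), len(case.changed_files)),
        "edit_accuracy": _ratio(correct, len(case.expected_edits)),
        "redundant_rate": _ratio(len(redundant), len(run.edits)),
    }
```

Because every metric is computed against a known commit, a batch of cases can be scored without a human in the loop, which is what makes this style of evaluation fast.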

Windsurf’s origin story

The company started nearly four years ago (before the ChatGPT moment) as Codeium, originally building GPU virtualization infrastructure for deep learning workloads in robotics and autonomous vehicles.
When text-davinci-003 emerged in mid-2022, the founders realized the model landscape was consolidating around large generative models, so they pivoted to the application layer.
They built autocomplete first because existing open-source models (like Salesforce’s CodeGen) were missing critical capabilities for code, especially fill-in-the-middle—the ability to predict code that goes in the middle of a line or function, not just at the end.
Fill-in-the-middle requires specialized training because code tokenized mid-line is out-of-distribution for standard language models; for example, a model trained on “return” followed by a newline will never have seen “return” without one.
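To make the idea concrete, here is a sketch of how a fill-in-the-middle training example is commonly constructed: the file is cut at two random character offsets and reordered as prefix, suffix, middle, so a left-to-right model learns to generate the missing span. The sentinel strings are hypothetical placeholders, and the episode does not say which layout Windsurf's models use.

```python
import random

# Hypothetical sentinel strings; real models define their own special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def make_fim_example(code: str, rng: random.Random) -> str:
    """Cut `code` at two random character offsets and emit it in
    prefix, suffix, middle order so the middle is predicted last."""
    start, end = sorted(rng.randrange(len(code) + 1) for _ in range(2))
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def split_fim_example(example: str) -> tuple[str, str, str]:
    """Recover (prefix, middle, suffix) from a formatted example.

    Assumes the original code does not itself contain the sentinel strings.
    """
    body = example[len(FIM_PREFIX):]
    prefix, rest = body.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix, middle, suffix
```

Cutting at character offsets rather than line boundaries means the cut often lands inside a word such as “return”, which exposes the model during training to the mid-line tokenization it will see at inference time.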
They provided autocomplete for free to all developers, then added enterprise features like security and personalization.
In 2024, they built Windsurf as a fork of VS Code (specifically Code OSS, the open-source base) because VS Code wasn’t evolving fast enough to support agentic workflows where AI writes most of the code but developers stay in the loop.

The current scale of Windsurf

The engineering team is just over 50 people, with close to half working on projects not yet shipped.
Within months of launch, over a million developers had tried the product; within a month of introducing pricing, the company reached eight figures in ARR.
Their own models process over 500 billion tokens of code per day for retrieval and autocomplete.
They still run their own inference stack in many places, including fast autocomplete and codebase retrieval models.

Why they built their own models

Beyond fill-in-the-middle, code has unique properties that benefit from custom models:
- Code is parseable, enabling abstract syntax tree analysis and knowledge graph construction from commit history.
- Commit histories reveal conditional probabilities: if you change function X, you’re likely to change function Y.
- Pull request comments provide signal on what good and bad code looks like within a specific company.
These properties let them build rich understanding of codebases that generic text models can’t achieve out of the box.
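The conditional-probability signal from commit history can be illustrated with a simple co-change count. This is only a sketch under the assumption that each commit has already been reduced to the set of function names it changed; how Windsurf actually extracts or models this signal is not described in the episode.

```python
from collections import Counter
from itertools import permutations


def cochange_probabilities(commits: list[set[str]]) -> dict[tuple[str, str], float]:
    """Estimate P(y changes | x changes) for every ordered pair (x, y)
    that appears together in at least one commit."""
    changed = Counter()   # number of commits touching each symbol
    together = Counter()  # number of commits touching both x and y
    for symbols in commits:
        changed.update(symbols)
        together.update(permutations(sorted(symbols), 2))
    return {(x, y): n / changed[x] for (x, y), n in together.items()}
```

Given such a table, a retrieval system could surface function Y as a likely candidate whenever the developer edits function X.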

Fine-tuning on company codebases

They built infrastructure to fine-tune models on customer codebases, including a system that performed backpropagation on transformer layers during idle GPU time, with preemption at every layer boundary so inference requests wouldn’t be delayed.
They found fine-tuning gave only a modest bump compared to better retrieval and personalization.
Varun’s framework: climb the easier hills first (better retrieval) before investing in harder ones (fine-tuning), to avoid unnecessary technical complexity.
He believes per-person fine-tuning could still become valuable with different techniques.
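The layer-boundary preemption described above can be pictured with a single-threaded simulation. This is a deliberately simplified sketch: `backward_layer` and `serve` are hypothetical callbacks standing in for real GPU work, and the actual system presumably coordinates this on the device rather than in a Python loop.

```python
from collections import deque
from typing import Callable


def backward_with_preemption(
    num_layers: int,
    backward_layer: Callable[[int], None],
    inference_queue: deque,
    serve: Callable[[object], None],
) -> list[str]:
    """Run a backward pass one layer at a time, from the last layer to the
    first, and drain any waiting inference requests at each layer boundary."""
    log = []
    for layer in reversed(range(num_layers)):
        # Inference always wins: serve everything queued before training resumes.
        while inference_queue:
            request = inference_queue.popleft()
            serve(request)
            log.append(f"serve:{request}")
        backward_layer(layer)
        log.append(f"backward:{layer}")
    return log
```

In this sketch the longest an inference request can wait behind training is the time of one layer's backward step, which is the property the layer-boundary design aims for.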

The scaling challenge: growing codebases and context windows

As projects grow—whether from AI-generated code or organic development—models struggle with limited context windows.
There is no single solution; Windsurf is working on a combination of:
- Better checkpointing of long conversations
- More efficient use of context length
- Faster LLMs
- Better retrieval using knowledge graphs and codebase dependency analysis
Varun emphasizes that when you make a developer wait for a response, the answer needs to be nearly correct—and current models aren’t reliable enough to guarantee that for complex, long-context tasks.

Infrastructure and latency optimization

Windsurf maintains tight control over their stack partly because enterprise customers (like Dell, their first large customer) have strict requirements around subprocessors and data handling.
They achieved FedRAMP High compliance—the only AI coding assistant to do so—by keeping their systems minimal and auditable.
Latency is their top engineering challenge:
- They target sub-200-millisecond time-to-first-token and hundreds of tokens per second generation, far exceeding typical API provider performance.
- GPUs have roughly two orders of magnitude more compute than CPUs but only one order of magnitude more memory bandwidth, so non-compute-intensive operations become memory-bound.
- To maximize GPU utilization, they need to batch requests, but batching increases latency—creating a fundamental tradeoff.
- A 10-millisecond increase in latency measurably reduces users’ willingness to use the product, by an amount on the order of percentage points.
Data center placement matters for speed-of-light reasons, though in some regions (like India), local network congestion is a bigger bottleneck than physical distance.
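The compute-versus-bandwidth point and the batching tradeoff can be made concrete with a roofline-style back-of-the-envelope calculation. The hardware figures below are illustrative round numbers, not any specific accelerator, and the model ignores activation traffic; it only counts weight reads for a dense layer.

```python
def arithmetic_intensity(batch_size: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for a dense layer applied to a batch.

    Each weight contributes one multiply and one add per token (2 FLOPs) and
    is read from memory once per batch, so intensity grows with batch size.
    """
    return 2 * batch_size / bytes_per_weight


def is_memory_bound(batch_size: int, peak_flops: float, bandwidth_bytes: float) -> bool:
    """True when memory traffic, not arithmetic, limits the layer's speed."""
    machine_balance = peak_flops / bandwidth_bytes  # FLOPs available per byte moved
    return arithmetic_intensity(batch_size) < machine_balance


# Assumed, illustrative hardware: 300 TFLOP/s compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 300e12
BANDWIDTH = 2e12
```

With these assumed numbers the machine balance is 150 FLOPs per byte, so a single request (intensity 1) leaves the compute units mostly idle, while a batch of a few hundred tokens saturates them. Reaching that batch size means waiting for more requests to arrive, which is where the latency cost comes from.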

Codebase indexing and retrieval

Windsurf uses a hybrid retrieval approach combining:
- Embedding-based search
- Keyword search
- Knowledge graph traversal (from commit history and AST dependencies)
They do significant computation at retrieval time, fusing results from multiple methods and then processing large chunks of the codebase during inference to identify the most relevant snippets.
This gives them much higher precision and recall than embedding search alone, which is lossy and struggles with comprehensive results for specific queries (e.g., “find all uses of this Spring Boot function”).
For local indexing, they use a combination of a local SQL database and embedding databases on the user’s machine, which helps track user history and recent changes.
For remote data, they keep it simple with PostgreSQL, since their QPS is manageable—each query is expensive (trillions of operations), so the challenge is optimization, not raw throughput.
On embedding databases specifically: Varun sees them as one tool in the toolkit, not a standalone solution. They improve recall (especially for handling typos and semantic similarity) but can’t replace keyword or graph-based methods.
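One simple way to fuse ranked results from several retrievers is reciprocal rank fusion (RRF). The episode does not say which fusion method Windsurf uses, so treat this as a stand-in technique chosen for illustration; the file names and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists into one.

    Each document scores 1 / (k + rank) in every list it appears in, so items
    ranked well by several retrievers rise to the top. Ties break by name.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

For example, fusing an embedding ranking `["a.py", "b.py", "c.py"]`, a keyword ranking `["b.py", "d.py"]`, and a graph ranking `["b.py", "a.py"]` puts `b.py` first because all three methods agree on it, even though the embedding search alone preferred `a.py`.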

Balancing present and long-term: “the split brain situation”

Varun and his co-founder focus on long-term strategic direction and self-disruption, while the engineering team ships incremental improvements.
They embrace failure as a learning mechanism:
- Their agent work (which became Cascade) didn’t work for many months before models improved enough.
- They shipped a code review tool called Forge that users didn’t find useful, so they killed it.
- Varun is comfortable with 50% of bets not working out.
The key tension: users are right about today’s pain points, but the company needs to invest engineering resources in a longer-term vision that users can’t yet articulate.

Breakthroughs that made Cascade work

Three things came together:
1. Models got better—this was essential and not something they could control.
2. Retrieval improved—enabling the agent to work effectively on large, complex codebases, not just greenfield projects.
3. Fast, reliable code editing—they built systems to take high-level plans and execute code changes quickly, which changed internal developer behavior.
The combination of speed, codebase understanding, and model capability created a workflow developers actually wanted to use.

Dogfooding Windsurf

They have an “insiders developer mode” that lets any engineer deploy features to the whole company for immediate feedback.
They use a tiered release system: internal builds, a “next” channel for raw experimental features, and the production release (where A/B tests must not create a “comically bad” experience).
Surprisingly, one of their biggest internal power users is a non-developer in partnerships who builds internal apps (like quoting tools) that would otherwise cost six figures in SaaS subscriptions.
This suggests a new category: domain experts who aren’t developers building custom internal tools for stateless, non-business-critical workflows.

Which SaaS products will and won’t be replaced

Complex SaaS platforms like Workday and Salesforce won’t be replaced—they encapsulate business workflows, compliance requirements, and critical state.
Simple, stateless business tools are vulnerable: companies can now build custom versions cheaply.
Varun sees a proliferation of software that previously couldn’t justify its own business model—tools that are useful but not complex enough to sustain a standalone company.
This could create new opportunities for developers who specialize in building and maintaining these internal tools.

How AI changes the ROI equation for software

Varun pushes back on predictions that AI will reduce the number of software engineers:
- When the cost of building software drops, the ROI of each developer goes up, so companies should build more, not less.
- Technology increases the ceiling of what a company can achieve.
- The demand for software is rising, and expectations for quality are higher—companies that cut engineering headcount will fall behind competitors who leverage AI to move faster.
He draws an analogy to writers: after initial layoffs, companies are now hiring writers again because AI-assisted writing with a skilled human outperforms AI alone.

How engineering work has changed at Windsurf

Engineers are more fearless about jumping into unfamiliar parts of the codebase.
Developers now consult AI before making changes, whereas in the autocomplete era they would start typing and get suggestions passively.
The active AI (Cascade) has shifted behavior more than passive autocomplete did.

The mental fatigue of software engineering

Varun highlights how software engineering is uniquely mentally taxing: engineers carry incomplete problems home, lose sleep over failing tests, and experience a cumulative fatigue from constant context-switching and problem-solving.
AI tools can reduce the “activation energy” required for tedious but necessary tasks (finding the right dependency, setting up boilerplate, debugging configuration), which makes developers more willing to take on new work.
He argues that great engineers are fundamentally problem solvers who can distill business needs into technical solutions—a skill that will remain in demand even as AI handles more implementation.

Forking VS Code and supporting JetBrains

Windsurf forked Code OSS (the open-source base of VS Code), not VS Code itself, and deliberately avoided Microsoft’s proprietary extensions.
This forced them to build their own implementations of common needs (Python language servers, remote SSH, dev containers), which tightened their product.
They chose to fork rather than build from scratch because developers have established workflows and extension ecosystems they don’t want to reinvent.
For JetBrains, they built a plugin instead of forking because JetBrains IDEs have excellent language servers and debuggers that are best in class for Java and other languages.
They built a shared binary called a language server that does the heavy lifting for both the VS Code fork and the JetBrains plugin, avoiding code duplication and enabling support for many IDEs with minimal additional effort.

Varun’s take on MCP (Model Context Protocol)

MCP is exciting because it democratizes access to internal services within the coding environment, and companies can implement their own MCP servers with security guarantees.
His concern: MCP may be too free-form. Current models are capable of zero-shot integration with services like Notion, but MCP servers often constrain access arbitrarily.
He sees a tension between giving models enough access to be productive and implementing proper granularity (e.g., allowing access to one database table but not others).
He draws an analogy to C#’s access modifiers (public, private, protected, internal): primitives that evolved over decades to manage complexity in large codebases. MCP may need similar concepts.
He thinks the industry is still figuring out the right model, and post-MCP-server engineering (access controls, user context, service understanding) will be critical.
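The kind of granularity discussed above (one table allowed, others not) could look like the following policy check placed in front of a tool call. This sketch does not use any real MCP SDK; `ToolPolicy`, `run_query`, and the table names are hypothetical and only illustrate the access-control idea.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolPolicy:
    """Per-user allowlist of (resource, action) pairs; a hypothetical schema."""
    allowed: frozenset[tuple[str, str]] = field(default_factory=frozenset)

    def permits(self, resource: str, action: str) -> bool:
        return (resource, action) in self.allowed


def run_query(policy: ToolPolicy, table: str, action: str) -> str:
    """Check the policy before touching the backing service."""
    if not policy.permits(table, action):
        raise PermissionError(f"{action} on {table} is not allowed")
    return f"{action} {table}: ok"  # placeholder for the real database call
```

A policy such as `ToolPolicy(frozenset({("orders", "read")}))` lets the model read the `orders` table while any request against `customers` raises `PermissionError`, so the deny decision is made by the server rather than left to the model.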

Vibe coding and the last 30% problem

Non-developers can get 70% of the way with AI tools but often get stuck on the last 30%, especially when the AI enters a degenerate state making a series of nonsensical changes.
Developers can debug by examining the code and reverting to a working state; non-developers feel helpless.
Varun doesn’t think the solution is to cater the product entirely to non-developers, but principles from helping them (better error recovery, more autonomous debugging) benefit all users.
He believes some level of code literacy will remain valuable—the ability to “peel back layers of abstraction” to understand what’s happening at the OS, networking, or database level is what makes great developers great.
He expects a proliferation of software built by non-developers for simple use cases, while complex systems will still require engineers who can reason about the full stack.

The future of software engineering jobs

Varun is less scared of predictions like “90% of code will be AI-generated” because developers do more than write code—they solve problems, collaborate, and design systems.
He thinks AI will let engineers focus on the parts they enjoy and eliminate tedious details that cause mental fatigue.
For worried engineers at B2B SaaS companies: he believes employers who cut headcount for short-term efficiency are being shortsighted, because competitors using AI to build better products faster will win in a competitive market.
The demand for software is increasing, and expectations are rising—companies need more, better software, not less engineering talent.

Rapid fire

Endurance sports while working hard: He reduced his cycling since starting the company, but previously biked 150+ miles per week by using an indoor trainer (Zwift) at home to minimize friction, and doing long rides on weekends. His advice: lower the barrier to entry by making the activity as convenient as possible.
Book recommendation: The Idea Factory by Jon Gertner, about Bell Labs and how it balanced commercial goals with groundbreaking innovation (information theory, the transistor). Varun finds it inspiring for thinking about how research organizations can drive both scientific and business value.
