The Infrastructure Company Powering the Top AI Apps — Unsupervised Learning

Turbopuffer is a high-performance vector database and search engine built on object storage (S3/GCS), powering AI applications at companies like Cursor, Notion, and Linear. CEO Simon Eskildsen, a decade-long Shopify infrastructure veteran, explains why the explosion of AI workloads demands a new search paradigm, how Turbopuffer’s architecture works, and what problems remain unsolved in vector search.

Context windows are not enough: Even with million-token context windows, companies need to search tens or hundreds of millions of tokens while enforcing permissions, keeping recall high, and keeping latency low. Stuffing everything into context is economically and technically infeasible.
SCRAP framework for evaluating when you need a dedicated search database:
- Scale: Datasets will always outgrow context windows; even an AGI-level model would build an index rather than scan everything.
- Cost: Storing data in VRAM is extremely expensive (~$5/GB); object storage is orders of magnitude cheaper.
- Recall: Models struggle with needle-in-the-haystack tasks over very large corpora; there are few good benchmarks for reasoning over book-scale datasets.
- ACLs: No one trusts LLMs to enforce document-level permissions reliably; pre-filtering via a database is the current standard.
- Performance: Users expect sub-second responses, which are difficult to deliver when every request has to load a massive context window.
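
The ACL point can be illustrated with a minimal sketch of pre-filtering: permissions are applied before any ranking, so documents a user cannot read never become candidates. The corpus, the field names (`vec`, `allowed`), and the `search` function are hypothetical and are not Turbopuffer's API.

```python
import math

# Hypothetical in-memory corpus: each document carries an embedding and an ACL.
DOCS = [
    {"id": "roadmap", "vec": [0.9, 0.1], "allowed": {"alice", "bob"}},
    {"id": "salaries", "vec": [0.8, 0.2], "allowed": {"alice"}},
    {"id": "handbook", "vec": [0.1, 0.9], "allowed": {"alice", "bob", "carol"}},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, user, k=2):
    """Return the ids of the k most similar documents the user may read."""
    # Pre-filter: drop unreadable documents before any ranking happens.
    visible = [d for d in DOCS if user in d["allowed"]]
    ranked = sorted(visible, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]

print(search([1.0, 0.0], "bob"))    # "salaries" can never appear for bob
print(search([1.0, 0.0], "alice"))
```

Because the filter runs inside the search layer, the model downstream only ever sees text the user was allowed to read, and it does not need to be trusted to enforce permissions itself.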

Three infrastructure primitives converged to make this architecture viable:
1. NVMe SSDs (available ~2016–2017): Deliver near-DRAM bandwidth at 1/100th the cost, but require new storage engine designs (e.g., io_uring, bypassing the Linux page cache).
2. S3 strong consistency (December 2020): Guarantees that a read immediately after a write returns what was written—critical for database correctness.
3. Compare-and-swap on S3 (late 2024): Enables distributed metadata coordination without a separate ZooKeeper/Raft layer; GCP had this earlier, which is why Turbopuffer started there.
Core trade-off: Writes commit to S3 with ~100–200ms P90 latency. This is unacceptable for transactional systems (e.g., Shopify checkout) but fine for search engines where slight staleness is tolerable.
Upside: Extremely low storage cost, high durability, and operational simplicity—Turbopuffer’s only stateful dependency is object storage, so nodes can be destroyed and rebuilt without data loss.
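
To make the compare-and-swap primitive concrete, here is a minimal sketch of optimistic metadata updates against an in-memory stand-in for an object store. Everything in it is an assumption for illustration: `FakeObjectStore`, `put_if_version`, and the `manifest.json` layout are hypothetical, and real object stores express the precondition differently (for example, conditional request headers on S3 or generation preconditions on GCS).

```python
import json

class PreconditionFailed(Exception):
    """Raised when the stored version no longer matches the expected one."""

class FakeObjectStore:
    """In-memory stand-in for an object store with conditional writes."""

    def __init__(self):
        self._objects = {}  # key -> (version, body bytes)

    def get(self, key):
        # Missing keys report version 0 and no body.
        return self._objects.get(key, (0, None))

    def put_if_version(self, key, body, expected_version):
        # Compare-and-swap: write only if nobody else wrote in between.
        current_version, _ = self._objects.get(key, (0, None))
        if current_version != expected_version:
            raise PreconditionFailed(key)
        self._objects[key] = (current_version + 1, body)
        return current_version + 1

def append_segment(store, segment, key="manifest.json", max_retries=5):
    """Add a segment name to the manifest, retrying on lost races."""
    for _ in range(max_retries):
        version, body = store.get(key)
        manifest = json.loads(body) if body else {"segments": []}
        manifest["segments"].append(segment)
        try:
            return store.put_if_version(key, json.dumps(manifest).encode(), version)
        except PreconditionFailed:
            continue  # another writer won; re-read and try again
    raise RuntimeError("too much contention on manifest")

store = FakeObjectStore()
append_segment(store, "seg-1")
append_segment(store, "seg-2")
print(json.loads(store.get("manifest.json")[1]))
```

The read-modify-write loop is what replaces a separate coordination service in this sketch: two writers can race, but only one conditional write succeeds and the loser simply retries against the newer manifest.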

Cursor: Indexes entire codebases (sometimes multiple, very large ones) into encrypted vectors, enabling semantic search over code via RAG—e.g., “what function formats a number?”
Notion: Powers Q&A over internal wikis and documents, handling the gap between how users phrase queries and how content is written (e.g., “red dress” → “burgundy skirt”).
Linear: Used for search and similarity—detecting duplicate issues or routing issues to the right team members.
General pattern: These are applications where the data is too large, too permission-sensitive, or too latency-sensitive to stuff into a context window.

Incremental index maintenance: Exhaustive nearest-neighbor search is O(n), so approximate nearest neighbor (ANN) indexes are used. Maintaining high recall (~95% is the customer sweet spot) as data is continuously updated is extremely hard. Turbopuffer rebuilt entire indexes early on but now incrementally maintains ANN indexes into the hundreds of millions or billions of vectors per shard.
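
The gap between exhaustive and approximate search, and what "recall" measures, can be sketched with a toy cluster-based index. This is an illustrative simplification, not Turbopuffer's index design: the centroids are just randomly sampled vectors, the names (`ann_top_k`, `nprobe`) are assumptions, and the hard part described above, keeping such an index accurate as data changes, is not shown.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_top_k(vectors, query, k):
    # Exhaustive scan: touches every vector, O(n) per query.
    return sorted(range(len(vectors)), key=lambda i: dist2(vectors[i], query))[:k]

def build_clusters(vectors, centroids):
    # Assign each vector index to its nearest centroid.
    clusters = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist2(centroids[c], v))
        clusters[nearest].append(i)
    return clusters

def ann_top_k(vectors, centroids, clusters, query, k, nprobe):
    # Scan only the nprobe clusters whose centroids are closest to the query.
    probe = sorted(range(len(centroids)), key=lambda c: dist2(centroids[c], query))[:nprobe]
    candidates = [i for c in probe for i in clusters[c]]
    return sorted(candidates, key=lambda i: dist2(vectors[i], query))[:k]

def recall_at_k(approx, exact):
    # Fraction of the true nearest neighbors that the approximate search found.
    return len(set(approx) & set(exact)) / len(exact)

rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
centroids = rng.sample(vectors, 16)
clusters = build_clusters(vectors, centroids)
query = [rng.random() for _ in range(8)]
exact = exact_top_k(vectors, query, 10)

for nprobe in (1, 4, 16):
    approx = ann_top_k(vectors, centroids, clusters, query, 10, nprobe)
    print(f"nprobe={nprobe} recall@10={recall_at_k(approx, exact):.2f}")
```

Probing more clusters raises recall at the price of scanning more vectors; probing all of them degenerates into the exhaustive scan. A recall target such as the ~95% mentioned above is a point on that trade-off curve.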
Filtering: Combining vector similarity with hard filters (e.g., “ships to Canada”) is difficult because filters can eliminate all nearby vectors in a cluster, forcing the query planner to be vector-aware and filter-aware simultaneously.
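
A tiny example shows why the filter cannot simply be bolted on after the vector search. The data is made up, embeddings are one-dimensional for readability, and `post_filter` and `pre_filter` are hypothetical names for the two naive strategies.

```python
ITEMS = [
    # (id, 1-D embedding for readability, ships_to_canada)
    ("a", 0.10, False),
    ("b", 0.12, False),
    ("c", 0.15, False),
    ("d", 0.80, True),
    ("e", 0.95, True),
]

def post_filter(query, k):
    # Rank by similarity first, then apply the filter to the top k.
    top = sorted(ITEMS, key=lambda it: abs(it[1] - query))[:k]
    return [it[0] for it in top if it[2]]

def pre_filter(query, k):
    # Apply the filter first, then rank whatever survives.
    eligible = [it for it in ITEMS if it[2]]
    top = sorted(eligible, key=lambda it: abs(it[1] - query))[:k]
    return [it[0] for it in top]

print(post_filter(0.1, k=2))  # nearest neighbors all fail the filter
print(pre_filter(0.1, k=2))   # correct, but here it scans every eligible item
```

Post-filtering returns nothing because every nearby vector is filtered out, while pre-filtering gives the right answer only by ranking all eligible items exhaustively. A planner that is aware of both the vector index and the filter has to find a path between those two extremes.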
Sharding: Sharding too early is a coping mechanism; larger shards are more efficient, so pushing single-shard scale as far as possible is a priority.

Simplicity as a core value: Turbopuffer keeps all data—including metadata—on object storage, avoiding separate metadata layers. This was initially forced by customer growth pressure but proved correct.
Reliability first: Simon spent nearly a decade on Shopify’s on-call pager and is deeply motivated to avoid waking customers up at 3am. Every design decision prioritizes operational simplicity and durability.
Focus on the core problem: Customers’ hardest problem is storing and searching petabytes of data, not choosing embedding models or running re-rankers. Turbopuffer stays focused on search performance and cost, though it advises customers on embedding model selection (favoring fast models that don’t add 300ms latency).
Gradual expansion: The team maintains a “grab bag” of ideas and is disciplined about when to pull each one in, resisting pressure to bundle end-to-end RAG solutions prematurely.

Memory in AI agents: Currently ranges from simple text file compaction within a session to lateral memory across sessions (e.g., ChatGPT memories). Some applications like Pollena involve very long conversation histories that blur the line between memory and search over prior context.
Turbopuffer’s role: Turbopuffer can serve as a simple KV store on object storage for memory use cases, queried via either vector search or keyword search. It remains an open question whether memory will require large-scale RAG or stay small enough to handle in-context.
Table stakes for SaaS AI features:
1. Semantic search (including cross-system, e.g., Linear + Slack)
2. Similarity/deduplication/recommendations
3. Report generation over data (deep research)
4. Agentic workflows (which build on 1–3)
Multimodal: Not yet widely adopted by customers, but Turbopuffer supports it. The economics of object storage make it feasible to embed images, PDFs, and other attachments without worrying about usage-based pricing shocks.

Most interesting company to run AI at: A frontier lab (OpenAI, Anthropic) to see models 3–6 months ahead.
Name origin: “Turbopuffer” made Simon happy, sounded funny, and had an unused emoji. The pufferfish metaphor (deflated = on object storage, expanded = in DRAM) was a happy accident.
Biggest mind change: Continues to be surprised that the simple architecture keeps working.
Biggest mistake: Hard to identify—early customers used every feature, and there’s survivorship bias from getting the product right.
Lesson as a founder: Trust your instincts over advice from VCs and others; the team’s collective intuition about timing has been reliable.
Question for the future: How much will agents rely on search engines (vs. context) to complete tasks?
LLMs and learning to code: Simon mourns that he didn’t have an LLM at age 11 when learning PHP; his daughter’s generation will have vastly more accessible learning tools.
