Linus Lee, an AI engineer at Notion, discusses how Notion rapidly shipped AI features into production, including AI Writer, AI Autofill, and the newly launched Q&A feature. He shares insights on team structure, product development philosophy, evaluation challenges, model partnerships, and user behavior patterns.
Notion AI Product Suite
AI Writer (launched February 2023): The first Notion AI feature, born from a hackathon where co-founders Ivan and Simon experimented with GPT-3. It helps users summarize pages, extract action items, fix spelling/grammar, improve writing style, and draft marketing copy. A key insight was that AI and humans work together iteratively, so conversational follow-up prompts (e.g., “make it shorter” or “make it more punchy”) were added.
AI Autofill (launched May 2023): Brings AI to Notion databases (relational tables). Users can auto-fill entire columns—for example, extracting key topics from meeting notes or pulling core customer needs from interview transcripts. It launched alongside Notion’s broader project management suite.
Notion Q&A (launched December 2023): A chat-based feature that draws on the content of a user’s workspace and can answer questions that span multiple pages. It was motivated by the information-finding problem that grows as organizations and workspaces scale, combined with the broader AI community’s progress on retrieval-augmented generation (RAG).
Team Structure and Development Philosophy
The AI team at Notion has about a dozen people, roughly split between those focused on data/model quality (correctness, coherence of outputs) and those focused on product concerns (interface, integration into Notion). A couple of designers collaborate closely, especially during ideation.
Development follows an “expand and contract” cycle: start with broad problem statements (e.g., “help people find information”), prototype multiple approaches quickly, dogfood internally, then converge on the most promising direction for a polished ship.
In the early phase, speed of iteration matters most. The Q&A prototype was rolled out internally in an intentionally annoying way—triggering on every question asked in the company—which forced rapid quality improvements because everyone was confronted with the outputs daily.
Once a solution is better defined, the process resembles traditional product building with user research, design collaboration, and structured iteration.
Ownership varies by feature: the core AI team owns AI Writer and Q&A (since they have distinct AI surfaces), while the databases team owns Autofill (deeply integrated with databases), with tight collaboration from the AI team on hard problems.
Notion is still figuring out the long-term organizational model—whether to keep a centralized AI “hub” team, embed AI engineers in every product team, or some hybrid. Linus believes there will always be value in a centralized team handling monitoring, quality assurance, data management, and training infrastructure, similar to how DevOps evolved.
Foundational AI Architecture
Retrieval is increasingly seen as a foundational capability, not just a feature-specific tool. Getting retrieval right augments every AI capability—even those that don’t output natural language to the user, such as surfacing related information or recent document changes.
This suggests a horizontal platform approach where one retrieval system serves multiple AI features, rather than building retrieval independently for each product.
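The horizontal approach can be illustrated with a small sketch in which two features share one retrieval index. This is not Notion's implementation: `Page`, `SharedRetriever`, `qa_context`, and `related_pages` are hypothetical names, and bag-of-words cosine similarity stands in for whatever embedding and ranking stack a real system would use.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    page_id: str
    text: str


def _vectorize(text):
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SharedRetriever:
    """One index, built once, queried by every AI feature."""

    def __init__(self, pages):
        self._index = [(page, _vectorize(page.text)) for page in pages]

    def search(self, query, k=3):
        q = _vectorize(query)
        scored = [(_cosine(q, vec), page) for page, vec in self._index]
        scored = [(score, page) for score, page in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [page for _, page in scored[:k]]


def qa_context(retriever, question):
    # Feature 1 (hypothetical): gather context for a generated answer.
    return "\n".join(page.text for page in retriever.search(question, k=2))


def related_pages(retriever, page):
    # Feature 2 (hypothetical): no natural-language output, same retriever.
    hits = retriever.search(page.text, k=3)
    return [hit.page_id for hit in hits if hit.page_id != page.page_id]
```

Because both features call the same `search` method, an improvement to ranking or indexing benefits each of them at once, which is the payoff the platform framing points to.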
User Behavior and Education
Notion dogfoods heavily, but internal usage isn’t fully representative—Notion’s own team uses every feature extensively, while external users range from students to individuals to teams, with widely varying patterns.
Early testers and “Notion ambassadors” often discover unanticipated use cases. For example, after Autofill launched, international users heavily used the custom prompt field for translation, which Notion then added as a built-in prompt.
Notion follows a pattern of offering both pre-built, engineered prompts and fully customizable prompts. The custom prompts help discover key use cases, which are then productized into pre-built templates over time.
Usage follows a power law: summarization, grammar/writing improvement, and translation are the most popular pre-built use cases. A huge portion of total token usage comes from users iterating on model outputs using the revision prompt.
Power users typically start with pre-built prompts, get inspired, then iterate toward custom prompts they reuse repeatedly (e.g., a newsletter writer who built a template to generate social media meta images from newsletter issues).
The “blank canvas” problem is real: users struggle to know what to ask AI. Pre-built prompts and suggested next actions (e.g., offering revision options after a draft is generated) significantly lower the barrier.
Challenges in Building Q&A
Evaluation is the hardest problem. Unlike writing (where many outputs are acceptable), Q&A is more black-and-white—there’s a specific correct answer, and the model can fail in many ways (wrong document, wrong answer, hallucination).
Edge cases are abundant and hard to anticipate. Users ask meta-questions about Notion itself (e.g., “how do I share this page?”), temporal questions (e.g., “what is the marketing team working on this week?”—no single document contains this), and questions that require synthesizing updates over time.
Building high-quality evaluation sets for each category of edge case, defining grading criteria, and finding ways to remedy failures was a massive undertaking with many sub-components.
Operational questions around privacy, security, and scale lacked clear industry answers, requiring Notion to reason from first principles and go directly to customers.
Evaluation and Tooling
Notion built the vast majority of its LLM evaluation and development tools in-house. This was partly because off-the-shelf tools didn’t exist when they started, and partly because Notion documents are structurally complex (rich text, tables, images, columns, metadata like tasks and due dates) and don’t map cleanly to plain text.
In-house tools allow faster iteration—for example, adding a third model column to a comparison view takes minutes rather than waiting for a vendor.
Evaluation exists on a spectrum:
Low-cost, high-frequency: Deterministic programmatic checks and model-graded evaluations.
Mid-cost: Human annotators for specific tasks.
High-cost, high-value: ML engineers manually examining model outputs to understand why models fail (e.g., confusion about relative vs. absolute dates, instruction-following failures).
The most valuable insights come from engineers sitting down with actual outputs and failure cases, not just aggregate numeric scores.
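The low-cost end of this spectrum can be sketched as a set of deterministic checks run over every sampled answer. The checks below are illustrative assumptions rather than Notion's actual criteria: `Sample`, the check names, and the relative-date pattern are all hypothetical, chosen to echo the failure modes mentioned above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Sample:
    question: str
    answer: str
    cited_ids: list = field(default_factory=list)
    retrieved_ids: list = field(default_factory=list)


# Hypothetical pattern for relative dates that an answer should resolve.
RELATIVE_DATE = re.compile(
    r"\b(today|tomorrow|yesterday|next week|last week)\b", re.IGNORECASE
)


def check_nonempty(sample):
    return bool(sample.answer.strip())


def check_citations_grounded(sample):
    # Every cited page must be one the retriever actually returned.
    cited = set(sample.cited_ids)
    return bool(cited) and cited <= set(sample.retrieved_ids)


def check_no_relative_dates(sample):
    return RELATIVE_DATE.search(sample.answer) is None


CHECKS = {
    "nonempty": check_nonempty,
    "citations_grounded": check_citations_grounded,
    "no_relative_dates": check_no_relative_dates,
}


def run_checks(sample):
    return {name: check(sample) for name, check in CHECKS.items()}


def failure_rates(samples):
    # Fraction of samples failing each check.
    if not samples:
        return {name: 0.0 for name in CHECKS}
    failures = {name: 0 for name in CHECKS}
    for sample in samples:
        for name, passed in run_checks(sample).items():
            if not passed:
                failures[name] += 1
    return {name: count / len(samples) for name, count in failures.items()}
```

Aggregate rates like these are cheap to compute on every change, but they only show where failures cluster. Understanding why a model failed still requires reading the failing samples themselves.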
Notion iterates on the full stack—embedding, ranking, retrieval, prompt engineering, and answer generation—depending on where the pipeline breaks down.
Model Strategy: Partnerships with Anthropic and OpenAI
Notion has strong partnerships with both Anthropic and OpenAI. It does not compete at the infrastructure or foundational model level, recognizing that companies like Anthropic, OpenAI, and Google are far ahead in model training and hosting at scale.
Notion’s role is understanding its specific tasks deeply: defining what good outputs look like, building evaluation datasets, and generating synthetic data that reflects the archetypal structure of Notion workspaces.
Notion has committed to not training on customer data, which pushes the team to be more systematic about synthetic data generation and understanding workspace archetypes.
Model selection is feature-specific, based on capability, throughput needs, cost, and provider. For example, Autofill (which runs batch in the background when pages change) requires a model that supports high throughput, while Q&A may prioritize answer quality.
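One way to picture feature-specific selection is as a filter on hard requirements followed by a preference. The sketch below is an assumption-laden illustration: the model names, quality scores, throughput figures, and prices are invented, and `ModelOption`, `FeatureNeeds`, and `choose_model` are hypothetical helpers, not anything Notion has described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    quality: float  # score from an internal eval set, 0 to 1 (made up)
    tokens_per_second: int  # sustained throughput (made up)
    cost_per_million_tokens: float  # made up


@dataclass(frozen=True)
class FeatureNeeds:
    min_quality: float
    min_tokens_per_second: int
    prefer: str  # "quality" or "cost"


def choose_model(options, needs):
    # Drop models that miss a hard requirement, then apply the preference.
    eligible = [
        option
        for option in options
        if option.quality >= needs.min_quality
        and option.tokens_per_second >= needs.min_tokens_per_second
    ]
    if not eligible:
        raise ValueError("no model satisfies the feature's requirements")
    if needs.prefer == "quality":
        return max(eligible, key=lambda option: option.quality)
    return min(eligible, key=lambda option: option.cost_per_million_tokens)


OPTIONS = [
    ModelOption("large-model", 0.90, 50, 30.0),
    ModelOption("fast-model", 0.70, 400, 1.0),
]

# Background batch work favors throughput and cost; chat favors quality.
AUTOFILL = FeatureNeeds(min_quality=0.6, min_tokens_per_second=200, prefer="cost")
QA = FeatureNeeds(min_quality=0.85, min_tokens_per_second=20, prefer="quality")
```

With these invented numbers, `choose_model(OPTIONS, AUTOFILL)` picks `fast-model` and `choose_model(OPTIONS, QA)` picks `large-model`, mirroring the throughput-versus-quality trade-off described above.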
Notion has explored open source models for some use cases (e.g., embedding models) but has not shipped any open source models to production yet.
Prompt Engineering and Model Switching
Prompts are downstream of evaluation criteria, which are downstream of task understanding. When the task is well-understood (e.g., what makes a good summary for a meeting note vs. a bug report), the core instructions transfer well across models.
Per-model tweaks are often minor (e.g., telling a model not to say “in this document” or using all caps for emphasis). The bulk of the work is defining task criteria and output format.
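The split between shared task instructions and small per-model adjustments can be sketched as follows. The task text, model names, and tweak strings are hypothetical examples, and `build_prompt` is an illustrative helper rather than a description of Notion's prompt code.

```python
# Core instructions capture the task criteria and output format.
# They are written once and reused for every model.
CORE_INSTRUCTIONS = {
    "summarize_meeting": (
        "Summarize the meeting notes below. List decisions first, "
        "then action items with owners."
    ),
}

# Per-model tweaks stay small and live apart from the task definition.
MODEL_TWEAKS = {
    "model-a": ["Do not use phrases such as 'in this document'."],
    "model-b": ["IMPORTANT: RETURN PLAIN TEXT ONLY."],
}


def build_prompt(task, model, document):
    parts = [CORE_INSTRUCTIONS[task], *MODEL_TWEAKS.get(model, []), "", document]
    return "\n".join(parts)
```

Under this layout, switching providers means adding or editing an entry in `MODEL_TWEAKS`, while the task definition in `CORE_INSTRUCTIONS` stays untouched.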
Large models generally transfer well across languages, though they perform best in English. Notion prototypes in English, then builds multilingual evaluation sets and adds few-shot examples or training to bolster non-English performance.
Q&A has a notable cross-lingual capability: a user can ask “what sales did the Japan team make this week?” in English, and the model will read Japanese documents and answer in English without an explicit translation layer.
Interface Design Philosophy
Notion’s approach to AI interfaces mirrors its general product philosophy: give users the right level of abstraction. Users can always drop down to near-raw prompts (for power users), but the default experience should guide them.
The “blocks for AI” question is still open—what are the right abstractions? For writing, free-form custom prompts work well. For generative UI (models outputting interactive elements), a component library or UI language would be needed so outputs are coherent with the product’s design.
Notion is exploring generative UI and models that can perform actions (similar to Adept’s approach with domain-specific languages for browser actions), but hasn’t committed to a specific direction.
Over-hyped and Under-hyped in AI
Over-hyped: Context length. Linus struggles to imagine truly useful tasks requiring 50,000+ words of context. Packing more data into context introduces noise, and filtering/retrieval remains essential. Most useful tasks can likely be handled with 16k–32k tokens. Training long-context models is also challenging because naturally occurring datasets that require full-context utilization are rare.
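The point that filtering stays essential even with larger windows can be illustrated by packing ranked chunks into a fixed token budget. This is a minimal sketch under stated assumptions: `estimate_tokens` counts whitespace-separated words as a crude proxy for a real tokenizer, and `pack_context` is a hypothetical helper.

```python
def estimate_tokens(text):
    # Crude proxy: a real system would use the model's own tokenizer.
    return len(text.split())


def pack_context(ranked_chunks, budget):
    """Greedily keep the highest-ranked chunks that fit within the budget.

    ranked_chunks must already be sorted from most to least relevant.
    """
    picked, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue  # skip chunks that do not fit, but keep trying smaller ones
        picked.append(chunk)
        used += cost
    return picked
```

The budget would be set well below the model's maximum window, on the view that a smaller, cleaner context beats a larger, noisier one.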
Under-hyped: Alternative architectures to Transformers. Transformers are excellent at modeling long sequences efficiently, but nothing about the architecture is provably optimal. Linus believes a new architecture could emerge that is equally good at sequence modeling and more efficient. State-space models are one area of interest. Previous architectures like LSTMs were terrible at using long context, but now that we understand why Transformers work well, we’re better positioned to find alternatives.
Biggest Surprise: Power of General Approaches
Linus has been surprised by how well general-purpose approaches work across multiple tasks. Rather than building separate prompts and models for each task (writing, Q&A, onboarding help), there are long-term payoffs in training a single model on all tasks—it develops a better overall understanding of the product domain.
Similarly, for generative UI, giving the AI more control over what appears on the screen is a harder ML problem but likely more rewarding long-term than hardcoding interfaces powered by AI.
Given how quickly models are improving, betting on generality is a strong strategy.
Where Else Would He Work on AI
Midjourney is the company Linus would most want to work at if not at Notion. He admires their lab-like structure where individuals champion specific ideas and push prototypes forward independently—a “garden” approach that yields productive outcomes because individuals iterate much faster than teams.
Adept is also interesting for tackling the ambitious problem of a general agent that can perform any computer task.
Debrief and Additional Insights
Notion’s heavy dogfooding culture is a major advantage for AI development—internal usage provides rapid feedback on quality and utility, and having a large, engaged user base accelerates learning about unexpected use cases.
Notion built its own evaluation tooling because off-the-shelf solutions didn’t handle Notion’s complex document structures, and in-house tools allow faster iteration. The highest-value evaluation practice is engineers manually examining failure cases to understand root causes.
Notion outsources model infrastructure entirely to Anthropic and OpenAI, focusing its own engineering on task understanding, data, and evaluation. This makes sense for an application company rather than an infrastructure company.
The rapid cost declines from foundation model providers (e.g., OpenAI’s 3x price cuts) make it hard to predict how long something that is cost-prohibitive today will stay that way. The advice is to build anyway, since costs will likely continue falling.
Notion AI is considered one of the most compelling consumer-facing LLM applications, alongside GitHub Copilot.