David Luan: DeepSeek’s Significance, What’s Next for Agents & Lessons from OpenAI — Unsupervised Learning

David Luan is a veteran AI leader who was VP of Engineering at OpenAI during its formative breakthroughs, co-founded and led Adept (an AI agent startup that raised over $400M), and now heads the AGI Lab at Amazon. He and host Jacob Effron have been friends for over a decade, and this conversation covers DeepSeek’s significance, the future of AI agents, lessons from OpenAI’s culture, and what’s required to reach AGI.

DeepSeek: What People Got Right and Wrong

The market initially overreacted to DeepSeek R1, with Nvidia’s stock crashing and narratives framing it as bad for OpenAI and Anthropic.
David’s immediate reaction was that people missed the memo: making intelligence cheaper doesn’t reduce consumption — it increases it.
The market eventually corrected back to this more rational view.
DeepSeek’s base model appears to have been trained partly on outputs from OpenAI models (distillation), raising questions about whether labs will become less public with their most capable models going forward.
David expects labs to increasingly train massive “teacher” models on huge compute, then distill them internally into efficient versions for customers, rather than releasing the largest models publicly.
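
As a rough illustration of the teacher-to-student idea above, the sketch below computes a classic soft-label distillation loss for a single token position. This is an assumption chosen for illustration: the conversation describes training on model outputs and does not specify any lab's actual recipe. The function names, the temperature value, and the toy logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions for one token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a three-token vocabulary (made-up values).
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # close to the teacher
poor_student = [0.5, 1.0, 4.0]   # prefers the wrong token

print("good student loss:", distillation_loss(teacher, good_student))
print("poor student loss:", distillation_loss(teacher, poor_student))
```

The student that tracks the teacher's distribution gets the lower loss. In a real pipeline this quantity would be minimized by gradient descent over many positions, which the sketch leaves out.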

The Path to AGI: Combining LLMs with RL

David has been articulating a recipe for AGI since at least 2020. Next-token prediction alone won’t get there, because LLMs are penalized for discovering new knowledge that is not in their training data.
The key insight is combining LLMs (which encode what humanity already knows) with RL/search (which can discover new knowledge), as demonstrated by AlphaGo.
RL alone fails because starting from scratch (rediscovering language, coordination, and so on) would take forever.
This philosophy has been validated by the success of models combining both paradigms, like DeepSeek R1 and OpenAI’s o1.
For domains that aren’t easily verifiable (healthcare, law), David believes these models generalize better than people think.
The fundamental arbitrage is that models are better at judging whether they did a good job than at generating the right answer — RL exploits this by forcing iteration until the model satisfies its own internal verifier.
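
The gap between judging and generating can be shown with a toy problem where checking an answer is cheap but producing one is hard. The sketch below is only an analogy: a blind random guesser stands in for the model, a divisibility check stands in for the verifier, and every name in it is hypothetical. It also leaves out the learning step, since real RL would update the generator toward the outputs the verifier accepts.

```python
import random

def verify(candidate: int, target: int) -> bool:
    """Cheap check: is the candidate a nontrivial divisor of the target?"""
    return 1 < candidate < target and target % candidate == 0

def propose(target: int, rng: random.Random) -> int:
    """Stand-in for a weak generator: guess blindly in the valid range."""
    return rng.randint(2, target - 1)

def iterate_until_verified(target: int, max_attempts: int = 10_000, seed: int = 0):
    """Keep proposing until the verifier accepts or the attempt budget runs out."""
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        candidate = propose(target, rng)
        if verify(candidate, target):
            return candidate, attempt
    return None, max_attempts

factor, attempts = iterate_until_verified(91)  # 91 = 7 * 13
print(f"found factor {factor} after {attempts} attempts")
```

The generator has no idea how to factor, yet the loop still returns a correct answer because the verifier is reliable. That is the asymmetry the argument relies on.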

Building AI Labs: From Alchemy to Industrialization

One of the most underappreciated breakthroughs is building organizations and processes that can reliably train models — shifting from “alchemy” to industrialization.
A modern AI lab’s job isn’t to build models; it’s to build a factory that reliably turns out models.
Engineering — massive reliable clusters, fault tolerance, efficient compute — has been as important as algorithms in driving progress.
The next frontier involves distributed data centers doing inference and RL on customer-specific environments, sending learned improvements back to a central model.
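
One simple way to picture "sending learned improvements back to a central model" is a weighted average of per-site weight changes, borrowed from federated averaging. This is an assumption for illustration only. The conversation does not say how updates would actually be merged, and the function name and data layout here are hypothetical.

```python
def merge_updates(central, site_updates):
    """Apply the episode-weighted average of per-site weight deltas.

    central: list of floats, the central model's weights (toy stand-in).
    site_updates: list of (delta, num_episodes) pairs, one per site.
    """
    total = sum(n for _, n in site_updates)
    if total == 0:
        return list(central)  # nothing learned anywhere, keep weights as-is
    merged = list(central)
    for delta, n in site_updates:
        for i, d in enumerate(delta):
            merged[i] += d * n / total
    return merged

central_weights = [1.0, 2.0]
updates = [
    ([0.2, 0.0], 1),   # site A: small run, nudges the first weight
    ([0.0, -0.4], 3),  # site B: larger run, nudges the second weight
]
print(merge_updates(central_weights, updates))
```

Weighting by the number of episodes lets sites with more experience count for more. Whether that is the right choice in practice is an open design question.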

The State of AI Agents

David remains extremely excited about agents, viewing them as the critical gap between LLMs’ demonstrated promise and actual utility.
At Adept, the team had to train their own multimodal models from scratch because nothing adequate existed — likened to a 2000-era AI startup having to manufacture its own chips.
Out of the box, LLMs are “behavioral cloners” — they do what they’ve seen in training data and generalize poorly to novel situations, making them unreliable for real tasks.
Early use cases like invoice processing revealed that even a 1-in-7 failure rate (e.g., deleting QuickBooks entries) makes a product unusable.
Current agents like Operator and Claude’s computer use are impressive but still have low end-to-end reliability for complex multi-step tasks.
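
A back-of-the-envelope calculation shows why multi-step reliability is so punishing. The sketch below assumes every step succeeds independently with the same probability. That is a simplification not made in the conversation, and the function names are hypothetical. Under that assumption, a 95% per-step success rate falls to roughly 36% over a 20-step task.

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps

def required_per_step_success(target: float, steps: int) -> float:
    """Per-step success rate needed to hit an end-to-end target."""
    return target ** (1.0 / steps)

# How a 95% per-step rate decays as tasks get longer.
for steps in (1, 5, 20, 50):
    print(steps, "steps:", round(end_to_end_success(0.95, steps), 3))

# Per-step reliability needed for a 99% success rate on a 20-step task.
print("needed per step:", round(required_per_step_success(0.99, 20), 5))
```

Read the other way, a product that must succeed almost every time on long tasks needs per-step reliability far above what a good demo requires.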

How to Build a Large Action Model

Step 1 (Engineering): Expose to the model, in a legible way, what it can do — available APIs, UI elements, and domain-specific knowledge (e.g., how Expedia or SAP works).
Step 2 (Research): Teach the model to plan, reason, replan, follow instructions, and infer user intent. This is fundamentally different from standard LLM work because it involves multi-step decision-making, backtracking, and predicting consequences of actions.
The training recipe mirrors a textbook: pre-training is exposition, supervised fine-tuning is sample problems, and RL is the open-ended problems at the back.
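
Step 1 above can be sketched as a small registry that describes each available action in plain text a model could read. This is a minimal sketch rather than Adept's or anyone else's implementation. The class names, the `search_flights` action, and the description format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Action:
    name: str
    description: str
    parameters: Dict[str, str]  # parameter name -> human-readable description
    handler: Callable[..., str]

class ActionRegistry:
    """Holds the actions an agent may take and renders them legibly."""

    def __init__(self) -> None:
        self._actions: Dict[str, Action] = {}

    def register(self, action: Action) -> None:
        self._actions[action.name] = action

    def describe(self) -> str:
        """Render one line per action, suitable for inclusion in a prompt."""
        lines = []
        for action in self._actions.values():
            params = ", ".join(f"{k}: {v}" for k, v in action.parameters.items())
            lines.append(f"{action.name}({params}) - {action.description}")
        return "\n".join(lines)

    def execute(self, name: str, **kwargs) -> str:
        """Run a named action, rejecting anything that was never registered."""
        if name not in self._actions:
            raise KeyError(f"unknown action: {name}")
        return self._actions[name].handler(**kwargs)

def search_flights(origin: str, destination: str) -> str:
    # Hypothetical stand-in for a real travel-site API call.
    return f"searched flights {origin} -> {destination}"

registry = ActionRegistry()
registry.register(Action(
    name="search_flights",
    description="Look up flights between two airports.",
    parameters={"origin": "IATA code", "destination": "IATA code"},
    handler=search_flights,
))

print(registry.describe())
print(registry.execute("search_flights", origin="SFO", destination="JFK"))
```

The registry covers only the engineering half. Deciding which action to call, and when to back up and replan, is the research problem described in Step 2.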

Interfaces and Human-AI Interaction

David is frustrated by the lack of creativity in how people interact with increasingly capable AI — chat is a low-bandwidth, limiting interface.
He envisions agents that synthesize multimodal user interfaces on the fly to best elicit what they need from users, creating shared context rather than turn-based conversation.
The interaction model should shift from “perpendicular” (chatting at each other) to “parallel” (working together on a screen).
Future interfaces will include command line, GUI, voice, ambient computing, and AI-generated UIs — the metric that matters is leverage per unit of human energy spent.

When Will Agents Be Reliable?

David compares the current moment to the 2005 DARPA Grand Challenge for self-driving cars: the demos were impressive, but full deployment was still decades away.
He believes agents won’t take nearly as long because the right tools now exist.
His key milestone: a recipe where he can give an agent any task during training, come back days later, and it performs at 100% reliability — not incremental improvement, but mastery.

Startups vs. Labs in the Agent Space

David has significant uncertainty but believes AGI (defined as a model that can do anything useful a human does on a computer, and learn it as fast as a human) is not far away.
However, diffusion through society will lag significantly due to a “capability overhang” — the bottleneck becomes people, processes, social acceptance, and co-designing interfaces, not the technology itself.
He believes startups will play a crucial role in bridging the gap between raw model capabilities and what end users actually want, because owning customer relationships and understanding needs matters more than controlling the underlying model.

Specialized Models: Policy, Not Technology

David expects specialized models (e.g., for legal or financial domains) to exist primarily for policy reasons — data separation requirements between companies or divisions (e.g., investment banking vs. sales and trading at a bank) — rather than technical necessity.

Scaling and Remaining Technical Challenges

David doesn’t believe simply scaling today’s approach with more compute will automatically solve everything.
His confidence comes from assessing remaining open problems and judging them to be solvable — nothing on the order of “replace gradient descent” or “require quantum computing” seems necessary.
He pays attention to alternative architectures primarily for how they map model learning to compute efficiency.
He spends significant time on data centers and chips, finding the hardware landscape fascinating.

Data Labeling in the Test-Time Compute Era

Data labeling still serves two important roles: (1) teaching models the basics of how to do a task via behavioral cloning, and (2) teaching models what good and bad look like for fuzzy tasks.
The middle ground — spamming human labels to marginally improve models that can already do a task — will increasingly be handled by RL.

What David Changed His Mind On

Team culture: He’s become even more convinced that hiring smart, energetic, intrinsically motivated people early in their careers is one of the best engines for progress, because the optimal playbook changes every few years and people who have overfit to the previous playbook slow you down.
Compounding technical differentiation: He used to believe that being best at text modeling would compound into winning at multimodal, reasoning, and agents. In practice, he’s seen very little compounding — labs are pursuing relatively similar ideas, and being first at one breakthrough doesn’t deterministically lead to winning the next.

Robotics

David believes digital agents serve as “training wheels” for physical agents — solving reliability in simulation or digital space transfers to the physical world.
He’s skeptical of both extremes: neither the “scaling laws will solve everything immediately” view nor the “we’re stuck in 1995” view.
His confidence comes from the ability to build training recipes that achieve 100% task performance in digital space, which should transfer to physical space over time.
World models and video generation are exciting because they address the problem of learning when you don’t have an explicit verifier or simulator.

OpenAI Culture: What Made It Special

David joined OpenAI in 2017 when it was ~35 people, tasked with blurring the lines between research and engineering.
The key cultural insight was recognizing that the era of “you and your three friends write a paper that changes the world” was over — solving major scientific goals required bigger combined teams of researchers and engineers, regardless of whether solutions were academically “novel.”
This sometimes drew criticism (e.g., GPT-2 “just” being a Transformer), but they took pride in executing well-known ideas at scale.
The team was full of incredibly motivated people early in their careers without PhDs or decades of experience — people like Alec Radford and the inventor of DALL-E — driven by intrinsic motivation and intellectual flexibility.
He shares an anecdote of a researcher so focused on experiments that he never set up Wi-Fi or electricity in his apartment, spending all his time at the office.

Google and the GPT Breakthrough

Credit goes to Ilya Sutskever, who recognized the Transformer paper’s importance early and pushed people to try experiments with it across architectures.
David implies it was difficult for Google to coalesce as a full organization around this breakthrough, whereas smaller, more focused teams could move faster.

Nvidia’s Underappreciated Strengths

Beyond the well-known bets on GPUs and CUDA, David highlights Nvidia’s decision to bring interconnect in-house and orient the entire business around systems (not just chips) as hugely important moves that have paid off enormously.

Amazon’s Role in AGI

Amazon is serious about building generally intelligent agents, with leadership understanding that the fundamental compute primitive is shifting from traditional operations to calls to large models and agents.
David and Pieter Abbeel have launched a new San Francisco-based research lab focused on making the remaining research breakthroughs needed for AGI.
The breadth of agent applications across a company as large as Amazon is a unique advantage.

Evaluations and Benchmarks

David evaluates new models by examining their methodology: when a simpler approach yields better results, it usually becomes part of the deep learning canon.
He’s skeptical of public benchmarks, which have become gamed and misaligned with what people actually need.
He believes measurement and evaluation deserve far more prestige and attention than they currently receive.

Quickfire

Model progress this year vs. last: It looks about the same from the outside, but the actual progress is greater.
Overhyped: “Scale is dead” narratives.
Underhyped: Solving extremely large-scale simulation for models to learn from.
Where to follow David’s work: The Amazon SF Lab, and he plans to return to Twitter (@dluan).
