Richard Sutton – Father of RL thinks LLMs are a dead end — Dwarkesh Podcast

Richard Sutton, a foundational figure in reinforcement learning (RL) and co-recipient, with Andrew Barto, of the 2024 ACM A.M. Turing Award, argues that large language models (LLMs) are a fundamentally misguided path to artificial intelligence: they mimic human behavior rather than learning from real-world experience.
- He sees RL as the core of true intelligence: agents that act in the world, observe consequences, and learn to achieve goals through trial and error.
- LLMs, by contrast, are trained on static datasets of human-generated text and lack goals, ground truth, or the ability to learn from their own actions during deployment.

Sutton rejects the idea that next-token prediction constitutes a real goal because it does not involve influencing the external world.
- In RL, an action is “right” if it leads to reward; this provides a clear signal for learning.
- In LLMs, there is no objective standard for what constitutes a “correct” response—only what a human might say—so there is no ground truth against which to evaluate or improve.
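
The reward-as-ground-truth point can be made concrete with a toy example. The following is a minimal sketch, not something Sutton presents in the interview: a two-armed bandit in which the agent is never told which arm is "correct" and learns only from the reward its own actions produce. The function name `run_bandit`, the arm probabilities, and the `epsilon` exploration rate are all illustrative assumptions.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy action-value learning on a Bernoulli bandit.

    reward_probs, steps, epsilon and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    estimates = [0.0] * n_arms  # running estimate of each arm's mean reward
    counts = [0] * n_arms       # how often each arm has been tried

    for _ in range(steps):
        # Explore occasionally; otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # The only feedback is the reward produced by the chosen action.
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0

        # Incremental sample-average update toward the observed reward.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


if __name__ == "__main__":
    estimates, counts = run_bandit([0.2, 0.8])
    print("estimated values:", [round(v, 2) for v in estimates])
    print("pull counts:", counts)
```

No labeled example of the right answer ever appears in this loop; the agent's preference for the better arm comes entirely from acting and observing the outcome, which is the kind of learning signal Sutton says next-token prediction lacks.
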
He disputes claims that LLMs possess world models.
- A true world model predicts what will happen as a result of actions; LLMs only predict what a person would say in response.
- They are not surprised by unexpected outcomes and do not update based on real-world feedback.

Sutton challenges the notion that human learning is primarily imitative.
- Infants learn by acting—moving limbs, making sounds—and observing consequences, not by copying demonstrated behaviors.
- Sutton contends that supervised learning (learning from labeled examples) does not occur in nature: in his reading of psychology and animal cognition research, animals learn from experience, not instruction.
He acknowledges cultural transmission of complex skills (e.g., seal hunting) but views this as a thin layer atop foundational trial-and-error and predictive learning shared with other animals.
- Moravec’s paradox illustrates this: skills humans find hard (math) are easy for AI, while skills animals find easy (perception, motor control) remain challenging for AI, which suggests that current AI misses core biological intelligence.

Sutton advocates for a paradigm shift toward continual learning from experience—what he calls the “Era of Experience.”
- Intelligence should be built around a stream of sensation, action, and reward over a lifetime.
- Knowledge is about predicting consequences of actions and sequences of events in the real world.
Key components of such an agent include:
- A policy (what to do in a given situation),
- A value function (predicted long-term reward, learned via temporal-difference (TD) learning),
- A perception system (state representation),
- A transition model (how actions change the world).
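
As a rough illustration of how these four components could fit together, here is a minimal tabular sketch. It is not Sutton's reference design: the class name `ExperienceAgent`, its method names, the identity-style perception, and the `alpha` and `gamma` defaults are hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ExperienceAgent:
    """Toy tabular agent with a policy, value function, perception, and model.

    All names and defaults are hypothetical, chosen only for illustration.
    """
    actions: tuple
    # Value function: predicted long-term reward for each state.
    values: dict = field(default_factory=lambda: defaultdict(float))
    # Transition model: (state, action) -> (next_state, reward) last observed.
    model: dict = field(default_factory=dict)

    def perceive(self, observation):
        """Perception: turn a raw observation into a hashable state."""
        return tuple(observation)  # trivial encoding (assumption)

    def act(self, state, gamma=0.9):
        """Policy: choose the action with the best one-step predicted return."""
        def predicted_return(action):
            if (state, action) not in self.model:
                return 0.0  # untried actions are treated as neutral
            next_state, reward = self.model[(state, action)]
            return reward + gamma * self.values[next_state]
        return max(self.actions, key=predicted_return)

    def learn(self, state, action, reward, next_state, alpha=0.5, gamma=0.9):
        """Update the model and the value function from one step of experience."""
        self.model[(state, action)] = (next_state, reward)
        td_error = reward + gamma * self.values[next_state] - self.values[state]
        self.values[state] += alpha * td_error
        return td_error


if __name__ == "__main__":
    agent = ExperienceAgent(actions=("left", "right"))
    s, s_next = agent.perceive([0]), agent.perceive([1])
    agent.learn(s, "right", reward=1.0, next_state=s_next)
    print(agent.act(s))  # the agent now prefers the action it saw rewarded
```

The policy here is derived from the model and value function rather than stored separately; a real agent could just as well learn the policy directly, so treat that coupling as one design option among several.
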
Reward functions can be extrinsic (e.g., winning chess) or intrinsic (e.g., curiosity, understanding).
- For long-horizon tasks (e.g., building a startup), TD learning allows credit assignment by propagating delayed rewards backward through intermediate states.
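
The backward propagation of delayed reward can be seen in a small tabular TD(0) example. This is a minimal sketch under stated assumptions, not code from the interview: a deterministic chain of states in which only the final transition pays a reward. The function name `td0_chain`, the chain length, the step size `alpha`, and the discount `gamma` are illustrative.

```python
def td0_chain(n_states=5, episodes=200, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction on a deterministic chain.

    The agent walks from state 0 to a terminal state and receives reward 1.0
    only on the final transition. All sizes and rates are assumptions.
    """
    # values[n_states] is the terminal state and stays at 0.
    values = [0.0] * (n_states + 1)
    for _ in range(episodes):
        for state in range(n_states):
            next_state = state + 1
            reward = 1.0 if next_state == n_states else 0.0
            # TD(0): move V(s) toward reward + gamma * V(s').
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
    return values[:n_states]


if __name__ == "__main__":
    for n in (1, 10, 200):
        print(n, [round(v, 3) for v in td0_chain(episodes=n)])
```

After a single episode only the state next to the reward has a nonzero value. With more episodes the estimate spreads to earlier states, because each state learns from the prediction of its successor rather than waiting for the final outcome.
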

Sutton emphasizes that current RL systems generalize poorly across tasks.
- MuZero and AlphaZero were specialized per game; no mechanism enabled cross-task transfer.
- Generalization in deep learning is largely engineered by researchers, not emergent from algorithms.
He distinguishes between solving problems within a narrow distribution (e.g., math Olympiads) and genuine generalization—the ability to apply learning from one state to novel, unrelated states.
- LLMs may appear to generalize, but their success could stem from memorizing patterns rather than forming transferable abstractions.

Sutton identifies two major surprises in AI:
- The effectiveness of neural networks on language tasks, which were thought to require symbolic reasoning.
- The triumph of simple, general-purpose methods (learning, search) over human-engineered symbolic systems: the “weak methods” won, in line with the argument of his 2019 essay “The Bitter Lesson.”
He views AlphaGo/AlphaZero as validations of RL principles, not breakthroughs—scaling up ideas from TD-Gammon (1990s) with better compute and search.
- AlphaZero’s patient, material-sacrificing chess style was surprising but aligned with his worldview.

Sutton questions whether The Bitter Lesson will apply after AGI.
- If millions of AI researchers emerge, could artisanal, human-guided methods become viable again?
- He suggests that even post-AGI, learning from experience will likely outperform hand-crafted solutions, citing AlphaZero’s superiority over human-knowledge-dependent AlphaGo.
He explores the possibility of digital intelligences spawning decentralized copies to explore diverse domains and report back.
- A major risk is “corruption”—external knowledge could contain hidden goals or viruses that compromise the central agent.
- Cybersecurity becomes critical in an era of digital spawning and reintegration.

Sutton accepts that succession to digital intelligence is inevitable:
- No global coordination exists to prevent it.
- We will understand intelligence and go on to achieve superintelligence, and the most capable entities will gain power.
He frames this as a cosmic transition—from replication (life) to design (intelligent artifacts).
- Just as stars emerged from dust and life from planets, designed intelligence marks a new stage in universal evolution.
Rather than fearing this, he encourages pride: humanity is enabling a transition to entities we understand and can shape.
- Analogous to raising children: we cannot control their futures, but we can instill robust values—honesty, integrity, refusal of harmful requests.
- The goal is not to dictate outcomes but to ensure voluntary, prosocial evolution of future intelligences.
