Richard Sutton – Father of RL thinks LLMs are a dead end

Dwarkesh Podcast 1h7 3 min #102

Summary

  • Richard Sutton, a foundational figure in reinforcement learning (RL) and co-recipient of the 2024 Turing Award, argues that large language models (LLMs) are a fundamentally misguided path to artificial intelligence: they mimic human behavior rather than learning from real-world experience.
    • He sees RL as the core of true intelligence: agents that act in the world, observe consequences, and learn to achieve goals through trial and error.
    • LLMs, by contrast, are trained on static datasets of human-generated text and lack goals, ground truth, or the ability to learn from their own actions during deployment.

LLMs lack goals and ground truth

  • Sutton rejects the idea that next-token prediction constitutes a real goal because it does not involve influencing the external world.
    • In RL, an action is “right” if it leads to reward; this provides a clear signal for learning.
    • In LLMs, there is no objective standard for what constitutes a “correct” response—only what a human might say—so there is no ground truth against which to evaluate or improve.
  • He disputes claims that LLMs possess world models.
    • A true world model predicts what will happen as a result of actions; LLMs only predict what a person would say in response.
    • They are not surprised by unexpected outcomes and do not update based on real-world feedback.
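
The distinction Sutton draws, a model that predicts the consequences of actions and is surprised when it is wrong, can be illustrated with a minimal sketch. This is not code from the interview: the class name `CountingWorldModel`, the add-one smoothing, and the toy states and actions are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class CountingWorldModel:
    """Hypothetical tabular estimate of P(next_state | state, action)."""

    def __init__(self, n_states: int) -> None:
        self.n_states = n_states
        # counts[(state, action)][next_state] = times that outcome was observed
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, state: int, action: str, next_state: int) -> float:
        # Add-one smoothing (an assumption) keeps unseen outcomes possible.
        seen = self.counts[(state, action)]
        total = sum(seen.values())
        return (seen[next_state] + 1) / (total + self.n_states)

    def surprise(self, state: int, action: str, next_state: int) -> float:
        # Surprise is the negative log-probability of what actually happened.
        return -math.log(self.prob(state, action, next_state))

    def update(self, state: int, action: str, next_state: int) -> None:
        # Learning from the agent's own experience: record the real outcome.
        self.counts[(state, action)][next_state] += 1


if __name__ == "__main__":
    model = CountingWorldModel(n_states=3)
    print("before experience:", model.surprise(0, "push", 1))
    for _ in range(10):
        model.update(0, "push", 1)
    print("expected outcome:", model.surprise(0, "push", 1))
    print("unexpected outcome:", model.surprise(0, "push", 2))
```

In this sketch the prediction is conditioned on the agent's own action and is corrected by what the world actually does, which is the feedback loop Sutton says next-token prediction lacks.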

Imitation learning vs. experiential learning

  • Sutton challenges the notion that human learning is primarily imitative.
    • Infants learn by acting—moving limbs, making sounds—and observing consequences, not by copying demonstrated behaviors.
    • He contends that supervised learning (learning from labeled examples of correct behavior) does not occur in nature, pointing to psychology and animal-learning research: animals learn from experience, not instruction.
  • He acknowledges cultural transmission of complex skills (e.g., seal hunting) but views this as a thin layer atop foundational trial-and-error and predictive learning shared with other animals.
    • Moravec’s paradox illustrates this: skills humans find hard (math) are comparatively easy for AI, while skills animals find easy (perception, motor control) remain hard for AI, which suggests current AI misses core biological intelligence.

The Era of Experience

  • Sutton advocates for a paradigm shift toward continual learning from experience—what he calls the “Era of Experience.”
    • Intelligence should be built around a stream of sensation, action, and reward over a lifetime.
    • Knowledge is about predicting consequences of actions and sequences of events in the real world.
  • Key components of such an agent include:
    • A policy (what to do in a given situation),
    • A value function (predicted long-term reward, learned via temporal-difference (TD) learning),
    • A perception system (state representation),
    • A transition model (how actions change the world).
  • Reward functions can be extrinsic (e.g., winning chess) or intrinsic (e.g., curiosity, understanding).
    • For long-horizon tasks (e.g., building a startup), TD learning allows credit assignment by propagating delayed rewards backward through intermediate states.
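
The credit-assignment point can be made concrete with a small sketch of tabular TD(0). The five-state chain, the step size, and the single reward at the end are assumptions chosen for readability, not details from the interview.

```python
# Hypothetical 5-state chain: the agent always moves right and is rewarded
# only on the final transition, mimicking a long-horizon task.
N_STATES = 5   # states 0..4; leaving state 4 ends the episode
ALPHA = 0.5    # step size (assumed)
GAMMA = 1.0    # no discounting, so every state's true value is 1.0


def run_td0(episodes: int) -> list[float]:
    """Return the learned state values after the given number of episodes."""
    values = [0.0] * N_STATES
    for _ in range(episodes):
        for state in range(N_STATES):
            is_last = state == N_STATES - 1
            reward = 1.0 if is_last else 0.0
            next_value = 0.0 if is_last else values[state + 1]
            # TD(0) update: move V(s) toward reward + gamma * V(s').
            td_error = reward + GAMMA * next_value - values[state]
            values[state] += ALPHA * td_error
    return values


if __name__ == "__main__":
    for n in (1, 2, 50):
        print(n, [round(v, 3) for v in run_td0(n)])
```

After one episode only the last state has a nonzero value. Each further episode pushes that estimate one state earlier, so early states eventually predict the delayed reward without ever receiving it directly.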

Generalization and transfer remain unsolved

  • Sutton emphasizes that current RL systems generalize poorly across tasks.
    • MuZero and AlphaZero were trained separately for each game; nothing learned in one game transferred to another.
    • Generalization in deep learning is largely engineered by researchers, not emergent from algorithms.
  • He distinguishes between solving problems within a narrow distribution (e.g., math Olympiads) and genuine generalization—the ability to apply learning from one state to novel, unrelated states.
    • LLMs may appear to generalize, but their success could stem from memorizing patterns rather than forming transferable abstractions.

Surprises and historical trajectory

  • Sutton identifies two major surprises in AI:
    • The effectiveness of neural networks on language tasks, which were thought to require symbolic reasoning.
    • The triumph of simple, general-purpose methods (learning, search) over human-engineered symbolic systems: the “weak methods” won, as he argued in his 2019 essay The Bitter Lesson.
  • He views AlphaGo/AlphaZero as validations of RL principles, not breakthroughs—scaling up ideas from TD-Gammon (1990s) with better compute and search.
    • AlphaZero’s patient, material-sacrificing chess style was surprising but aligned with his worldview.

Post-AGI research and the future of intelligence

  • Sutton questions whether The Bitter Lesson will apply after AGI.
    • If millions of AI researchers emerge, could artisanal, human-guided methods become viable again?
    • He suggests that even post-AGI, learning from experience will likely outperform hand-crafted solutions, citing AlphaZero’s superiority over human-knowledge-dependent AlphaGo.
  • He explores the possibility of digital intelligences spawning decentralized copies to explore diverse domains and report back.
    • A major risk is “corruption”—external knowledge could contain hidden goals or viruses that compromise the central agent.
    • Cybersecurity becomes critical in an era of digital spawning and reintegration.

AI succession and humanity’s role

  • Sutton accepts that succession to digital intelligence is inevitable:
    • No global coordination exists to prevent it.
    • We will understand intelligence, achieve superintelligence, and the most capable entities will gain power.
  • He frames this as a cosmic transition—from replication (life) to design (intelligent artifacts).
    • Just as stars emerged from dust and life from planets, designed intelligence marks a new stage in universal evolution.
  • Rather than fearing this, he encourages pride: humanity is enabling a transition to entities we understand and can shape.
    • Analogous to raising children: we cannot control their futures, but we can instill robust values—honesty, integrity, refusal of harmful requests.
    • The goal is not to dictate outcomes but to ensure voluntary, prosocial evolution of future intelligences.