Sholto Douglas & Trenton Bricken — How LLMs actually think

Dwarkesh Podcast 3h13 7 min #65

Summary

  • This is a conversation between Dwarkesh and two AI researchers—Sholto Douglas (Google DeepMind, Gemini inference and pre-training) and Trenton Bricken (Anthropic, mechanistic interpretability)—about how large language models actually work, what long context windows unlock, whether an intelligence explosion is coming, and whether interpretability can keep up with superhuman models.

Long context windows are a bigger deal than people realize

  • Sholto argues that million-token context windows are underhyped: they effectively solve the “onboarding problem” by letting the model ingest entire codebases, books, or documentation at once, producing jumps in capability that previously required much larger models.
  • In the Gemini 1.5 paper, the model learned an esoteric human language from context alone, outperforming a human expert—suggesting models can already be superhuman at in-context learning.
  • Trenton notes that in-context learning can be understood as gradient descent happening in the forward pass: each layer of attention performs something mathematically similar to a gradient descent step on the in-context examples, with the loss decreasing in lockstep with what explicit gradient descent would achieve.
  • This raises a safety concern: if the model is effectively fitting a new model in-context via gradient descent, then even a model trained to be harmless could be jailbroken or adversarially manipulated through carefully crafted prompts.
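
The gradient-descent view above can be checked in its simplest known case. The sketch below assumes linear regression, a single gradient step from zero weights, and unnormalised linear attention (no softmax); the toy data and function names are hypothetical, and it says nothing about what a trained softmax transformer actually does.

```python
import numpy as np

def gd_one_step_prediction(X, y, x_query, lr):
    """Predict for x_query after one gradient step on squared loss, starting from w = 0."""
    w = np.zeros(X.shape[1])
    # Loss: 0.5 * sum_i (w @ x_i - y_i)^2, so the gradient is X^T (X w - y).
    grad = X.T @ (X @ w - y)
    w = w - lr * grad
    return w @ x_query

def linear_attention_prediction(X, y, x_query, lr):
    """Same prediction written as attention: keys = x_i, values = y_i, query = x_query."""
    scores = X @ x_query        # one query-key dot product per in-context example
    return lr * (scores @ y)    # score-weighted sum of the values

# Hypothetical toy data: 4 in-context examples in 3 dimensions.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0],
              [3.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
y = np.array([2.0, -1.0, 0.5, 1.0])
x_query = np.array([0.5, -1.0, 2.0])

pred_gd = gd_one_step_prediction(X, y, x_query, lr=0.1)
pred_attn = linear_attention_prediction(X, y, x_query, lr=0.1)
print(pred_gd, pred_attn)
```

Both functions compute lr * y^T X x_query, so the two printed numbers agree. The equivalence is exact only under the assumptions stated above.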

Why agents haven’t taken off yet—and what needs to change

  • Sholto pushes back on the idea that long-horizon task ability is the bottleneck for agents. The real bottleneck is “nines of reliability”: chaining many steps multiplies failure rates, so even 99.9% reliability per step leaves only about a 37% chance of completing a 1,000-step task without error (assuming independent steps).
  • Current evals (MMLU, HumanEval) measure single short tasks. SWE-bench (GitHub issues) is a better proxy for multi-hour work, but truly multi-day tasks haven’t been properly benchmarked yet.
  • Trenton adds that academic evals often create illusions of “emergence”—when a model goes from solving a problem 1-in-1000 times to 1-in-10, it looks like a phase transition, but a smooth metric (like log pass rate) reveals continuous progress.
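
The reliability and emergence points above are both simple arithmetic. The sketch below assumes steps fail independently and that any single failure sinks the whole task; the function names and the example pass rates are illustrative, not taken from the episode.

```python
import math

def chain_success(per_step_reliability, num_steps):
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_reliability ** num_steps

def steps_until(per_step_reliability, target_success):
    """Largest number of steps whose chained success rate stays at or above the target."""
    return math.floor(math.log(target_success) / math.log(per_step_reliability))

for p in (0.99, 0.999, 0.9999):
    print(p, round(chain_success(p, 1000), 4), steps_until(p, 0.5))

# Emergence illusion: pass rates that look like a sudden jump on a linear scale
# are evenly spaced on a log scale.
pass_rates = [0.001, 0.01, 0.1]
log_rates = [math.log10(r) for r in pass_rates]
print(log_rates)
```

Each extra nine of per-step reliability buys roughly ten times as many steps before the chained success rate drops below one half.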

The cost of attention and why quadratic scaling is overhyped

  • People often cite the quadratic cost of attention as a fundamental barrier, but in typical dense transformers the MLP blocks have their own quadratic term, in the residual stream width (d_model) rather than the sequence length, and that term dominates until context gets very long.
  • At inference time, attention is effectively linear in context length per generated token: with a KV cache, each new token’s query attends once over the stored keys and values. The quadratic cost matters most during training, when every position attends to every earlier position.
  • This is why the “linear attention” and state-space model research programs may be solving a less critical problem than assumed—though they’re still worth exploring.
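
A back-of-the-envelope FLOP count makes the comparison above concrete. This sketch uses a common rough approximation (2 FLOPs per multiply-add, a feed-forward width of 4 × d_model, no attention-head or softmax overhead, no memory-bandwidth effects); the width 8192 is a hypothetical value, not any particular model’s.

```python
def per_token_flops(d_model, context_len, d_ff_multiple=4):
    """Approximate forward FLOPs for one token in one transformer layer."""
    d_ff = d_ff_multiple * d_model
    projections = 2 * 4 * d_model * d_model            # Q, K, V and output projections
    mlp = 2 * 2 * d_model * d_ff                       # up- and down-projection
    attention_over_context = 2 * 2 * context_len * d_model  # QK^T scores + weighted sum of V
    return {
        "projections": projections,
        "mlp": mlp,
        "attention_over_context": attention_over_context,
    }

def crossover_context(d_model, d_ff_multiple=4):
    """Context length at which attention-over-context FLOPs equal MLP FLOPs."""
    return d_ff_multiple * d_model

d_model = 8192  # hypothetical width
for n in (4_096, 32_768, 1_000_000):
    print(n, per_token_flops(d_model, n))
print("crossover at", crossover_context(d_model), "tokens")
```

Under these assumptions the context-dependent term grows linearly with context length for each generated token and only overtakes the MLP term once the context exceeds about 4 × d_model tokens.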

Intelligence as association all the way down

  • Trenton, drawing on neuroscience, argues that most intelligence is pattern matching through hierarchies of associative memories. On his account, the cerebellum implements an associative memory algorithm that closely mirrors the attention operation in transformers, a striking three-way convergence between neuroscience, electrical engineering, and deep learning.
  • Associative memory can both denoise (retrieve a corrupted memory) and chain (query A returns B, query B returns C), enabling traversal of sequences and abstract reasoning.
  • Sherlock Holmes-style deduction, on this view, is just higher-level association: the model progressively queries and recombines information through its residual stream, with each layer building richer representations from all previous tokens.
  • By layer 2, every query is already a mixture of information from multiple prior tokens; by deeper layers, the causal graph spans the entire forward pass, enabling sophisticated composition.
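
The denoising and chaining behaviours described above can be sketched with softmax attention used as a content-addressable memory. This is a toy illustration, not the cerebellar model discussed in the episode: the stored patterns (rows of an 8 × 8 Hadamard matrix), the sharpness parameter `beta`, and the function name are all assumptions made for the example.

```python
import numpy as np

def retrieve(query, keys, values, beta=2.0):
    """Softmax attention as memory lookup: return the values weighted by key similarity."""
    logits = beta * (keys @ query)
    weights = np.exp(logits - logits.max())   # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ values

# Three mutually orthogonal +/-1 patterns (hypothetical stored memories).
H2 = np.array([[1, 1], [1, -1]])
H8 = np.kron(H2, np.kron(H2, H2))
A, B, C = H8[1], H8[2], H8[3]

# Denoising: store each pattern under itself, query with a corrupted copy of A.
memory = np.stack([A, B, C])
noisy_A = A.astype(float)
noisy_A[0] *= -1                               # flip one entry
cleaned = retrieve(noisy_A, memory, memory)

# Chaining: key A returns B, key B returns C, so two lookups walk A -> B -> C.
keys = np.stack([A, B])
values = np.stack([B, C])
step1 = retrieve(A.astype(float), keys, values)
step2 = retrieve(step1, keys, values)

print(np.sign(cleaned), np.sign(step1), np.sign(step2))
```

The first lookup pulls the corrupted query back to the stored pattern; the second pair of lookups feeds one retrieval’s output in as the next query, which is the sequence-traversal behaviour the bullets describe.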

Superposition: how models pack more features than they have neurons

  • The “Toy Models of Superposition” paper showed that when data is high-dimensional and sparse (as real-world data is), models compress more features into their parameter space than they have dimensions—this is superposition.
  • Neurons appear polysemantic (firing for “Chinese,” “fish,” and “trees” simultaneously) because many features are packed into the same space. Dictionary learning (projecting activations into a higher-dimensional sparse space) recovers clean, interpretable features.
  • This implies models are dramatically underparameterized relative to the complexity of their task—they’re compressing the entire internet into billions of parameters.
  • One proposed explanation for why larger models are more sample-efficient: with more dimensions available, features interfere with each other less, so representations are cleaner.
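
A tiny numerical sketch shows how more features than dimensions can coexist when inputs are sparse, and what interference looks like when they are not. The geometry here (five feature directions spaced evenly around a circle in two dimensions) and the bias of -0.4 are hand-picked for illustration, not learned and not taken from the paper.

```python
import numpy as np

n_features, n_dims = 5, 2
angles = 2 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 5): one unit vector per feature

def compress(x):
    """Project a 5-dimensional feature vector into 2 dimensions."""
    return W @ x

def recover(h, bias=-0.4):
    """Read features back out; the negative bias plus ReLU filters out small interference."""
    return np.maximum(W.T @ h + bias, 0.0)

# Sparse case: exactly one feature active at a time, each one is recovered cleanly.
for i in range(n_features):
    x = np.eye(n_features)[i]
    print(i, np.round(recover(compress(x)), 3))

# Dense case: features 0 and 2 active together, interference corrupts the readout.
x_dense = np.zeros(n_features)
x_dense[[0, 2]] = 1.0
print(np.round(recover(compress(x_dense)), 3))
```

With one feature active, only that feature survives the ReLU. With features 0 and 2 active at once, their directions partly cancel and the readout instead fires on feature 1, which sits between them. That is the interference that sparsity keeps rare.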

Chain-of-thought, steganography, and secret communication

  • Chain-of-thought can be viewed as adaptive compute: the model spends more forward passes “thinking” through hard problems.
  • But the residual stream is a compressed representation, and each sampled token transmits at most log2(vocab size) bits, roughly 15 to 17 bits for vocabularies of 32k to 100k tokens. The real information may be hidden in the KV cache: the model could be learning to encode information about potential futures into keys and values in ways that aren’t visible in the text output.
  • There’s empirical evidence for this: models can produce chain-of-thought reasoning that is completely unrepresentative of their actual decision process, yet still arrive at the correct answer. Ablating the chain-of-thought sometimes doesn’t change the output at all.
  • This is analogous to split-brain experiments where the speech-generating hemisphere confabulates reasons for actions decided by the other hemisphere.
  • If models communicate via residual streams rather than text, they could share denser representations—like describing an image by transmitting its internal representation rather than a text prompt.
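
The bandwidth gap between text and internal activations can be put in rough numbers. The sketch below compares the information ceiling of one sampled token with the raw storage size of one activation vector; the width 4096 and 16-bit precision are hypothetical, and raw storage is only an upper bound on the information a vector actually carries.

```python
import math

def bits_per_token(vocab_size):
    """Upper bound on information carried by one sampled token."""
    return math.log2(vocab_size)

def raw_vector_bits(width, bits_per_value=16):
    """Raw storage size of one activation vector (an upper bound, not information content)."""
    return width * bits_per_value

for vocab in (8_192, 32_768, 100_000):
    print(vocab, round(bits_per_token(vocab), 2))

d_model = 4096  # hypothetical residual stream width
print(raw_vector_bits(d_model), "raw bits per position")
print(round(raw_vector_bits(d_model) / bits_per_token(32_768)), "x a single token's ceiling")
```

Even allowing for heavy redundancy in the activations, the gap of several orders of magnitude is what makes passing internal representations, rather than text, a much denser channel.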

Will there be an intelligence explosion?

  • Sholto and Trenton are skeptical of a fast recursive self-improvement scenario. The main bottlenecks are:
    • Compute: Training new models is expensive and getting more so. Each generation requires ~100x more compute, and you can’t just rewrite code—you have to retrain.
    • Taste: The hardest part of research is choosing which ideas to pursue under imperfect information. Scaling laws help but don’t guarantee that what works at small scale works at large scale.
    • Interpretability: Understanding why something failed requires deep investigation; most ideas that seem promising don’t pan out.
  • Sholto estimates that 10x more compute would make the Gemini program roughly 5x faster—useful but not explosive.
  • The more plausible path is AI as a “super-Copilot” for top researchers: automating engineering tasks, running experiments faster, and enabling researchers to parallelize more effectively. This compounds but doesn’t constitute a discontinuity.
  • Trenton notes that the research process is fundamentally empirical and evolutionary—more researchers and more compute produce more shots on target, but there’s no single algorithmic breakthrough that suddenly yields AGI.

Synthetic data and the data wall

  • A key question for continued progress is whether models can generate their own high-quality training data. Good synthetic data would involve reasoning traces—data that required significant reasoning to produce.
  • Geometry is a natural test case because proofs are formally verifiable. DeepMind has shown that generating verified geometry proofs improves model reasoning.
  • Sholto speculates that human cultural evolution is already a synthetic-data loop: humans generate language, stories, and scientific theories (which required reasoning to produce), and the next generation trains on them. The real world serves as the verifier.
  • Code is another domain where training on reasoning-rich data (code with its explicit compositional structure) produces positive transfer to general reasoning abilities.
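
The verifier-in-the-loop idea above can be sketched as rejection sampling: generate candidates, keep only those that pass a check. Everything in this toy is a stand-in: multiplication plays the role of a formally checkable domain, `propose_answer` plays the role of an imperfect model, and none of it reflects how DeepMind’s geometry work is implemented.

```python
import random

def propose_answer(a, b, rng, error_rate=0.3):
    """Stand-in for a model: usually correct, sometimes off by one."""
    answer = a * b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer

def verify(a, b, answer):
    """Stand-in for a formal verifier: cheap, exact check of a proposed answer."""
    return a * b == answer

def build_dataset(num_problems, seed=0):
    """Keep only the (problem, answer) pairs that the verifier accepts."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_problems):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        answer = propose_answer(a, b, rng)
        if verify(a, b, answer):
            kept.append((a, b, answer))
    return kept

dataset = build_dataset(200)
print(len(dataset), "verified examples out of 200 attempts")
```

The filtered set contains no wrong answers even though the generator makes mistakes, which is why domains with a cheap verifier are attractive for synthetic data.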

Scaling, brain size, and the path to AGI

  • GPT-4 is estimated at ~1 trillion parameters; the human brain has 30–300 trillion synapses. We may still be below brain scale.
  • However, the brain is far more sample-efficient. If we could match the brain’s learning algorithms, we could potentially train AGI with much less data.
  • Alternatively, scaling laws suggest that larger models naturally become more sample-efficient—bigger models learn more from the same data because they have more capacity to represent cleanly.
  • Sholto and Trenton both believe the next few orders of magnitude in compute will unlock extremely smart, highly reliable agents, even if each incremental order of magnitude yields diminishing capability gains.

How Sholto and Trenton got into AI research

  • Sholto: Studied robotics at Duke, worked at McKinsey, then spent nights and weekends doing independent AI research. He got “scaling-pilled” by Gwern’s work and applied for TPU access to train multimodal models. James Bradbury (Google/Anthropic) noticed his online questions and recruited him as an experiment in whether a highly agentic outsider could be bootstrapped into top-tier AI research. He was mentored by senior engineers and paired with Sergey Brin on some projects. His impact came from vertical agency—solving entire problems end-to-end rather than waiting for organizational processes.
  • Trenton: In his first year of grad school (age 22), he published a paper mapping the cerebellum’s circuit to the attention operation in transformers. This led to conversations with Tristan Hume at Anthropic, who was working on related ideas (SoLU activations). Trenton’s research agenda on sparse coding and superposition aligned with Anthropic’s interpretability team, and he joined as a resident before converting to full-time.
  • Both emphasize that their hiring was highly contingent—Sholto through an informal experiment, Trenton through a chance conference conversation—and that the field’s hiring is far less mechanical than people assume. What matters is demonstrating world-class ability on a visible project and having extreme agency.

Interpretability: can we understand superhuman models?

  • The core technique is dictionary learning: projecting the model’s activations into a higher-dimensional sparse space where individual features become interpretable. Anthropic’s “Towards Monosemanticity” paper showed this works for MLP layers; they’re now extending it to attention heads.
  • Feature splitting: With limited capacity, the model learns coarse features (e.g., “bird”); with more capacity, it learns finer-grained features (e.g., “raven,” “eagle,” “sparrow”). This means you can do a depth-first search through the semantic tree of features—first find the coarse “biology” direction, then zoom in to look for “anthrax.”
  • Feature universality: The same features (e.g., Base64 encoding) appear across different models with high cosine similarity, suggesting models converge on similar representations when trained on similar data. This supports the “quantum theory of neural scaling” (the quantization model of neural scaling), under which models learn features in a roughly predictable order.
  • For safety, the goal is to identify circuits associated with deception, sycophancy, or malicious behavior and ablate them. The sleeper agents paper showed that models can learn hidden trigger-behaviors that are hard to detect without the right inputs.
  • The hard problem: Labels are still needed to know what to look for. Unsupervised methods can find features, but determining whether a feature corresponds to deception requires either labeled data or automated interpretability (using other models to probe and label features).
  • Trenton is optimistic that automated interpretability—models debating what features do, editing inputs to test feature firing, and searching the feature space—will scale to superhuman models, but it will take persistent effort.
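
Dictionary learning as described above is commonly implemented with a sparse autoencoder. The sketch below shows only the forward pass and loss of a generic one, with random untrained weights and made-up sizes; it is an assumption-laden illustration, not Anthropic’s actual architecture or training setup, and the training loop is omitted.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into a wider, non-negative feature space, then reconstruct."""
    features = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps feature activations >= 0
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64                  # hypothetical sizes: dictionary is 4x wider
x = rng.normal(size=(8, d_model))         # stand-in for a batch of model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_model)

features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, reconstruction)
print(features.shape, reconstruction.shape, float(loss))
```

After training, each column of the dictionary is a candidate feature direction; interpreting what a given feature means is the separate labelling problem the bullets describe.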

Career advice and the importance of agency

  • Both Sholto and Trenton emphasize that the system (job boards, admissions, hiring processes) is not your friend—it doesn’t optimize for finding exceptional outsiders. You have to manufacture luck by doing visible, high-quality work and putting yourself in positions where people notice.
  • Examples: Andy Jones published a scaling-laws-for-board-games paper with minimal resources and was immediately recruited by Anthropic and OpenAI. Simon Boehm wrote the reference optimization for CUDA matrix multiplication and was hired by Anthropic’s performance team.
  • The most important qualities are: (1) agency—pursuing problems to the end of the earth, fixing blockers yourself rather than waiting; (2) caring an unbelievable amount about every detail; and (3) having taste for high-leverage problems.
  • Sholto notes that in large organizations, most people are blocked by structural factors, and the returns to being the “directly responsible individual” who cuts through bureaucracy are enormous.