What rebuilding AlphaGo teaches us about self-play, RL, and future of LLMs - Eric Jang — Dwarkesh Podcast

Eric Jang, former VP of AI at 1X Technologies and senior research scientist at Google DeepMind Robotics, spent his sabbatical rebuilding AlphaGo from scratch to understand how deep learning solved a problem long considered intractable for search. The project, which once required millions of dollars and a full DeepMind team, can now be done for a few thousand dollars of rented compute thanks to LLM coding assistance and modern GPUs. The conversation covers how Go works, how AlphaGo uses Monte Carlo Tree Search (MCTS) and neural networks to make the game tractable, why its self-play training is so sample-efficient compared to modern LLM RL, and what this tells us about the future of AI research and automated science.

How Go works

Go is a two-player territory game played on a 19×19 grid where black and white stones are placed on intersections, with the goal of surrounding more territory than your opponent.
- Stones are captured when all four orthogonal neighbors (not diagonals) are occupied by the opponent, cutting off “oxygen.”
- The game ends when both players pass consecutively or one resigns.
- Tromp-Taylor rules are the unambiguous, algorithmic rule set used by all Go AIs, differing slightly from human rules (e.g., allowing “suicide” moves that immediately resolve to death).
- Scoring under Tromp-Taylor counts controlled stones plus empty intersections not touched by the opponent’s stones, which can differ significantly from human intuition about who controls an area.
- A key strategic feature: you can lose a local battle (sacrifice stones) to win the war (gain positional advantage elsewhere), and this micro-macro tradeoff becomes richer as board size increases.

Why Go is computationally hard

The game tree for Go is astronomically large: roughly 361 possible first moves, decreasing by one each turn, with games lasting 250–300 moves, giving a branching factor that makes exhaustive search impossible (on the order of 361³⁰⁰, far more than atoms in the universe).
- There is no local reward signal—you don’t know who’s winning until the very end of the game.
- This is why computer scientists long believed Go was intractable this century.
- AlphaGo’s core breakthrough was using neural networks to make this search problem tractable by shrinking both the breadth and depth of the tree.

Monte Carlo Tree Search (MCTS)

MCTS is the search algorithm that decides which moves to play, iteratively building a sparse tree rather than exhaustively enumerating all possibilities.
- Each node in the tree stores: a visit count (how often the parent visited this child), a mean action value Q(a) (average outcome from this node), a prior probability P(a) of taking this action, and references to child nodes.
- The tree is rebuilt from scratch on every move—after the opponent moves, the AI starts a new search from the new root node.
- Action selection uses PUCT (Predicted Upper Confidence with Trees): argmax of Q(s,a) + CPUCT × P(a) × √N / (1 + Nₐ), balancing exploitation (high Q) with exploration (low visit count Nₐ).
  - The exploration term rewards trying actions you haven’t tried before; as visit counts grow, the Q term dominates.
  - This is derived from the UCB1 bandit algorithm but adapted for sequential games with large action spaces.
- The four-step simulation loop: Selection (walk down the tree using PUCT), Expansion (add a new leaf node when you reach an unvisited state), Evaluation (use the neural network to guess the value of the new node), and Backup (propagate the evaluated values back up the tree by taking running means).
- The final visit counts across the tree become the policy distribution—the AI’s “vote” for which move is best.

The neural network: policy and value heads

AlphaGo uses a single neural network with two heads: a value network (predicts probability of winning from a given board state) and a policy network (predicts a distribution over good moves).
- The board is encoded as a multi-channel image (black stones, white stones, empty intersections) and fed into a backbone network.
- Architecture choice (ResNet vs. Transformer) matters less than expected; ResNets tend to outperform Transformers at low budgets because their local convolution inductive bias captures local patterns well, though Transformers can better connect global features across the board.
- The value network acts as a shortcut: instead of searching to the end of the tree, the network “glances” at the board and predicts who wins, amortating thousands of possible playouts into a single forward pass.
- The policy network prunes the breadth of search by focusing exploration on promising moves rather than uniform random sampling.
- In the original AlphaGo Lee, the network was initialized with supervised learning on expert human games, which already produced a strong “shoot from the hip” player without any search.

Self-play training: MCTS as an improvement operator

The key insight: MCTS produces a better policy (π_MCTS) than the raw network policy (π_θ), and you can train the network to imitate the search result, effectively distilling search into the network.
- On every move, the network makes an initial guess, MCTS refines it through hundreds to thousands of simulations, and the refined distribution becomes the training target.
- The network is trained to minimize KL divergence between its policy output and the MCTS visit count distribution, and to predict the game outcome (win/loss) for the value head.
- This creates a virtuous cycle: a better network → better MCTS → better training targets → an even better network.
- Test-time scaling: more simulations improve win rate, but with diminishing returns. Distillation shifts the curve upward—the distilled network starts at a higher baseline, so the same number of simulations reaches a stronger policy.
- This is analogous to the DAgger algorithm in robotics: relabeling actions in a trajectory with better ones, even if the original trajectory was losing.
- MCTS is not guaranteed to be better than the raw policy—if the value function is inaccurate or simulations are too few, the search can give worse recommendations. Grounding the value function with real endgame playouts or diverse training data is critical.

Why MCTS-based RL is more sample-efficient than policy gradient methods

In modern LLM RL (REINFORCE-style), you reinforce entire trajectories based on a single binary win/loss signal, which is extremely high-variance—most moves in winning games are neutral or even slightly bad, and you only get one learning signal per ~300-move game.
- Karpathy’s analogy: “sucking supervision through a straw.”
- The variance of the gradient grows quadratically with trajectory length T, making long-horizon RL especially inefficient.
- AlphaGo avoids this by getting a supervision signal for every single move in every game, regardless of outcome: MCTS says “given this state, here’s a better action,” giving dense, low-variance labels.
- This is why AlphaGo can hill-climb from the very beginning of training, always operating in a regime where the learning signal is informative, rather than spending most of training at near-zero pass rate.
- Bits-per-sample analysis: supervised learning with cross-entropy loss on a 100K vocabulary gives negative log(p) bits of information, which is highest when the model is most wrong. RL with binary success/failure gives at most 1 bit (the entropy of a coin flip), and almost zero bits when the pass rate is very low—which is exactly where untrained models spend most of their time.

Alternative RL approaches and connections

Neural Fictitious Self-Play (NFSP): Used in imperfect-information games like StarCraft where you can’t easily build a search tree. Instead of MCTS, you fix an opponent and train a best-response policy against it using model-free RL (PPO, SAC, etc.), then distill that best response into a mixed strategy. This is a proxy for the MCTS teacher when forward search isn’t feasible.
Q-learning connection: Q-learning propagates value estimates backward through TD learning (Q(s,a) = r + γ max Q(s’,a’)), which is structurally similar to MCTS backups. Both exploit the recursive structure of MDPs, but MCTS plans over trajectories the agent hasn’t visited yet (forward search), while Q-learning plans over visited trajectories (backward learning).
Why MCTS doesn’t directly apply to LLMs: Language has an enormous, effectively continuous action space (vocabulary of 100K+ tokens), making discrete action-selection heuristics like PUCT inappropriate. There’s no reliable way to locally evaluate whether a partial reasoning chain is “good” without solving the whole problem. However, domains with more rigid logical structure (like mathematics) may be more amenable to tree search.

Off-policy training and replay buffers

AlphaGo uses a replay buffer of states from various policy iterations, relabeled with MCTS—this is off-policy training, which can be unstable if the buffer contains states the current policy would never visit.
- The DAgger perspective explains why some off-policy data is useful: you want states your policy would visit, plus a tube of nearby states it might drift to, with labels that funnel it back to optimal behavior.
- Too much off-policy data (states the policy can never reach) wastes model capacity on irrelevant states.
- An experiment: instead of playing full games, randomly sample board states from a dataset and relabel them with MCTS. This works moderately well and resembles robotics setups with offline data and Bellman updaters, but risks training on unreachable states.
- Modern RL has converged toward on-policy methods for stability, using off-policy data mainly for variance reduction (advantage estimation) rather than direct optimization.

Compute efficiency and scaling

AlphaGo Zero used ~3×10²³ FLOPS, comparable to a frontier LLM, but Eric reproduced strong results for ~$10K of rented compute using LLM coding assistance and modern Blackwell GPUs.
- Being the first to do something always costs more compute than catching up; subsequent efforts can use distillation, best-response training against existing strong bots, and other crutches.
- Many of KataGo’s clever algorithmic tricks (auxiliary supervision objectives, complex architectures) become less necessary with faster GPUs and good initialization.
- Training on smaller boards (9×9) first and transferring to 19×19 significantly cuts warm-start time by resolving endgame value functions cheaply.
- Scaling laws in board games (Andy Jones, 2021) show you can trade off test-time compute (more MCTS simulations) against training compute, anticipating the inference-scaling paradigm now seen in LLMs.
- Eric’s attempt to apply scaling laws from the start failed because you need a working, bug-free system before you can meaningfully study scaling—scientific understanding follows engineering success.

Automated AI research

Eric used LLM coding assistants (Opus 4.6/4.7) extensively for this project, giving early indications of what automated research might look like.
- What works well: Hyperparameter optimization over open-ended search spaces, executing experiments end-to-end (describe what you want plotted, and the LLM runs experiments, compiles results, and suggests causes), and squeezing performance out of fixed compute budgets.
- What doesn’t work well: Selecting the next experiment in a research track, lateral thinking about whether a track is worth pursuing, and distinguishing bugs from fundamentally wrong ideas. Current models lack the high-level belief in what “should work” that helps human researchers persevere through bugs.
- Go as an RL environment for training automated scientists: it captures many research-relevant problems (distributed systems, scaling, architecture search) with a fast, verifiable outer loop (win rate against KataGo). Skills developed here might transfer to biosciences, robotics, or automating AI research itself.
- The inner-loop verification problem: it’s hard to know whether a modification is an improvement or a degradation without a working system. The outer-loop problem: even with a verifiable game, knowing which research direction matters (e.g., discovering scaling laws vs. producing another random plot) is genuinely difficult.
- Parallelizability of AI innovation: many compute multipliers don’t stack well because they provide correlated, transitory benefits. As GPUs get stronger, yesterday’s clever tricks matter less. Having a single top-down vision of how things should work remains important.

Broader implications

AlphaGo’s most profound lesson: a 10-layer neural network can amortize a nearly intractable search problem into a single forward pass, compressing what feels like NP-class complexity into a small amount of compute. This challenges our understanding of computational hardness—these problems are worst-case hard but have rich structure that neural networks exploit.
- The connection to chaos theory: Go is chaotic (sensitive to initial conditions), but macroscopic quantities (who’s winning) are predictable, similar to how weather is chaotic at the microscale but hurricane tracks are forecastable. Neural networks may be especially good at capturing these macroscopic structures.
- Cryptographic hash functions are also chaos-sensitive but lack macroscopic structure, which may explain why neural networks can’t approximate them—there’s no “value function” for hashing.
- The duality between thinking and search: MCTS is explicit forward search, while LLMs may implicitly learn something like it. Whether tree search makes a comeback for LLM reasoning depends on finding domains with enough structure to support local evaluation and discrete action selection.

Summary

How Go works

Why Go is computationally hard

Monte Carlo Tree Search (MCTS)

The neural network: policy and value heads

Self-play training: MCTS as an improvement operator

Why MCTS-based RL is more sample-efficient than policy gradient methods

Alternative RL approaches and connections

Off-policy training and replay buffers

Compute efficiency and scaling

Automated AI research

Broader implications