Jacob interviews Noam Shazeer (co-inventor of the Transformer, now leading Gemini at Google) and Jack Rae (DeepMind researcher, key figure in Gemini and test-time compute) about the current state of AI, the Gemini 2.0 models, test-time compute, open-source competition, and the future of AGI.
Gemini 2.0 and Test-Time Compute
Gemini 2.0 Flash Thinking was initially focused on reasoning tasks like math and code, but a surprise was how well “thinking” generalized to creative tasks like essay writing, where the model’s internal thought process was interesting to read and the output quality improved.
The team was initially skeptical about the intense focus on math benchmarks, but those benchmarks are important because they resist simple memorization and actually distinguish models that can reason about difficult problems.
Benchmarks saturate quickly: problems considered hard six months ago become trivial, and once evals become public they get leaked into training data, making them useless. There’s a shared responsibility across labs to develop private, meaningful evals.
Reinforcement Loops as the Key Milestone
The most important milestone is when Gemini X can write Gemini X+1 — using AI as a tool to accelerate AI development itself.
Other reinforcement loops include data flywheels (user feedback making models better at what people care about) and the global excitement/funding flywheel.
Within Google’s structured monorepo, AI is already being pulled into development workflows: AI-generated bug fixes and code reviews are being merged, and agentic coding is seen as a major area of progress.
Models work best in easily verifiable domains (math, code). For less verifiable domains, the path forward involves training models to follow broad, abstract criteria for quality and then using reinforcement learning against those reward signals — something that seemed abstract a year ago but is now working.
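The contrast above between verifiable and less verifiable domains can be made concrete with a small sketch of a verifiable reward for math. This is an illustration only, not the method used for Gemini: the `Answer:` marker convention and the function names `extract_final_answer` and `verifiable_reward` are assumptions introduced here.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the token after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is a hypothetical output format for this sketch.
    """
    matches = re.findall(r"Answer:\s*(\S+)", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Numeric answers are compared exactly as fractions, so '1/2' and '0.5'
    count as equal. Non-numeric answers fall back to string equality.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        return 1.0 if answer.strip() == reference.strip() else 0.0


if __name__ == "__main__":
    sample = "Half of one is one half. Answer: 1/2"
    print(verifiable_reward(sample, "0.5"))
```

A reward like this is cheap and unambiguous, which is why math and code are convenient for reinforcement learning. For essay quality or similar domains there is no such checker, so the reward would have to come from a learned model scoring against broad criteria.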
Multimodal and Agentic Capabilities
Gemini’s image input plus thinking is remarkably strong for visual reasoning. Pairing multimodal understanding with agentic tasks (like browser automation in Mariner, launched December 2024) is a major area of excitement.
For agents to become widely used, both reasoning complexity and reliability need to improve. Part of the challenge is defining the right environments (web UI, codebases) for agents to operate in — this is as important as any algorithmic breakthrough.
Switching to agentic research has non-trivial engineering costs: it’s no longer simple prompt-response but acting in environments, which changes how research is done.
Test-Time Compute and the Path to AGI
Inference is extraordinarily cheap: millions of tokens per dollar, orders of magnitude cheaper than reading a book or paying a human. This creates huge room to apply more compute at inference time through chain-of-thought and other methods.
Training costs scale roughly quadratically with model size (a larger model is also trained on proportionally more data), while inference stays relatively cheap, so applying more compute at inference is the natural next scaling curve.
However, test-time compute alone won’t get to AGI. Acting in complex environments is a separate, essential component that requires dedicated investment.
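The scaling claim about training versus inference cost can be checked with rough arithmetic. The sketch below uses two common rules of thumb that are not from the interview: training takes about 6 FLOPs per parameter per training token, and generating one token takes about 2 FLOPs per parameter. The default of 20 training tokens per parameter is likewise an assumed compute-optimal ratio, used only to show the shape of the curves.

```python
def training_flops(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate total training FLOPs as 6 * N * D.

    D (training tokens) is assumed to grow in proportion to N (parameters),
    which is what makes the total quadratic in model size.
    """
    tokens = tokens_per_param * params
    return 6.0 * params * tokens


def inference_flops_per_token(params: float) -> float:
    """Approximate FLOPs to generate one token as 2 * N (linear in size)."""
    return 2.0 * params


if __name__ == "__main__":
    for n in (1e9, 1e10, 1e11):
        print(
            f"params={n:.0e}  "
            f"train={training_flops(n):.2e} FLOPs  "
            f"infer/token={inference_flops_per_token(n):.2e} FLOPs"
        )
```

Under these assumptions, doubling the parameter count quadruples training cost but only doubles the cost of each generated token, which is the asymmetry that makes spending more at inference time attractive.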
The aspiration is “deep thinking” models that don’t just think longer to solve a problem, but think deeply enough to generate useful knowledge and dramatically improve data efficiency — like a mathematician who reads one textbook and spends most of their time playing with ideas.
Early signs of models acting like researchers: math is pivoting from benchmark performance to actually generating useful mathematical knowledge. The goal is a gradient of evals from current benchmarks to genuinely novel scientific contributions.
The hardest part is question-posing, not solving. If a model can identify genuinely novel, useful mathematical questions and then solve them, that could be one of the greatest contributions to science — like completing the map of known mathematics.
On “Learning to Mimic People” Critiques
Yann LeCun’s critique that LLMs can only interpolate known ideas and never create novel ones gets a direct answer: even interpolation across disjoint bodies of knowledge (e.g., materials science) would dramatically accelerate science.
Noam’s response: rather than arguing about definitions of novelty, just build AI, increase technology, and help people.
AI Research Culture and Breakthroughs
AI research is still like alchemy — highly experimental, with hypotheses formed after trying things out. Credit assignment is complicated and often arbitrary.
The Transformer’s invention involved random happenstance: Noam heard about related work happening nearby, and ideas like “kill the LSTM” and “parallelize but not just a convnet” were in the air. It could easily not have happened when it did.
Google historically used a bottom-up compute allocation model, while DeepMind was more top-down. Both have tradeoffs: top-down helps collaboration on large runs, and bottom-up allows abstraction-breaking ideas that don’t fit categories. The current approach blends both.
A lesson from concentrated bets: you need research leads with vision, because not everyone can see where impact will come from. But bottom-up research consistently humbles top-down planning by producing unexpectedly impactful results.
Shifting Timelines and Open Source
Both have shifted their timelines forward: the rate of progress is much faster now, and paradigm shifts create sudden acceleration.
The speed of information propagation has changed: the Transformer took 6-9 months to spread within Alphabet; the test-time compute paradigm saw multiple labs train and release competitive models within months of announcement.
Open-source models are staying competitive with frontier closed-source models. Gemma 3 (released the day before the interview) was described as “completely incredible.” The time gap between closed and open source is shrinking, even if the quality gap persists.
Both are impressed by the passion, creativity, and compute access in the open-source community.
Personal Impact and Education
Noam’s four-year-old son uses Gemini as a personalized encyclopedia — taking pictures of plants and lizards in the garden, learning Latin names, and absorbing detailed information. This represents a type of education that may never have existed before.
Children are sponges for information, and combining that curiosity with AI could be transformative.
Noam has stopped worrying about global warming because he expects AI to solve carbon-related problems.
Both feel that what they do now matters more, since human physical labor may not be necessary in the future — making it more important to do meaningful work now.
AGI Risks
Moderately worried: it’s hard to find examples of creating something far more intelligent than its creator that still acts predictably and usefully.
Practical concerns include making sure AI is constructive to the economy and avoiding sharp disruptions in employment.
There’s a balance between excitement to ship capable models and having internal safety teams thinking holistically about unintended consequences.
Noam’s Reflections on Character.AI
Noam left Google to start Character.AI because he believed the LLM industry needed an application where anyone could interact with LLMs and discover use cases — a mission largely accomplished now that everyone talks to LLMs.
Character.AI found that many people use it for entertainment, partly because early models hallucinated, and hallucination is a feature in entertainment.
The future of AI companions: people will always want human relationships for spiritual meaning, but AI in human form will likely grow for productivity and interaction. Progress depends both on models getting better and on product decisions about what to let users do.
Quickfire
Overhyped: The ARC-AGI eval. Progress has been slow because researchers aren’t inspired by synthetic puzzles; they’d rather model natural language, which is more AGI-relevant in the long run.
Underhyped: AGI itself — LLMs are still massively underhyped; people think in terms of trillion-dollar products, but the real impact is far larger.
Most interesting application to build: Agentic applications beyond coding (which is crowded) — things that go beyond chat to actually go out and do useful things. Noam also thinks code is underhyped because humans aren’t naturally good at it, and automated software engineering creates a self-accelerating loop.
Infrastructure for test-time compute vs. pre-training: Inference can be much more distributed than large-batch training, so it doesn’t need strong interconnects between data centers. Actors can gather experience in many locations and send it back. This is intrinsically cheaper and drives prices down. The challenge is that token-by-token generation loses the parallelism of training: each generated token requires reading the stored attention keys and values, so decoding becomes memory-bound. Addressing that takes work on both model architecture and hardware (Google’s co-design link with the TPU team helps here).
Where to try the latest: A new, considerably stronger Flash model with thinking is out on the Gemini app. User feedback is incorporated into each model series.
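The memory-bound point in the infrastructure answer above can be illustrated by estimating how large the stored attention keys and values become. This is a rough sketch for a standard Transformer decoder. The function name `kv_cache_bytes` and the example configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit values) are hypothetical and do not describe any Gemini model.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,
) -> int:
    """Bytes needed to store keys and values for one sequence.

    The leading factor of 2 accounts for keeping both a key and a value
    vector per layer, per key/value head, per token in the context.
    """
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to show the arithmetic.
    size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=8192)
    print(f"{size / 2**30:.2f} GiB per sequence")  # prints "1.00 GiB per sequence"
```

Each newly generated token attends over this whole cache, so the cost of a decoding step is dominated by moving these bytes rather than by arithmetic, and it grows linearly with context length. Long chains of thought make the context longer, which is one reason architecture and hardware choices matter for test-time compute.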