Noam Brown is a research scientist at OpenAI who was central to the development of o1, a model that scales “test time compute” by letting the model think longer and reason through hard problems step by step. He previously worked at FAIR on AI for games like poker and Diplomacy, where he first saw how powerful scaling inference-time search could be. This episode was recorded the day before o1’s full public release and covers where model capabilities are heading, why test time compute matters, and what it will take to reach superhuman AI.
Scaling Pre-Training vs. Test Time Compute
Pre-training scaling is not hitting a hard wall, but it is hitting economic limits: each additional 10x of scale costs dramatically more (training costs have gone from thousands to potentially hundreds of millions of dollars per frontier model), and eventually the cost becomes intractable.
Test time compute is where the low-hanging fruit is: it’s still early, analogous to the GPT-2 era when scaling laws were first discovered, and there is enormous room for both more compute and algorithmic improvements.
Brown thinks about the ceiling in dollar terms: a typical ChatGPT query costs about a penny, but for some high-stakes problems society would be willing to spend millions — roughly eight orders of magnitude more — suggesting massive headroom.
He expects progress to continue across pre-training, test time compute, and algorithmic improvements, but test time compute offers the most runway right now.
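The "eight orders of magnitude" headroom figure above can be checked with a quick calculation. This is a minimal sketch: the one-cent query cost comes from the summary, while the exact $1,000,000 budget is an assumed stand-in for "millions", and the variable names are illustrative only.

```python
import math

# Figures from the summary: about one cent for a typical query.
typical_query_cost_usd = 0.01
# Assumption: "millions" is taken as exactly $1,000,000 for illustration.
high_stakes_budget_usd = 1_000_000

# Ratio between what society might pay and what a query costs today.
ratio = high_stakes_budget_usd / typical_query_cost_usd

# Number of powers of ten separating the two figures.
orders_of_magnitude = round(math.log10(ratio))

print(f"Headroom: about {orders_of_magnitude} orders of magnitude")
```

A larger assumed budget, such as $10 million, would add one more order of magnitude, so the figure is best read as a rough lower bound on the headroom Brown describes.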
How o1’s Emergence Changed Brown’s Timeline
In late 2021, Brown told Ilya Sutskever he thought AGI was at least a decade away because there was no general way to scale inference compute. He was skeptical that pre-training alone could get to superintelligence — models at the time couldn’t even reliably play tic-tac-toe.
He was surprised when Ilya agreed that pre-training alone wouldn’t be enough and was already thinking about test time compute scaling.
Within about two to three years, the team at OpenAI saw early signs of life. Brown now believes the remaining unsolved research questions are unlikely to be harder than the ones already solved.
The pivotal moment came around October 2023: when they simply had the model “think for longer,” desired behaviors like breaking problems into steps, trying different strategies, and self-correcting emerged on their own — without being explicitly engineered.
Why o1 Came Out of OpenAI — and Why That Matters
When Brown joined OpenAI, most frontier labs agreed pre-training alone wouldn’t reach superintelligence, but OpenAI was particularly bought in, partly motivated by the “data wall” problem.
The research direction started exploratory, with many failed attempts, until one experiment showed promising signs of life and leadership recognized its potential and scaled investment.
Brown sees it as a sign of organizational strength that OpenAI — the company that pioneered large-scale pre-training — was willing to invest heavily in a disruptive, orthogonal approach rather than falling into the innovator’s dilemma.
Where o1 Fits Today and Where It’s Going
Right now, GPT-4o and o1 are complementary: 4o is faster and better for many everyday tasks (and possibly creative writing), while o1 excels at hard reasoning tasks like research-level problems, math, and coding.
The goal is a single model that can do both — respond instantly when appropriate and engage in deep thinking when needed.
Brown is particularly excited about o1’s potential for coding and for enabling more agentic behavior: the model can autonomously figure out intermediate steps for complex, multi-step tasks, which was fragile or impossible with previous models.
He expects scaffolding techniques (chaining model calls, custom prompting, specialized workflows) to matter less over time as models like o1 scale with more data and compute, echoing Richard Sutton’s “Bitter Lesson” that general methods relying on computation beat hand-coded knowledge in the long run.
Evaluating Models and What Matters
Brown’s personal go-to evaluation is tic-tac-toe, which remains surprisingly hard for many models — he jokes it’s because there aren’t enough five-year-olds posting tic-tac-toe strategy on Reddit.
He values “vibe” testing with everyday questions and watching qualitative behavioral changes (like self-correction and strategy-switching) more than benchmark numbers alone.
On benchmarks, math and coding are where o1 shows particularly strong results, and he expects those domains to continue pulling ahead.
AI Beyond Language Models
Brown is excited about using AI for social science and neuroscience experiments — training on vast human data lets models imitate human behavior in game-theory settings (like the ultimatum game), offering a cheaper, more scalable, and sometimes more ethical alternative to human subject experiments.
He sees AI-to-AI interaction (negotiation, emergent communication) as a natural next step, made easier by the fact that LLMs already share a human language.
On robotics, he expects slower progress because hardware iteration is inherently harder and more expensive than software, but he believes it will advance.
He is most excited about AI advancing scientific research — not replacing researchers but partnering with them to push the frontier of human knowledge in domains like chemistry, biology, and theoretical math.
Broader Views
On hardware: o1 shifts the paradigm toward inference compute, creating opportunities for new hardware optimized for inference rather than just massive pre-training runs. He’s excited by investments like Amazon’s Trainium.
On academia: PhD students are in a tough spot because frontier capability research requires data and compute they lack. He encourages them to focus on novel architectures and approaches that show promising scaling trends, even if they don’t beat frontier labs on current evals.
On AGI definition: Brown has moved away from the term, noting AI will likely remain worse than humans at many physical tasks for a long time. He thinks the more important concept is AI that accelerates human productivity.
On 2025: he expects progress to accelerate.
Overhyped: prompting and scaffolding techniques that will likely be made obsolete by better models. Underhyped: o1 itself — he thinks the broader world hasn’t yet grasped what it means.
His message to skeptics: he was skeptical too, but the progress has been astounding, and the test time compute paradigm addresses many concerns about hitting a wall. He encourages people to look at the evidence directly.