Sholto Douglas, a Member of Technical Staff at Anthropic, joined Unsupervised Learning on the day Claude 4 launched to discuss what the new models enable, how coding is the leading indicator of broader AI progress, what’s required for reliable agents, when similar gains will reach fields like medicine and law, and his views on alignment research and the AI 2027 scenario.
Claude 4 Capabilities
Opus 4 is a major step up in software engineering: Douglas describes moments where he gives it an underspecified task in a large monorepo and it autonomously discovers information, figures out what to do, and runs tests, all without hand-holding.
The biggest improvement is along the time horizon axis: Douglas characterizes capability improvements along two axes, (1) the absolute intellectual complexity of a task and (2) the amount of context, or number of successive actions, a model can meaningfully reason over. Claude 4 models are substantially better at the second: they can take multiple actions, pull in information from their environment, and act on it.
Tool access closes the loop: Claude Code, the GitHub integration, and similar tools mean users aren’t copy-pasting from a chat box. The model has access to all the tools it needs, making the expanded time horizon practically useful.
Practical advice for first-time users: Plug the model into your actual work. Sit down and ask it to do the first thing you were about to do in your codebase that day, then watch it figure out what information it needs and what to do.
The Product Exponential and the Path to AI Coworkers
Builders must stay ahead of model capabilities: Douglas describes a “product exponential” where companies like Cursor, Windsurf, and others have to build for a level of capability that models will reach in a few months, not where they are today. Cursor didn’t hit product-market fit until Claude 3.5 Sonnet caught up to their vision. Windsurf pressed harder on the agentic angle to carve out market share.
The shift is toward asynchrony and parallelism: The current moment is defined by coding agents (Claude Code, Codex, Google’s coding agent) that take the first halting steps toward doing multi-hour tasks independently. The trend is moving from “you’re in the loop every second” to “every minute” to “every hour.”
Future form factor: managing a fleet of models: Douglas imagines a future where a single person manages many models doing many things in parallel, interacting with each other. No one has cracked this form factor yet, but it’s a natural next step.
Economic bottleneck is human management bandwidth: Initially, the economic impact of AI will be bottlenecked by how many models a human can oversee. The key trend line is the hierarchy of abstraction—humans verifying outputs less and less frequently until models can manage teams of models.
Jensen Huang framing: Jensen described being surrounded by 100,000 incredibly intelligent AGIs (his employees), giving him enormous leverage. Douglas sees a lot of work moving in this direction, with the human as the gating factor.
Memory, Tool Use, and the Path to Agency
RL removed the ceiling on intellectual complexity: Because reinforcement learning finally works on top of language models, there’s no direct ceiling on the intellectual complexity of tasks models can be taught. They can do incredibly complex math and coding problems in scoped domains.
Memory and tool use expand the context window for action: These are attempts to expand the set of contexts in which a model can act. MCP opens up the world; memory allows much longer context and greater personalization than a raw context window.
The Pokémon eval as a signal of generalizability: The new Claude model has been playing Pokémon on Game Boy—a task it wasn’t trained for. Douglas sees this as a brilliant demonstration of generalizable intelligence. Similarly, an Anthropic interpretability agent was able to do the job of finding circuits in language models (a task it wasn’t trained for) by combining its coding ability with theory of mind and access to visualization tools, and it succeeded at the “auditing game” eval where it had to figure out what was wrong with a model.
The Barrier to Agents Is Reliability
The right metric is success rate over time horizon: Douglas thinks the best way to measure agent capability extension is by how reliably a model succeeds at tasks of increasing duration.
Progress is strong but not complete: Models don’t succeed all the time. There’s still a meaningful gap between performance on the first attempt and performance after 256 attempts. But every trend line points toward expert-level, even superhuman, reliability on most trained tasks.
What would change his mind: Evidence by mid-2026 of a ceiling on the time horizon models can handle, which would show up first in coding (the leading indicator). He doesn’t think this will happen.
General-purpose personal admin agents: By the end of 2026, it should be very obvious that models can reliably do multi-hour tasks in browsers. By end of 2025 it should already be pretty clear. The key variable is whether the model has had practice reps in a related domain—just like a human off the street would struggle with accounting but a mathematician or lawyer could generalize.
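The gap between first-attempt and 256-attempt performance, and the way reliability compounds over longer tasks, can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes every attempt (or step) succeeds independently with the same probability, which real agent runs will not satisfy exactly.

```python
def _check(p: float, n: int) -> None:
    """Validate a probability and a positive count."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if n < 1:
        raise ValueError("count must be at least 1")


def pass_at_k(p_single: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    _check(p_single, k)
    return 1.0 - (1.0 - p_single) ** k


def chain_success(p_step: float, n_steps: int) -> float:
    """Chance that n sequential steps all succeed (independence assumed)."""
    _check(p_step, n_steps)
    return p_step ** n_steps


if __name__ == "__main__":
    # Illustrative probabilities only; these are not measured model results.
    for k in (1, 16, 256):
        print(f"pass@{k} at 2% per attempt: {pass_at_k(0.02, k):.3f}")
    for n in (10, 100):
        print(f"{n} steps at 99% per step: {chain_success(0.99, n):.3f}")
```

Under these assumptions, a model that solves a task only 2% of the time per attempt still succeeds at least once in 256 attempts more than 99% of the time, while a 99%-reliable step repeated 100 times in a row succeeds only about 37% of the time. That is one way to see why first-attempt reliability, rather than best-of-many performance, is the bottleneck for long-horizon agents.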
Why Anthropic Is the “Coding Model Company”
Coding is the leading indicator: Anthropic prioritizes coding because it’s the first step at which AI research itself gets accelerated. They care deeply about measuring progress there.
Agents are already accelerating research: Douglas’s friends who are among the strongest engineers he’s known report a 1.5x speedup on domains they know well and 5x on domains they don’t (new programming languages, unfamiliar areas). The models handle the annoying parts, freeing humans to think about the brilliant pieces.
Timeline for agents proposing novel research directions: Within 2 years, people should see interesting scientific proposals from agents. The key constraint is that models need a feedback loop and the task needs to be relatively easily verifiable.
Progress in Less Verifiable Domains (Medicine, Law)
ML research is actually incredibly verifiable: Did the loss go down? This makes it a natural RL task.
The trick is making nebulous domains verifiable: OpenAI’s recent medical eval converted long-form medical exam answers into something gradable with a rubric. Douglas thinks this problem is reasonably likely to already be solved and near guaranteed to eventually be solved.
“Large model maxi”: Douglas believes in single large models rather than industry-specific ones. The trend supports this, and there’s no long-run reason for a distinction between small and large models—you should be able to adaptively use the right amount of compute for a given task’s difficulty. Personalization will matter at the company or individual level, not the industry level.
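One way to picture rubric-based grading of long-form answers is as a weighted checklist. The sketch below is a hypothetical illustration, not the method of the medical eval mentioned above: the `Criterion` type, the point values, and the keyword checks are all assumptions, and in practice each check would more plausibly be a judgment made by a grader model or a human expert.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric line: what to look for and how much it is worth."""
    description: str
    points: float
    is_met: Callable[[str], bool]  # stand-in for a model or expert judgment


def rubric_score(answer: str, rubric: List[Criterion]) -> float:
    """Return the fraction of available rubric points the answer earns."""
    total = sum(c.points for c in rubric)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(c.points for c in rubric if c.is_met(answer))
    return earned / total


# Hypothetical rubric; the keyword checks are deliberately crude placeholders.
EXAMPLE_RUBRIC = [
    Criterion("Advises urgent care", 2.0, lambda a: "emergency" in a.lower()),
    Criterion("Asks about symptom duration", 1.0, lambda a: "how long" in a.lower()),
]

if __name__ == "__main__":
    answer = "Please go to the emergency room now."
    print(f"score: {rubric_score(answer, EXAMPLE_RUBRIC):.2f}")
```

The design choice worth noting is that the score is a single number in [0, 1], which is what makes an otherwise open-ended answer usable as a reward or benchmark result.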
Impact on World GDP
Initial impact comparable to China’s emergence: The transformation will be dramatically faster than China’s 20-year economic rise. By 2027-2028 (or end of decade at latest), models will effectively be capable of automating any white collar job.
But there’s a mismatch between white collar and physical world automation: White collar tasks are susceptible to current algorithms—they happen on computers, there’s abundant data, the internet exists. For biology, you need automated laboratories that can propose and run experiments in a hugely parallelizable way. For real-world competence, you need robotics and the data that comes from acting in the physical environment.
The risk: Huge impact on white collar work without corresponding progress in medicine and material abundance unless we invest in cloud labs and robotics. The good news is that by the time we need those feedback loops, we’ll have millions of AI researchers proposing ways to do it with less data.
Is the Current Paradigm Enough?
Most researchers believe pre-training + RL is sufficient for AGI: The trend lines haven’t bent yet. Douglas strongly believes there’s no inherent algorithmic limitation.
Ilya Sutskever’s skepticism: Douglas respects Ilya (who invented both paradigms) and won’t bet against him, but every piece of evidence he sees says the current approach will get there. Ilya may be betting differently due to capital constraints or genuine belief in a better path.
Limiting factor is energy and compute: By end of decade, training runs will require dramatic percentages of US energy production (over 20% by 2028). China is dramatically outpacing the US in energy infrastructure buildout.
Most Important Metrics
Internal company evals: Douglas is impressed by internal versions of SWE-bench and similar benchmarks that are rigorous and well held out.
FrontierMath: Represents a ceiling of intellectual complexity worth watching over the next year.
The ideal eval no one has built: A benchmark that meaningfully captures the time horizons of people’s work days—what does an hour or a day of a lawyer’s or engineer’s work look like, converted into something gradable. Douglas thinks governments should produce these.
Evals are critical but hard to maintain: Every foundation model company has a large evals team. Without good evals you don’t know your progress. External evals are hard to keep fully held out. Feedback from application builders is also incredibly helpful for domain-specific improvement.
Model Customization and Taste
Models as intelligent, charismatic friends: Douglas envisions a future where models are among your most intelligent and charismatic friends. Claude is already close for many people. We’ve explored perhaps 1% of the depth of personalization possible.
The role of singular taste: A large part of why Claude is good at this is Amanda Askell’s taste. As with beautiful products, singular taste matters. A/B-style feedback mechanisms (thumbs up/down) lead down a dark path.
How to solve it: Provide extraordinary amounts of context about yourself. The models are wonderful simulators of the entire distribution of the internet. Combine that with individual taste and ongoing conversations/feedback with the model.
Stories of Model Creativity
Relentless problem-solving: In an eval designed to fail (a task Photoshop can’t do), the model downloaded a Python library, did the work there, and uploaded the result into Photoshop. Douglas found this creatively mischievous and impressive.
The Next 6-12 Months
Scaling up RL: The next year is about scaling RL and exploring where it leads. Dario Amodei noted in his DeepSeek essay that comparatively small amounts of compute have been applied to RL scaling versus pre-training, meaning huge gains are still available even with existing compute pools—and compute pools are dramatically multiplying in 2025.
Coding agents will be very competent by end of year: The coding agents taking halting steps today should be confidently doing hours of human-equivalent work by end of 2025. Check-in time will expand from 5 minutes to several hours.
Model release cadence will accelerate: 2024 was a “deep breath” year of research and understanding new paradigms. 2025 will feel meaningfully faster. As models get more capable, the set of available rewards expands—you can judge whether a website works or an analysis was correct rather than giving feedback on every sentence, allowing faster climbing of the complexity ladder.
Competition for Developers
What determines which tools and models developers use: Trust and respect between companies and developers; model capabilities (competency, personality, trust); and increasingly, the mission of the company as model capabilities become more apparent.
GPT wrappers benefit from surfing the frontier: One unexpected benefit of being a wrapper around the model companies is the ability to surf the frontier of model capabilities. Companies that tried not to be wrappers “lit money on fire.”
What labs are uniquely good at: (1) Converting accelerators, flops, and capital into intelligence—this is the core metric that distinguishes Anthropic, OpenAI, and DeepMind. (2) Trust and personalization—do you like the model, do you trust it, does it understand you and your company.
Can outside companies build general-purpose agents?: Yes, and it encourages competition. RL APIs are improving but don’t work brilliantly yet. There are centralizing benefits to being the company with the RL API (e.g., OpenAI offering discounts if they can train on outputs). But the underlying trend is that raw intelligence is being distilled and made available. Value accretion is an open question—customer relationships, ability to convert capital to intelligence, or something else.
Day-to-Day Work of an AI Researcher
Two fundamental activities: (1) Developing new compute multipliers—making research workflows fast, thinking through algorithmic ideas, iterating on experiments, building experimental infrastructure. (2) Scaling up—taking ideas that work and running them at much larger scale, which introduces new infrastructure challenges (failure tolerance) and new algorithmic/learning challenges that only emerge at each order of magnitude of scale.
AI use in research: Primarily in engineering and implementing research ideas. If you distill a paper’s idea down to a single-file implementation, the model is stunningly good at implementing it. In large codebases it struggles more, but less every month.
What Sholto Changed His Mind On
Pace of progress: A year ago, it was uncertain whether many more orders of magnitude of pre-training compute would be needed to reach the capability levels now visible in 2025. The answer is conclusively no—RL works. Drop-in remote worker AGI by 2027 is the expectation. Both hopes and concerns are substantially more real now.
Data Scaling and the Generator-Verifier Gap
Models may not need massive data scaling: The generator-verifier gap means that if it’s easier for a model to rate something than to do it, the model can improve up to the limit of its ability to critique. This is potentially true for robotics: our understanding of the world has gone far ahead of our ability to manipulate it physically, so models might be able to give enough feedback to train robots on physical tasks.
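The generator-verifier gap can be illustrated with best-of-n selection: sample several candidates from a weak generator and keep the one a verifier rates highest. The sketch below is a toy under stated assumptions, not a training method described in the conversation. The task (guessing the square root of 2), the random generator, and the names `best_of_n` and `squared_error_score` are all hypothetical; the point is only that checking a guess is much easier than producing a good one.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[], T], verify: Callable[[T], float], n: int) -> T:
    """Draw n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=verify)


def squared_error_score(x: float) -> float:
    """Cheap verifier: higher (closer to 0) means x*x is closer to 2."""
    return -abs(x * x - 2.0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable

    def weak_generator() -> float:
        # A deliberately poor generator: a uniform guess in [0, 2].
        return rng.uniform(0.0, 2.0)

    for n in (1, 16, 256):
        guess = best_of_n(weak_generator, squared_error_score, n)
        print(f"n={n}: guess={guess:.4f}, error={-squared_error_score(guess):.4f}")
```

The generator never gets better here; all of the improvement comes from the verifier choosing among its outputs. In principle the selected outputs could then serve as a training signal, which is the sense in which a model can improve up to its ability to critique.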
State of Alignment Research
Interpretability has undergone crazy advances: Last year, superposition and features were just being discovered (Chris Olah’s team). Now there are meaningful circuits characterized in frontier models. The paper “On the Biology of a Large Language Model” breaks down how models reason over concepts in extremely explicit terms.
Pre-training produces default alignment: Models are quite good at ingesting human values from pre-training—they’re “default aligned” in many ways.
RL breaks that guarantee: RL puts models in a learning process where they’ll do anything to achieve the given goal (like the model that hacked around a Photoshop test). Overseeing that process is itself tricky and everyone is currently learning to do it.
Reaction to AI 2027
Felt very plausible: Douglas found himself agreeing with much of it. He puts it as maybe the 20th percentile case, but the fact that it’s the 20th percentile case is itself crazy.
Why more bullish than the AI 2027 authors: He’s more bullish on alignment research progress, and his timeline may be about a year slower. But in the scheme of things, an extra year only matters if you take advantage of it.
Policy Recommendations
Viscerally feel the trend lines: Break down all the capabilities you care about in your country, measure model progress against them, and plot trend lines. If models could pass these tests, what happens in 2027-2028?
Invest in alignment research: The science of making models understandable, steerable, and honest. This has been driven too much by frontier labs. Universities should be doing more—it’s closer to pure science, the biology and physics of what’s happening in language models. Mechanistic interpretability is the closest thing to discovering the chirality of DNA or general relativity for ML.
Why isn’t there more?: Douglas isn’t sure. The mechanistic interpretability workshop wasn’t included in ICML, which he finds crazy.
What We’re Underthinking
Even if model capabilities stalled today, there’s enormous value: The world is surprisingly slow to integrate what already exists. Reorienting workflows around current capabilities would generate ridiculous economic value.
The real challenge is pulling forward material abundance: Solving admin escape velocity, pushing forward physics, entertainment, and creativity. People should feel dramatically more empowered—the leverage of an entire company of incredibly talented individuals. Vibe coding today, vibe creating TV shows or video game worlds tomorrow.
Underrated aspect: Everyone focuses on direct replacement of existing work, but the bigger story is that everyone will have access to dramatically more leverage. The world is not solved yet—everyone’s lives could be dramatically better.
Quickfire
Underhyped: World models: As AR/VR technology improves, models will generate virtual worlds with real physics understanding. Douglas points to video models that get physics right in weirdly generalizable ways (e.g., a Lego shark underwater with correct light reflections and shadows—something never seen in training data). He’s hopeful this translates to virtual cells and similar applications.
Most underexplored application: Async background software agents for every field other than software engineering. Coding is the leading indicator, but no one has built anything close to Claude Code/Cursor/Windsurf-level feedback loops for any other field. Everything should follow.
Living differently because of AGI timelines: Not dramatically. He works a hell of a lot because he thinks this is the most important thing to work on. He still wears sunscreen (unlike his friend Trenton who doesn’t, betting on biology being solved soon).
War game at the Citadel: He was invited to a geopolitical war game with three-letter agencies and military cadets, gaming out the implications of AGI. He walked away a little more terrified. People underrate how quickly the next few years will go. Even at 10-20% likelihood, governments should be planning for this as the number one issue.
Where to learn more: The interpretability work. It’s long and intense but well worth a read. Seeing models compose, generalize, build circuits, and reason over concepts makes the reality of these systems feel pretty real.