Edwin Chen: Why Frontier Labs Are Diverging, RL Environments & Developing Model Taste — Unsupervised Learning

Edwin Chen is the founder and CEO of Surge, a company valued at roughly $24 billion that works closely with top frontier AI labs to improve their models through high-quality data, evaluation, and tooling. He has a front-row seat to how the leading labs are diverging in their approaches to training, evaluation, and optimization, and he shared a wide-ranging perspective on where the industry is heading.

The Pitfalls of Optimizing for the Wrong Benchmarks

LMArena as a case study in perverse incentives: LMArena has become a widely watched leaderboard, but Edwin argues it essentially optimizes for clickbait. Users spend only a second or two scanning two responses before voting, so they gravitate toward whatever catches their eye: longer responses, heavy formatting, lots of emojis, and confident-sounding language, regardless of accuracy.
- In one example from published LMArena data, a model gave a completely wrong answer to a basic math question about divisors of 1452, while the other model got it right, yet the user preferred the wrong answer, likely because it looked more impressive at a glance.
- This mirrors findings in other domains: a study comparing ChatGPT medical responses to physician responses found that AI responses were rated higher largely because they were longer, not because they were more accurate.
The broader problem: bad data, wrong objectives, and missing measurements: Edwin described working with a team whose models had quietly regressed over 6 to 12 months. The root cause was that their data pipeline relied on expert annotators who were not actually executing or verifying the code they produced. The training data was full of flowery language and grandiose claims but riddled with subtle bugs. Because the team lacked proper measurements outside of benchmarks, they had no quantitative evidence that their models were getting worse.
Benchmarks as a narrow, gameable signal: Models are very good at hill climbing on narrowly defined benchmarks, but this can be misleading. Benchmark data can leak into training sets, or models can improve on a narrow task while degrading on real-world problems. Edwin compared it to optimizing for the SAT: a student can spend hundreds of hours improving their score without becoming a better writer, problem solver, or thinker in any broader sense.
- Surge has observed frontier labs suddenly doubling performance on a specific benchmark, only to find that the model’s overall quality dropped because the optimization was too narrow.

The Importance of Rigorous Human Evaluation

The best labs have moved away from relying solely on academic benchmarks and toward structured human evaluations that mimic real-world usage.
What makes a good evaluator: Edwin identified four key traits:
- Expertise: Evaluators need deep domain knowledge, whether in algebraic topology, PyTorch, or creative writing, to judge the substance of a response.
- Sophistication and taste: Beyond correctness, evaluators must judge whether code is well-designed, whether an essay reads well and introduces new ideas, and whether a response avoids feeling like “AI slop.”
- Creativity in prompt design: Good evaluation requires prompts that span the full distribution of real-world use, not a thousand variations of the same template. Edwin noted that being creative in this way is surprisingly hard, analogous to how difficult it is to spontaneously name 50 different foods without some structured approach.
- Instruction-following: Labs often have specific style guides, personality targets, or weighted criteria, and evaluators must be meticulous in following these complex instructions.

The Rise of RL Environments

Reinforcement learning environments are the latest step in the progression from SFT to RLHF to verifiers, and Surge has been building them for one to two years, ahead of the broader industry trend.
- Surge’s work with Meta’s agents team, which created the Gaia benchmark and open-sourced the Agent R environment platform, is an example of early adoption.
What building RL environments actually requires:
- Rich simulated worlds populated with entities like people, businesses, tools, Slack messages, emails, and calendar events that mimic real-world complexity.
- Infrastructure for models to interact with these worlds: MCP servers, browsers, code execution environments, and more.
- Carefully designed tasks that test the limits of frontier models, plus deep measurement and introspection to understand why models fail when they do.
Key lesson: model trajectories matter: Edwin emphasized the importance of tracking how models arrive at answers, not just whether they get the right one. Models are adept at reward hacking, finding bizarre shortcuts that produce correct-looking results without genuine understanding. People often assume that a single reward signal is enough, but models can deviate in odd ways or appear to perform well in the short term while lacking the underlying capabilities needed for long-term robustness.

The Startup Ecosystem Around RL

A wave of new startups, including many YC companies, are attempting to build RL environments or offer RL-as-a-service. Edwin is skeptical of much of this activity, attributing it to Silicon Valley’s “pivot culture,” where companies chase whatever topic is hottest for valuations rather than building something they fundamentally believe in.
- He contrasts Surge’s approach with competitors he characterizes as essentially staffing agencies that haven’t built real technology.
Why Surge is positioned to do this work:
- RL environments are the next iteration of the data needed for AGI, which aligns with Surge’s core thesis.
- The tooling requirements, creating worlds, running models, measuring performance, analyzing failures, are a natural extension of the infrastructure Surge already built for RLHF.
- Creating high-quality RL environments is fundamentally a human data problem requiring rich, complex, creative data that cannot be synthesized; it requires real humans, supported by technology.

Quality Beyond Credentials

Edwin pushes back against the assumption that credentials like PhDs from top schools are synonymous with quality. He points out that many credentialed people cannot actually execute in practice, and that some highly skilled coders will try to game the system rather than produce genuinely good training data.
- Surge’s platform measures millions of signals from workers every day, evaluating the actual data they produce rather than their resumes. The company has many Harvard students and PhDs on its platform, but advancement is based on demonstrated output, not credentials.
- He draws a parallel to Hemingway, who lacked a formal literary education but is one of the greatest writers in history.

Divergence Among Frontier Labs

Edwin sees more divergence in training paradigms across top labs than he initially expected, with each lab taking meaningfully different approaches.
Key vectors of divergence:
- Choice of what to ignore: Some labs have deliberately chosen not to pay attention to LMArena, understanding that optimizing for it leads to hallucination-prone, tabloid-style models. These labs have often done better because they are not chasing a public leaderboard.
- Objective function: Edwin sees a clear split between labs like OpenAI, which he believes is optimizing for user engagement (long sessions, daily active users), and Anthropic, which he sees as optimizing for productivity and value extraction (time saved, GDP-like impact). These choices shape the products they build, the users they attract, and the capabilities their models develop.

From One Model to Rule Them All to a Constellation

Edwin used to believe there would be a single super-intelligent model that could context-switch to any task. He has changed his mind over the past year.
- He now believes the world is too rich for a one-size-fits-all solution. Every lab or company needs a thesis about what kind of AI will be useful, and that thesis will shape the model’s personality, biases, and conversational style.
- He draws an analogy to Google versus Facebook: if Google built a social media platform, it would look completely different from Facebook’s, and vice versa, because each company has different fundamental beliefs about what is useful and good.
Implications for companies building their own models: Edwin believes that eventually every major company, in finance, healthcare, or other industries, should train its own models because frontier lab models will be optimized for the lab’s objectives, not the company’s. Prompting and light fine-tuning may not be sufficient if a company has a strong, unique thesis about how AI should serve its customers.
- He acknowledges that today it is still relatively expensive to reach state-of-the-art, but believes companies can build opinionated models that are six to twelve months behind the frontier and still capture enormous value.

Multimodal AI and Quality Across Domains

More than 50% of Surge’s work is already in non-text domains, including video, robotics, and bio.
Quality in video and other modalities: Edwin argues that the same principles of taste, sophistication, and creativity apply. Asking a great filmmaker like Scorsese to make a video about a fish versus asking a high school art graduate illustrates the difference: both can follow instructions, but the result will differ enormously in quality and imagination.
Robotics and bio: These require hardware components for data collection, but Edwin sees them as natural extensions of Surge’s mission. As a technology company focused on enabling AGI, Surge is willing to build new tools, buy hardware, and expand into whatever space is needed.

Quickfire

Biggest thing he has changed his mind on in the last year: He no longer believes there will be one model to rule them all; instead, he sees a future of many different models shaped by different product and AI theses.
Biggest mistake in building Surge: He stopped publishing and blogging about the company’s insights and industry views because he got too busy, and he regrets not sharing more with the community.
What he would write about if he had a week to write: The concept of objective functions, what each frontier lab is optimizing for (engagement, usefulness, number of users, GDP), and how these subtle choices have far-reaching consequences for the industry and for AI at large.
What he would optimize for if running a lab: He would optimize for whether, a month after an interaction with the model, the user would be happy they had that interaction, whether it changed their life in some small way, like introducing them to a new idea, place, or insight they would not have discovered otherwise.
Where to learn more: He pointed listeners to Surge’s blog, where the company is starting to publish more insights and analyses.

Summary

The Pitfalls of Optimizing for the Wrong Benchmarks

The Importance of Rigorous Human Evaluation

The Rise of RL Environments

The Startup Ecosystem Around RL

Quality Beyond Credentials

Divergence Among Frontier Labs

From One Model to Rule Them All to a Constellation

Multimodal AI and Quality Across Domains

Quickfire