Ion Stoica, co-founder of Databricks, Anyscale, and now LMArena, has spent decades shaping modern data and AI infrastructure. In this episode, he explains why he believes AI evaluation and reliability are the most critical unsolved problems holding back AI’s potential, why China is structurally outpacing the US in open-source AI, and where the biggest opportunities lie in AI infrastructure going forward.
LMArena: From Berkeley Research Project to $100M Startup
LMArena originated almost two years ago at Berkeley out of a practical need: evaluating Vicuna, a fine-tuned Llama model trained on ShareGPT data.
Initial evaluation efforts involved students manually comparing model outputs over pizza, which didn’t scale.
The team tried using GPT-4 as a judge (an approach that became known as “LLM as a judge”) just two weeks before GPT-4’s public release. It worked well, but people remained skeptical about whether LLM judgments aligned with human preferences.
This led to the creation of Chatbot Arena, a platform where users submit prompts, receive anonymized answers from two random models, and vote on which is better. The system uses Elo ratings (borrowed from chess and tennis) to produce dynamic, continuously updated leaderboards.
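To make the rating mechanism concrete, here is a minimal sketch of a standard Elo update applied to pairwise votes. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions, not values from the episode, and the production leaderboard may use a different fitting procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes, for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),  # user preferred model_x
    ("model_x", "model_y", 1.0),  # user preferred model_x
    ("model_x", "model_y", 0.0),  # user preferred model_y
]
for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

Each vote moves the two ratings by equal and opposite amounts, and an upset (a lower-rated model winning) moves them more than an expected result, which is what keeps the leaderboard responsive as new votes arrive.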
Two key motivations drove the Arena’s design:
Static benchmarks suffer from contamination (models trained on test data), analogous to students taking the same exam repeatedly.
Human preferences matter because most successful AI applications today involve a human in the loop.
The platform grew far beyond text chat to cover multimodal and task-specific evaluation (text-to-image, code generation, web generation, search engine evaluation) and introduced style control to factor out biases like verbosity or emoji usage.
A recent feature called Prompt Leaderboard lets users submit a prompt and receive a ranked list of best-performing models for that specific prompt, estimated by finding similar prompts in the existing vote data.
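One way to picture the "similar prompts" estimate is a similarity-weighted win rate over past votes. The sketch below is an assumption-heavy illustration, not LMArena's actual method: it uses word-overlap (Jaccard) similarity where a real system would more likely use learned embeddings, and the function names, vote format, and data are all hypothetical.

```python
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two prompts, in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def rank_models_for_prompt(prompt, votes, top_k=50):
    """Rank models by similarity-weighted win rate on the top_k closest past prompts.

    votes is a list of (past_prompt, winner, loser) tuples (hypothetical format).
    """
    scored = sorted(
        ((jaccard(prompt, past), winner, loser) for past, winner, loser in votes),
        key=lambda item: item[0],
        reverse=True,
    )[:top_k]
    wins, games = defaultdict(float), defaultdict(float)
    for similarity, winner, loser in scored:
        if similarity <= 0.0:
            continue  # unrelated prompts contribute nothing
        wins[winner] += similarity
        games[winner] += similarity
        games[loser] += similarity
    return sorted(
        ((model, wins[model] / games[model]) for model in games),
        key=lambda item: item[1],
        reverse=True,
    )


# Made-up vote data for illustration.
votes = [
    ("write a python function to sort a list", "model_a", "model_b"),
    ("python function to reverse a string", "model_a", "model_c"),
    ("write a poem about the sea", "model_b", "model_a"),
    ("write a haiku about autumn", "model_c", "model_a"),
]
ranking = rank_models_for_prompt("write a python function to merge two lists", votes)
print(ranking)
```

Because coding prompts dominate the weights for a coding query, the model that won those votes ranks first even though it lost the poetry votes.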
The project became a company because:
It was expensive to run (nearly $2 million in the past year for compute credits and grants).
Frontier labs began requesting pre-release evaluations.
The data unlocked many downstream questions (model swapping impact, personalized recommendations, category-specific rankings).
Scaling required professional backend infrastructure and UI/UX, impossible with a small student team.
Why Human Evaluation Still Matters
Most successful AI applications today have a human in the loop precisely because reliability is hard to guarantee.
Software development spends most of its energy on testing and debugging; AI is harder because models are black boxes.
Human evaluation captures things automated metrics miss, including legitimate preferences (e.g., better formatting, more articulate responses).
LLM judges have their own biases: position bias (preferring the first answer), verbosity bias, poor math skills, and favoritism toward models from their own family.
These biases exist because LLMs are trained on human-generated artifacts, reflecting human tendencies back at us.
The key to making evaluation generalizable is scale: more data enables micro-category analysis and personalized model recommendations.
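Position bias in particular has a commonly used mitigation that is easy to sketch: ask the judge twice with the answer order swapped, and count a win only when both verdicts agree. This is a general technique, not something the episode attributes to LMArena, and the `Judge` signature and the toy judges below are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# A judge takes (prompt, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # verdict flipped with the order, so it is not trusted


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure verbosity bias."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_verdict(always_first, "q", "x", "y"))
print(debiased_verdict(longer_wins, "q", "short", "a much longer answer"))
```

Swapping neutralizes the purely positional judge (its verdicts cancel into a tie) but does nothing about the verbosity-biased one, which still picks the longer answer in both orders. That is why the other biases need separate handling, such as the style control mentioned earlier.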
China’s Structural Advantage in Open-Source AI
Open-source models have caught up remarkably fast, and the best open-source models now come from China, not the US.
Developing frontier models requires three things: experts, data, and infrastructure. China has strong numbers in all three, with infrastructure catching up despite export controls.
The structural difference is collaboration:
In the US, development is siloed across frontier labs, all doing similar work in secret.
Academia in the US is largely locked out of pre-training and model development due to lack of resources. A few efforts exist (AI2, Stanford, Berkeley), but they’re under-resourced.
In China, there’s much stronger academia-industry collaboration (e.g., with Baidu, Alibaba, DeepSeek), enabling broader diffusion of innovation.
The main mechanism of innovation diffusion in the US is people leaving one company for another, which is far slower and less efficient than open collaboration.
Ion is skeptical of existential risk arguments for keeping models closed:
Humans are disproportionately driven by fear, so risk discussions need to be discounted for emotional bias.
Most risks AI enables (deepfakes, bomb-making information) predate AI; AI makes the knowledge-acquisition step easier but doesn’t change the dominant bottlenecks (acquiring materials, assembling undetected, delivering).
He sees no convincing evidence of genuinely marginal risks (new risks that didn’t exist before) enabled by AI.
AI Infrastructure: Opportunities and Challenges
The trend is toward vertically integrated infrastructure, with co-design across all layers from application to hardware.
Key infrastructure challenges:
Hardware heterogeneity: GPUs (Nvidia, AMD), TPUs, Trainium, plus diverse interconnects and communication stacks (Ethernet, InfiniBand, RDMA, NCCL) make optimization extremely complex.
Automated kernel generation: A major opportunity is automatically generating optimized low-level code for different accelerators.
Fine-grained overlap of communication and computation to maximize GPU utilization.
The many forms of parallelism used in model training and serving (model, data, tensor, pipeline, context, token, expert, sequence) make automated optimization essential.
For agents, the challenge is that the field moves so fast that stable frameworks and abstractions are hard to build. Standardization tends to emerge only when the pace of application-level evolution slows.
Some standardization is happening at lower levels: transformers, PyTorch, and the OpenAI API for inference. Post-training frameworks increasingly build on Ray and vLLM.
Reflections on Databricks and the AI Moment
Databricks got two things right after the ChatGPT moment:
Data remains as important as ever: Unified access to data across storage systems with governance (via the Lakehouse architecture and Unity Catalog) is critical for enterprises.
Aggressive pursuit of AI for enterprises after acquiring Mosaic: enterprises have unique, valuable data and need help extracting value from it and building AI-powered products.
AI was in Databricks’ DNA from the start: MLlib was one of the main libraries on top of Spark, and early customers bought Databricks specifically to do machine learning.
In hindsight, Ion questions whether building their own model (DBRX) was the right call given how many powerful open-source models have since been released.
What Ion Changed His Mind On
He expected more competition for Nvidia and hasn’t seen it materialize yet. He’s cautiously hopeful that Huawei, Google (TPUs), AWS (Trainium), and AMD could challenge Nvidia, but software ecosystem lock-in remains the biggest barrier.
He was pleasantly surprised by open-source progress, especially from China (he didn’t expect it to come from there).
Underestimated quantization: He initially thought users wouldn’t accept the performance trade-off, but quantization has been a game changer for efficiency.
Post-training and reasoning models have been more effective than he expected.
Reliability and hallucinations remain unsolved, especially with reasoning models.
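The quantization trade-off mentioned above can be illustrated with a minimal sketch of symmetric 8-bit quantization: weights are stored as small integers plus one scale factor, at the cost of a bounded rounding error. The weight values are made up, and real systems typically quantize per channel or per group with more elaborate schemes than this.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Illustrative weights, not taken from any real model.
weights = [0.42, -1.27, 0.05, 0.9, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q, scale, max_error)
```

Each weight now fits in one byte instead of four (for float32), and the reconstruction error per weight is at most half the scale. Whether that error is acceptable for model quality is the trade-off Ion says he initially misjudged.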
AGI, Measurability, and the Trajectory of Progress
Ion jokes that everyone will claim to be right about AGI because there’s no agreed-upon definition.
Historically, computers have gotten better than humans at an increasing number of tasks (calculators, chess, Go, image recognition), and this will continue.
Progress is fastest where there are clear, measurable outcomes and ground truth: math, coding, games, sciences with formal specifications.
Progress is slower in subjective domains (creative writing, literature) where there’s no objective ground truth.
Reward models are less efficient than ground truth by roughly an order of magnitude in terms of compute needed to reach a given accuracy.
For AI to generate novel scientific breakthroughs, the bottleneck won’t be idea generation but testing and verification of those ideas.
AI creativity can be thought of as massive brainstorming: generating many solutions and selecting the good ones. The selection problem is hard because even a selector that is 99% accurate per candidate will misjudge roughly 10,000 out of 1 million candidates, so at that scale some wrong picks are all but certain.
The best current AI applications are those where generating solutions is hard but verifying them is relatively easy.
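The arithmetic behind the selection problem can be checked with a short sketch. It assumes each candidate is judged independently with the same accuracy, which is a simplification; the function names are illustrative.

```python
import math


def expected_misjudgments(n_candidates: int, accuracy: float) -> float:
    """Expected number of candidates the selector gets wrong."""
    return n_candidates * (1.0 - accuracy)


def prob_at_least_one_error(n_candidates: int, accuracy: float) -> float:
    """Probability that at least one candidate is misjudged, assuming independence.

    Computed as 1 - accuracy**n in a numerically stable way.
    """
    return -math.expm1(n_candidates * math.log(accuracy))


for n in (100, 10_000, 1_000_000):
    print(n, expected_misjudgments(n, 0.99), prob_at_least_one_error(n, 0.99))
```

At 100 candidates an error somewhere is already more likely than not; at 1 million the selector is expected to be wrong about 10,000 times. This is why domains with cheap, exact verification (a test suite, a proof checker) are where generate-and-select works best today.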
Quickfire
Underhyped: Reliability. It’s discussed but not enough, and it’s the main thing holding back AI’s real-world impact.
Overhyped: The obsession with scaling laws. Recent post-training results show that with a powerful base model and a small amount of high-quality data, you can unlock new capabilities without massive scale.
Most exciting startup category: Code assistants (Cursor, Windsurf, etc.). They’re interesting because developers are early adopters and the tools fit naturally into existing workflows. They’re a “canary in the mine” for AI adoption in other professions. Open questions remain about handling large codebases and long-term maintainability of AI-generated code.
Where to learn more: The Sky Computing Lab at Berkeley for cutting-edge research, plus the websites of Databricks, Anyscale, and LMArena.