Jonathan Frankle is the Chief AI Scientist at Databricks, which he joined through the 2023 acquisition of MosaicML, where he was Chief Scientist. Databricks now serves over 12,000 customers on AI, and Frankle works directly with enterprises to determine when they should pre-train custom models, fine-tune existing ones, or simply use prompt engineering and retrieval-augmented generation (RAG). The conversation covers his management philosophy, the Mosaic-Databricks merger, practical guidance on model training decisions, the state of AI evaluations, domain-specific models, the future of AI architectures, policy and ethics, and where he sees the field heading.
Management and Team Motivation
Frankle uses unconventional incentives to motivate his team, including dyeing his hair blue when they hit performance targets and commissioning custom swords for team members and partners who deliver exceptional service.
For the DBRX model launch, the team set its own performance benchmarks, and Frankle offered his hair color as motivation. They hit the targets quickly and legitimately.
He believes in making people genuinely want to go the extra mile rather than relying solely on money or threats.
The Transformer Architecture Bet
Frankle has a well-known bet with Sasha Rush about whether Transformers will remain the dominant AI architecture, with equity in Hugging Face vs. MosaicML at stake (now easier to value post-acquisition).
He remains confident in the bet, arguing that good architectures are extremely rare—it took roughly a generation to move from LSTMs to Transformers.
Transformers won not necessarily because they are fundamentally superior to LSTMs but because they hit a sweet spot in the hyperparameter space and the field consolidated around them.
He pushes back against the common newcomer assumption that because the Transformer was a recent breakthrough, another one must be imminent. Science moves in leaps followed by long consolidation periods, not linear progress.
He challenges the audience to name what preceded Transformers in NLP: recurrent neural networks, specifically LSTMs—which are actually a year younger than Frankle himself.
Why MosaicML Merged with Databricks
The merger made strategic sense: Mosaic had a strong AI platform but no data platform; Databricks had a world-class data platform and was making progress on AI. Together they could accelerate both missions.
Neither side was initially looking to merge—both wanted to stay independent—but the logic was overwhelming.
The teams were culturally aligned: both leadership groups were heavily composed of academics and PhDs who understood each other’s working styles.
The initial connection between Databricks’ Ali Ghodsi and Mosaic’s Naveen Rao happened at the Cerebral Valley AI Conference, which Frankle witnessed firsthand.
Guidance on When to Pre-Train, Fine-Tune, or Prompt
Frankle’s core advice: start small and work your way up, justified at each step by rigorous ROI.
Step 1—Prompting: Spend a small amount (e.g., 20 cents on OpenAI or Llama via Databricks) to litmus-test whether AI can handle the task at all. Don’t overthink it; just try it.
Step 2—RAG (Retrieval-Augmented Generation): Bring your enterprise data to bear, since generic models won’t have telepathy about your internal data. This is what Databricks calls “data intelligence.”
Step 3—Fine-tuning: If you’re getting value, fine-tune to bake knowledge into the model. Higher upfront cost but better quality in a smaller, cheaper-to-run package.
Step 4—Continued pre-training: For more domain-specific needs, continue pre-training on your own data.
Step 5—Full pre-training from scratch: Rare, expensive, and not for the faint of heart, but Databricks is one of the few platforms that can do it.
He emphasizes that the measure of your data’s quality is whether you can build a useful AI system with it, and the measure of your evaluation is whether the system works for a real human. Don’t try to perfect data or evals in advance—iterate in a tight loop with the real world.
He sees a common mistake: companies thinking they can’t start AI until their data is perfect or they have the perfect evaluation. In reality, you only learn what “perfect” means by testing against real use cases.
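The step-by-step escalation above can be sketched as a simple loop: climb to the next rung only while its incremental value exceeds its upfront cost. This is a minimal illustration, not Databricks' method. The function name `justified_steps`, the `measured_value` callback, and every dollar figure except the 20-cent prompting test are hypothetical placeholders.

```python
from typing import Callable, List, Sequence, Tuple

# (step name, upfront cost in dollars). Only the 20-cent prompting figure
# comes from the text; every other cost is a made-up placeholder.
LADDER: Sequence[Tuple[str, float]] = (
    ("prompting", 0.20),
    ("rag", 1_000.0),
    ("fine-tuning", 20_000.0),
    ("continued pre-training", 200_000.0),
    ("pre-training from scratch", 2_000_000.0),
)


def justified_steps(
    ladder: Sequence[Tuple[str, float]],
    measured_value: Callable[[str], float],
) -> List[str]:
    """Return the steps whose incremental value exceeded their upfront cost.

    `measured_value(name)` is a hypothetical callback that returns the value
    (in dollars) observed or estimated for the system built at that step.
    The climb stops at the first step that does not pay for itself.
    """
    justified: List[str] = []
    previous_value = 0.0
    for name, upfront_cost in ladder:
        value = measured_value(name)
        incremental_gain = value - previous_value
        if incremental_gain <= upfront_cost:
            break  # ROI does not justify this rung; stay on the previous one
        justified.append(name)
        previous_value = value
    return justified


if __name__ == "__main__":
    # Illustrative values only: RAG pays off, fine-tuning does not (yet).
    example_values = {"prompting": 100.0, "rag": 5_000.0, "fine-tuning": 6_000.0}
    print(justified_steps(LADDER, lambda name: example_values.get(name, 0.0)))
```

In practice the value numbers would come from the tight real-world feedback loop described above rather than from a table fixed in advance.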
Building Effective AI Evaluations
Any evaluation is a proxy for the real world. The best starting point is having even one person outside the project serve as a human tester.
Frankle’s team does internal A/B testing of model outputs (including image generation and natural language) where team members compare paired outputs without knowing which model produced which—a process analogous to RLHF preference ranking.
Practical tip: Start by writing just five examples with graded responses (e.g., 1-out-of-5, 3-out-of-5, 5-out-of-5). That’s enough to calibrate an LLM judge that can then evaluate new outputs at scale.
Databricks has released an agent evaluation product (going into public preview around November 2024) designed to help customers create meaningful eval sets of a few dozen examples in an afternoon. Frankle is skeptical of fully automated synthetic eval data but believes in tools that amplify human judgment.
The highest endorsement of Databricks’ products, he says, is that his own research team uses all of them: Spark for data processing, Delta tables for storage, Unity Catalog for data governance, MLflow for experiment tracking (which is free and open source), and Mosaic’s inference service.
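The five-graded-examples tip above can be sketched as a few-shot judge prompt plus a small parser for the judge's reply. This is one possible shape under stated assumptions, not the Databricks agent evaluation product. `GradedExample`, `build_judge_prompt`, `parse_score`, and the "Score: N" reply format are all hypothetical, and the call to an actual LLM is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GradedExample:
    """One hand-written calibration example (hypothetical structure)."""
    question: str
    response: str
    score: int       # 1 (poor) to 5 (excellent)
    rationale: str   # why a human gave this score


def build_judge_prompt(
    examples: Sequence[GradedExample], question: str, response: str
) -> str:
    """Assemble a few-shot prompt that shows the judge the graded examples."""
    parts = [
        "You are grading responses on a 1-5 scale.",
        "Reply with a line of the form 'Score: N' followed by a short reason.",
        "",
    ]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i}",
            f"Question: {ex.question}",
            f"Response: {ex.response}",
            f"Score: {ex.score}",
            f"Reason: {ex.rationale}",
            "",
        ]
    parts += ["Now grade this one.", f"Question: {question}", f"Response: {response}"]
    return "\n".join(parts)


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score from a reply, or None if the format is not followed."""
    match = re.search(r"Score:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None
```

The prompt string would then be sent to whichever model serves as the judge. Returning `None` for malformed replies keeps unparseable judgments visible so a human can review them instead of silently counting them.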
When Domain-Specific Models Make Sense
Frankle sees four main scenarios where company- or domain-specific models are justified:
Language gaps: Models are often undertrained on non-English languages (Japanese, Korean, Hindi). Companies in these regions—like SK Telecom in Korea and Ola in India—have built their own models out of necessity.
Fundamentally different tasks: Protein modeling, for example, requires architectures and data that general-purpose models like Llama aren’t designed for.
Speed and cost constraints: Replit worked with Databricks to build a code completion model optimized for speed because they needed to serve free-tier users affordably.
Cost-quality trade-off at scale: If a model gets heavy usage, the upfront investment in pre-training pays for itself quickly. You can either get the same quality cheaper or better quality at the same cost.
He frames the entire journey from prompting → fine-tuning → continued pre-training → pre-training as progressively moving the cost-quality curve upward, with each step requiring more upfront investment but delivering better economics at scale.
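The claim that upfront investment pays for itself under heavy usage is a break-even calculation, sketched below for the "same quality, cheaper per request" case. The function name and all numbers in the usage lines are hypothetical; nothing here reflects real Databricks pricing.

```python
import math


def break_even_requests(
    upfront_cost: float,
    cost_per_request_before: float,
    cost_per_request_after: float,
) -> int:
    """Number of requests after which the upfront investment has paid for itself.

    Assumes quality is held constant and the only benefit of the custom model
    is a lower serving cost per request.
    """
    savings_per_request = cost_per_request_before - cost_per_request_after
    if savings_per_request <= 0:
        raise ValueError("the new model must be cheaper per request to break even")
    return math.ceil(upfront_cost / savings_per_request)


if __name__ == "__main__":
    # Hypothetical figures: a $50,000 training run that cuts the serving cost
    # from 2 cents to half a cent per request.
    requests_needed = break_even_requests(50_000.0, 0.02, 0.005)
    print(f"Break-even after about {requests_needed:,} requests")
```

The fewer requests a model serves, the longer the upfront cost takes to amortize, which is why the later steps on the ladder make sense mainly for heavily used systems.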
Where AI Has Product-Market Fit Today
Frankle sees two broad patterns where AI works well now:
Low-stakes brainstorming: Cases where there are many right answers and perfection isn’t required—creative work, marketing, media, and tools like Glean that help surface information without needing to be comprehensive.
Easy-to-check outputs: Scenarios where it’s costly for a human to produce an answer but cheap to verify one. This maps to the computational complexity class NP—problems where solutions are polynomial-time checkable. Code copilots are the canonical example: it’s hard to write code from scratch but easy to glance at a suggestion and reject it.
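The produce-versus-verify asymmetry behind the NP analogy can be illustrated with subset sum, a classic NP problem. This example is an illustration chosen here, not one from the conversation: checking a proposed answer takes a single pass, while finding one by brute force means searching through subsets.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple


def verify(numbers: Sequence[int], indices: Sequence[int], target: int) -> bool:
    """Cheap check: indices are distinct, in range, and the chosen numbers hit the target."""
    return (
        len(set(indices)) == len(indices)
        and all(0 <= i < len(numbers) for i in indices)
        and sum(numbers[i] for i in indices) == target
    )


def solve(numbers: Sequence[int], target: int) -> Optional[Tuple[int, ...]]:
    """Expensive search: try every non-empty subset of indices (exponential in len(numbers))."""
    for size in range(1, len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return indices
    return None


if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]
    proposal = solve(numbers, 9)           # slow in general
    print(proposal, verify(numbers, proposal, 9))  # the check is fast
```

A code copilot plays the role of `solve`, and the developer glancing at the suggestion plays the role of `verify`: the human only has to do the cheap half.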
He notes that AI’s “fuzziness” is both its superpower and its killer. It can parse documents without regular expressions but won’t always produce exactly what you expect. This is a fundamental property of the current technology, and we should be realistic about it.
Even if the technology were frozen at GPT-4 levels for 20 years, enormous value would still be unlocked just from learning to use the tool better—similar to how digital technology kept delivering new capabilities from the 1950s onward.
Ethics, Uncertainty, and Human-AI Interaction
In high-stakes domains like healthcare and autonomous vehicles, the issue isn’t just accuracy—it’s that humans have good intuitions about when other humans will fail (e.g., recognizing bad drivers) but no such intuitions for AI failures.
AI failures are often inexplicable and unpredictable (e.g., a self-driving car misclassifying a bus as a cloud), making it harder for society to build the models of uncertainty needed to rationalize mistakes.
Frankle argues we should hold humans to higher standards too. His work on law enforcement facial recognition revealed that humans are actually quite bad at cross-race facial recognition, and having a human “check” an automated system doesn’t necessarily improve outcomes. There’s promising research on “super recognizers” and training programs.
He suggests we should evaluate human drivers the way we evaluate AI—with sensors and longitudinal performance data—rather than relying on a single closed-course test.
The key policy question is: when should we allow these systems and when should we not? It’s okay to be technologically conservative in high-stakes areas (law enforcement, medicine, autonomous vehicles) where mistakes cost lives, while being permissive in green-field applications.
Databricks’ Platform Strategy and the Competitive Landscape
Frankle focuses on customers rather than competitors. His customers are trying to go from zero to one with AI and get it into production—they need all the tools to work well together under one roof.
He doesn’t believe anyone has figured out the right product yet. Key unsolved problems include: helping people measure their AI systems, navigating the RAG vs. fine-tuning vs. prompting landscape, and getting models into production reliably.
His current focus is on giving customers as many options as possible with as little cost as possible—generating many model variants from a single training run and helping customers choose the best one through good evaluation.
Databricks is not a self-contained universe. They actively partner with best-in-class startups (e.g., Shutterstock for image data, SuperAnnotate and Surge for data annotation) and integrate them well. The approach is “yes, and”: use Databricks for what it’s great at and partner for the rest.
Acquisitions happen when partnership isn’t enough. Databricks acquired Lilac (a data annotation tool) in spring 2024 after using it on DBRX and deciding they wanted the team in-house. The Lilac team is now building Databricks’ eval creation product.
Open Source Models and Where to Focus
Frankle is grateful for Meta’s Llama models and the Allen Institute’s open work, calling them a gift to the ecosystem. Meta’s resources far exceed Databricks’, and it doesn’t make sense for Databricks to compete on raw model scale.
Instead, he’s focused on the “last mile” gaps: evaluation creation, navigating the fine-tuning/RAG landscape, and helping customers build AI systems from whatever data they have (clean or dirty, inputs without outputs, documents without code, etc.).
He’s particularly interested in compound AI systems and agents that connect multiple pieces—including Databricks’ own tools—into coherent workflows.
What He’s Changed His Mind On
Frankle tries to keep an open mind and acknowledges he’s wrong frequently. He initially dismissed GPT-3 as merely a bigger GPT-2 producing mediocre paragraphs and didn’t appreciate the bigger picture of what scaling would unlock.
Scaling worked better than he expected. He now advises PhD students that 99% of the time, the answer to “I wonder if X will work” is no—so design experiments that always produce useful scientific knowledge regardless of outcome.
He’s uncertain whether current scaling is still serving us well and whether the next generation of models will deliver breakthroughs or diminishing returns. He keeps an open mind.
On OpenAI’s o1: He’s excited but cautious. Breakthroughs are only identifiable in hindsight (Transformers were one among many architectures at the time; GANs were hot before diffusion models). The impressive part of o1 is as much engineering as science—OpenAI’s willingness to scale up ideas that were already floating around. Databricks was already working on related ideas before o1 and has them in production for customers.
On Anthropic’s computer use: He sees it as an important application and admires Anthropic’s willingness to experiment with new products creatively, even if not all attempts succeed. The field is in a phase of trying things out.
Academia’s Role in AI Research
Frankle believes academia should “zag” when industry “zigs.” Academia won’t compete on scale but can ask hard questions companies won’t ask publicly, build benchmarks and leaderboards, take risks on new technologies, and rigorously measure things.
He particularly values human-computer interaction (HCI) expertise for human-AI interaction challenges and says he’d pay top salaries for that skill on his team.
The key for academics is choosing problems carefully—talking to industry about real needs but not looking to industry for leadership on what to work on.
Data Labeling
Frankle is a consumer, not a builder, of data labeling services and defers to experts in the space. He works most closely with SuperAnnotate and Surge.
He emphasizes that high-quality data annotation requires deep expertise and trust—both in the artisanal judgment involved and in verifying that you’re actually getting human annotation rather than synthetic substitutes. It’s an incredibly hard operational challenge.
Policy and Trust
Frankle has spent time in the policy world and believes AI researchers owe it to society to participate in policy conversations—not out of self-interest but as members of society.
He’s wary of the trust problem: people can tell when someone is shilling for their company. Trust is built over years through integrity and transparency about biases.
He advocates for scientific honesty: putting all cards on the table, admitting uncertainty, and refusing to predict the future or promise superintelligence on a specific timeline. Being wrong is okay; being dishonest about uncertainty is not.
He supports a context-by-context approach to regulation: permissive in green-field applications, extraordinarily cautious where mistakes cost lives.
Quickfire
Underexplored AI application: He wouldn’t point to any specific domain—with 12,000 Databricks customers, everything is being explored. He’s personally excited about robotics and embodied systems, though he acknowledges it’s well-funded and its timeline is uncertain (like VR, which has been “about to happen” for decades).
$50B+ compute clusters: He declines to judge whether it’s worth it—that’s predicting the future. He’s glad someone is running the experiment and is curious about the result.
If he could parachute into any AI organization: He’d choose Databricks again. It’s where the rubber meets the road—he sees thousands of economically useful tasks and how AI intersects with them every day. By OpenAI’s own definition of AGI (better than most humans at most economically useful tasks), that story is unfolding at Databricks.
Where to learn more: databricks.com; the AI model Gateway for multi-cloud model access with monitoring and A/B testing; the agent platform and agent evaluation product (in public preview); and a new fine-tuning technique called “soft” fine-tuning for customers with incomplete data fragments. He’s easy to find online and actively looking for customers and collaborators to work with closely.