Gemini Co-Lead on World Models, RL's Next Domains & Continual Learning

Unsupervised Learning 59min 6 min #66

Summary

  • Oriol Vinyals, co-lead of Gemini alongside Noam Shazeer and Jeff Dean, sat down the day after Google I/O 2025 to discuss the research behind the announcements and where frontier AI is headed. The conversation covers world models, consumer agents, memory and continual learning, the future of RL and post-training, and how Google balances focused research with broad exploration.

World Models and the Path to Deeper Understanding

  • Google shipped Omni, a multimodal world model that can understand and generate both video and images and lets users interact with them through natural language. Oriol traces this back to the original Gemini vision: modeling language, vision, and video jointly from the start, rather than treating language as the only modality that matters.

    • He argues there hasn’t yet been a “GPT moment” for video and images: a point where training on visual data alone, without explicit language labels, yields the same depth of understanding that language models get from text. Reaching that point remains one of the core open challenges in machine learning.
    • The key unsolved problem is representation learning: how to extract concepts like gravity, causality, and physics from raw video without relying on language as a crutch. Current approaches depend on labeled datasets, which are far smaller than the total pool of visual data. Pure unsupervised concept extraction from images and video is still largely in the research stage.
  • What makes Omni a “world model” rather than just a good video generator is that it acts as a controllable renderer of the world: you can instruct it in language to change how a scene behaves, simulate movements, and predict what happens next. This has direct applications in robotics, self-driving cars, and simulation.

    • For robotics, there’s a two-way synergy: robot-collected data can improve world models, and world models can simulate training scenarios for robots without the cost and latency of the physical world. But a significant precision gap remains: fine motor control, tactile feedback, and exact force simulation are still far from solved.
  • Evaluating whether a world model truly understands physics is itself an open problem. If you ask a model about gravity in language, it can answer from its training text, so its answer says nothing about its visual understanding. Oriol suggests the real test is whether conceptual understanding can be decoded from the model’s internal representations without language as an intermediary, an area with some early research but no established methodology.
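Oriol’s proposed test, reading a concept out of a model’s internal representations without going through language, resembles what interpretability research calls a linear probe. The sketch below uses that swapped-in technique; it is not a method described in the conversation. The activations are synthetic stand-ins produced by a hypothetical `make_synthetic_activations` helper that plants a binary “object is falling” label along one direction, and a small logistic-regression probe is trained on them while the “model” stays frozen. A control run with no planted signal should land near chance, which is what gives a high probe accuracy its meaning.

```python
import math
import random


def make_synthetic_activations(n, dim, rng, signal=2.0):
    """Hypothetical stand-in for frozen world-model activations.

    Each example is a `dim`-dimensional vector of Gaussian noise. The binary
    concept label (e.g. "object is falling") shifts feature 0 by +signal or
    -signal, so the concept is linearly decodable whenever signal > 0.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label else -signal
        data.append((vec, label))
    return data


def _logit(w, b, vec):
    return sum(wi * xi for wi, xi in zip(w, vec)) + b


def train_linear_probe(data, dim, lr=0.05, epochs=30):
    """Fit a logistic-regression probe with plain SGD.

    Only the probe's parameters (w, b) are learned; the activations are
    treated as fixed outputs of a frozen model.
    """
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            z = max(-30.0, min(30.0, _logit(w, b, vec)))  # clip to avoid exp overflow
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label  # derivative of log-loss with respect to z
            w = [wi - lr * grad * xi for wi, xi in zip(w, vec)]
            b -= lr * grad
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of held-out examples whose concept label the probe recovers."""
    correct = sum((_logit(w, b, vec) > 0) == bool(label) for vec, label in data)
    return correct / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 8
    train = make_synthetic_activations(400, dim, rng)
    held_out = make_synthetic_activations(200, dim, rng)
    w, b = train_linear_probe(train, dim)
    print("probe accuracy:", probe_accuracy(w, b, held_out))

    # Control: the same pipeline with no planted concept should sit near 0.5.
    ctrl_train = make_synthetic_activations(400, dim, rng, signal=0.0)
    ctrl_test = make_synthetic_activations(200, dim, rng, signal=0.0)
    cw, cb = train_linear_probe(ctrl_train, dim)
    print("control accuracy:", probe_accuracy(cw, cb, ctrl_test))
```

A probe that beats its no-signal control only shows that a concept is linearly decodable from those activations. Turning that into an accepted evaluation of physical understanding in world models is, as the conversation notes, still an open problem.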

Consumer Agents and the Evolution of Scaffolding

  • Google shipped consumer agents in Spark that are a meaningful step up from earlier efforts like Project Mariner (2024). Oriol attributes the improvement to a pattern seen across AI: first you make the model good, then you build a system around it, then you optimize the system and the model jointly.

    • He sees a clear trajectory from specialized to general systems. Spark is initially built somewhat narrowly around scheduling and personal assistance, but the long-term bet is that a generic system plus a sufficiently intelligent model will handle specialization through instruction rather than bespoke engineering.
  • On the “bitter lesson” (the idea that scale and generality eventually beat hand-engineered structure), Oriol believes today’s complex agent scaffolding, such as multi-agent delegation and long-running orchestration, is itself a candidate for replacement by models that write their own scaffolding on the fly. He draws a parallel to reasoning models: the breakthrough wasn’t just that models could reason, but that they learned when to reason and for how long, based on task complexity.

  • For long-running agent reliability, he sees the answer as a combination of better scaffolding and better-trained weights. The model needs to be trained on distributions that include very long-horizon tasks, so that it learns to handle extended context rather than relying on prompt-induced generalization.

Memory and Continual Learning

  • Oriol breaks memory into two useful levels: working memory (the context window, already powerful at hundreds of thousands or millions of tokens) and episodic/long-term memory (a retrieval system over everything a user or agent has experienced).

    • The current practical approach is file-system-style non-parametric memory: agents write structured knowledge to files and directories, then retrieve it as needed. This works reasonably well today, but the model weights haven’t caught up: models aren’t yet trained to use such a system optimally.
    • He prefers this non-parametric approach to modifying weights per user, because serving one model with different weight sets for different users is very difficult in practice. Instead, the weights stay shared and each user gets their own knowledge base.
  • Continual learning is a major research focus. Oriol sees it as potentially paradigm-shifting, in the way reasoning was 18 months ago. He acknowledges that some researchers are leaving large labs to pursue it independently, but argues that staying connected to the frontier of LLM capability matters, because what frontier models can do opens up or closes off particular research directions.

    • He believes the right organizational model combines protected research time with tight integration into the core modeling effort. Google DeepMind attempts this balance by having Gemini as a focused, unifying force while still investing in longer-horizon bets.
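The file-system-style, non-parametric memory described at the start of this section can be sketched as a small per-user store. Everything here is an illustrative assumption rather than anything Google has described: the hypothetical `FileMemory` class, the `<root>/<user_id>/<topic>/<name>.json` layout, the naive keyword-overlap retrieval (a production system would more likely use embeddings), and the hypothetical `build_prompt` helper that places retrieved notes into the context window. What it does show is the split Oriol describes: one shared model and prompt template for everyone, with only the on-disk knowledge base differing per user.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class FileMemory:
    """Hypothetical per-user episodic memory kept as JSON files in directories.

    Layout (assumed): <root>/<user_id>/<topic>/<name>.json
    Model weights are never touched; only this knowledge base is per-user.
    """

    def __init__(self, root: Path, user_id: str):
        self.base = root / user_id
        self.base.mkdir(parents=True, exist_ok=True)

    def write(self, topic: str, name: str, text: str) -> Path:
        """Store one note under a topic directory, overwriting a same-named note."""
        folder = self.base / topic
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{name}.json"
        path.write_text(json.dumps({"topic": topic, "name": name, "text": text}))
        return path

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        """Return up to k note texts ranked by naive keyword overlap with the query."""
        terms = set(query.lower().split())
        scored = []
        for path in sorted(self.base.rglob("*.json")):
            note = json.loads(path.read_text())
            overlap = len(terms & set(note["text"].lower().split()))
            if overlap:
                scored.append((overlap, note["text"]))
        scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep path order
        return [text for _, text in scored[:k]]


def build_prompt(shared_instructions: str, memory: FileMemory, request: str) -> str:
    """Same instructions (and model) for every user; retrieved notes fill working memory."""
    notes = memory.retrieve(request)
    context = "\n".join(f"- {note}" for note in notes) or "- (none)"
    return f"{shared_instructions}\n\nRelevant memories:\n{context}\n\nUser: {request}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        alice = FileMemory(root, "alice")
        bob = FileMemory(root, "bob")
        alice.write("scheduling", "prefs", "alice prefers morning meetings")
        alice.write("travel", "trip", "flight to tokyo booked for june")
        bob.write("scheduling", "prefs", "bob prefers afternoon meetings")
        print(build_prompt("You are a helpful assistant.", alice, "book a flight"))
```

Keeping the weights shared and varying only the files is what makes this easy to serve. How well a model actually exploits such a store depends on training, which, per the conversation, has not caught up yet.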

Post-Training and RL: The Next Domains

  • Oriol describes post-training as still “total green field,” not because current models lack capability, but because the compute invested in post-training is still small relative to pre-training, and the field hasn’t cracked how to generate infinite training data the way games like Go do.

    • In Go, play quickly produces board states that have never occurred before, so the environment generates near-limitless complexity for free. For LLMs, the source of equivalent complexity isn’t clear, and data scarcity is the bottleneck. Cracking this recipe could be transformative.
  • He’s most excited not about domain-specific RL gains (coding, math) but about meta-capabilities: the ability to learn from experience efficiently, adapt to new tasks in context, and follow complex instructions in unfamiliar domains. These are the traits of general intelligence.

    • His favorite informal eval: give a model the instruction manual for a game it has never seen (like Civilization), then check whether it can learn to play and whether it improves as it plays. Current models struggle with this, especially for truly novel games.
  • On RL generalization: he’s seen reasoning trained primarily on math and coding generalize to unexpected domains (like tax questions), which surprised him. But he’s uncertain whether narrow training on hard problems is enough to induce general problem-solving, or whether broad distribution training is ultimately needed. His instinct is that broader training helps, but the generalization from narrow domains has been stronger than he expected.

Recursive Self-Improvement and Innovation

  • Oriol is actively working on self-improvement: models that can reprogram and improve themselves. He’s seen models demonstrate superhuman mechanistic understanding of how training works, but hasn’t yet seen them generate truly novel, high-level research ideas.
    • He considers the ability to innovate, especially in science, one of the hardest capabilities to reach and to evaluate. Innovation is inherently hard to verify, and what looks like genius is often the result of many attempts, with the successes selectively remembered.
    • He believes there are natural physical limits to recursive self-improvement: training requires energy, hardware, and time. Even with the perfect recipe, there are rate limits. He also notes a subtle point: at some point a model may become “too good” at certain tasks (like writing English), and the goalposts will keep moving.

Organizational Strategy and Compute

  • Google’s unique position comes from having its own hardware (TPUs), end-to-end revenue streams, and a unified Gemini effort that serves as a focused frontier capability center while leveraging the broader organization’s stability.

    • On selling compute to Anthropic: Oriol frames it as a strategic reinvestment decision. Compute goes to serving, training small models, and training frontier models; selling some of it creates revenue that can be reinvested, and the multi-pronged strategy reflects different timelines and investment horizons.
  • The advantage of having hardware and models under the same roof is the ability to co-evolve them. Oriol recalls early days (2013-2014) when he, Geoff Hinton, Jeff Dean, and Ilya Sutskever sat in a room deciding how many GPUs to put in servers, a decision with multi-year consequences. That tight feedback loop between research direction and infrastructure investment continues today and is a structural advantage.

Advice for Builders

  • For founders deciding whether to work at the model layer or build on top of it: Oriol emphasizes the value of evaluations and data. Even without building your own model, thinking carefully about how to measure progress on your specific problem is enormously valuable, and the result could become a standard eval that even large labs adopt.
    • He suggests that specializing the product (deeply understanding a specific domain, getting users, building critical mass) creates defensible value even if you don’t train weights. As models get better at continual learning and at using knowledge bases, building rich, specialized knowledge bases for particular applications may be a more scalable moat than trying to out-train the largest labs on base-model quality.