Self-Driving Expert Unpacks the Biggest Breakthroughs and Bottlenecks

Unsupervised Learning 1h13 8 min #32

Summary

  • Vincent Vanhoucke, a distinguished engineer at Waymo and former leader of Google’s robotics team, discusses how foundation models are transforming autonomous vehicles and robotics, what remains unsolved in both fields, and where the next breakthroughs will come from.
    • He joined Waymo after a personal accident led him to use the service extensively and experience its product firsthand, which he found “magical” and universally accessible compared to the research-focused robotics work he had been doing at Google DeepMind.
    • His perspective bridges two worlds: the mature, commercially deployed domain of autonomous driving and the still-emerging field of general-purpose robotics, offering a rare comparative view of where each stands.

How Foundation Models Are Changing Waymo’s Technology

  • Waymo’s core autonomous driving stack did not need to be rebuilt from scratch; instead, large language models (LLMs) and vision-language models (VLMs) are layered on top as “teacher models” that distill world knowledge into the onboard models running on the car.
    • The key contribution of these models is world knowledge: semantic understanding of scenes that the car’s own driving data may not have covered, such as recognizing police cars, emergency vehicles, or accident scenes in new cities where local variants differ from training data.
    • These models also bring scale: larger models pretrained on internet-scale text and visual data reason better, and their supervision improves every aspect of the driving problem without requiring architectural changes to the onboard system.
    • Not everything benefits from this approach: safety constraints, regulatory compliance, and behavioral guardrails are best expressed as explicit, verifiable rules applied outside the AI model, so that proposed driving plans can be checked against hard requirements before execution.
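
The teacher-to-onboard distillation described above can be illustrated with a generic knowledge-distillation loss. This is a minimal sketch, not Waymo's training code: the three-class setup, the logit values, and the temperature are assumptions made for illustration. It shows the usual idea of training a small student to match a large teacher's softened output distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical classes: [police_car, ambulance, ordinary_car]
teacher = [4.0, 1.0, 0.5]          # teacher is confident this is a police car
student_untrained = [0.0, 0.0, 0.0]
student_trained = [3.5, 1.0, 0.4]

print(distillation_loss(teacher, student_untrained))
print(distillation_loss(teacher, student_trained))
```

The loss is zero when the student reproduces the teacher's distribution exactly and shrinks as the student's scores approach the teacher's, which is what lets the onboard model absorb knowledge it never saw in its own driving data.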

The State of Autonomous Driving and the Long Tail Problem

  • Waymo operates commercial ride-hail services in cities like Phoenix and San Francisco, and its safety data shows it is already safer than the average human driver, with significantly fewer collisions and reported injuries.
    • The primary remaining challenge is scaling: when you drive millions of miles, rare and unusual events—the “long tail”—become common occurrences, and solving for these edge cases dominates the engineering effort.
    • Waymo addresses the long tail through extensive simulation and scenario synthesis: creating difficult situations that may never have been observed in real data, such as drunk or adversarial drivers, and validating models against them.
    • Specific environmental conditions like snow and fog remain unsolved, but these are described as matters of prioritization and effort rather than fundamental blockers.
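
The claim that rare events become common at scale is simple arithmetic, sketched below. The per-mile event rate is an illustrative assumption, not Waymo data, and the function names are hypothetical.

```python
import math

def prob_at_least_once(rate_per_mile, miles):
    """Probability of seeing an event at least once: 1 - (1 - rate)^miles."""
    # log1p/expm1 keep the computation accurate for very small rates.
    return -math.expm1(miles * math.log1p(-rate_per_mile))

def expected_events(rate_per_mile, miles):
    """Expected number of occurrences over the given mileage."""
    return rate_per_mile * miles

RATE = 1e-6  # assumed: a one-in-a-million-miles event

for miles in (10_000, 1_000_000, 10_000_000):
    print(miles, prob_at_least_once(RATE, miles), expected_events(RATE, miles))
```

Under this assumed rate, a single driver covering 10,000 miles has about a 1% chance of ever meeting the event, while a fleet covering 10 million miles should expect it roughly ten times and is nearly certain to see it at least once.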

World Models as the Next Potential Breakthrough

  • Vincent identifies physically realistic, controllable world models as the single technical advance that could again transform autonomous driving.
    • Current proto-world models like Sora or Veo can generate plausible video futures from a scene, but they lack controllability and true physical realism.
    • The key missing ingredient is causality: the ability to understand that a specific change to the input produces a specific change in the output, which is essential for counterfactual reasoning and reliable simulation.
    • If solved, such world models would act as a digital twin of the real world, allowing Waymo to simulate and train against the long tail of rare events far more effectively than current methods allow.
    • World models have emerged first in the video generation context because the stakes for physical inaccuracy are low; the current research trend is toward making them controllable and geometrically precise enough for functional uses like autonomous driving.

Waymo’s Sensor Strategy and the Redundancy Question

  • Waymo uses a three-sensor suite: cameras, lidars, and radars, chosen because their strengths and weaknesses are complementary and they provide orthogonal signals that can be cross-validated.
    • This contrasts with companies that started from L2 consumer driving (where cost constraints favor minimal sensors) and are trying to climb up to L4; Waymo deliberately chose to “over-sensorize” first, solve the hard problem, and then optimize for cost reduction with data-informed decisions.
    • Vincent argues the bar for L4 driving is above human level, not at human level, because the business and safety case requires superhuman performance; this means the need for sensor redundancy is unlikely to disappear.
    • The question of whether a camera-only suite can achieve superhuman driving remains open and will be answered empirically over the next few years as more data becomes available.
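
One way to picture cross-validating orthogonal sensor signals is a consensus check on a single quantity, such as the range to an object. The sketch below is a toy illustration under stated assumptions: the function name, the tolerance, and the readings are hypothetical, and real sensor fusion is far more involved.

```python
from statistics import median

def cross_validate_ranges(readings, tolerance_m=1.0):
    """Compare per-sensor range estimates (meters) for one object.

    readings maps a sensor name to a range, or None if the sensor got no return.
    Returns (consensus_range, suspects), where suspects lists sensors that are
    missing or disagree with the consensus by more than tolerance_m.
    """
    valid = {name: r for name, r in readings.items() if r is not None}
    if len(valid) < 2:
        # With fewer than two readings nothing can be cross-checked.
        return None, sorted(readings)
    consensus = median(valid.values())
    suspects = sorted(
        name for name, r in readings.items()
        if r is None or abs(r - consensus) > tolerance_m
    )
    return consensus, suspects

# Hypothetical readings: the camera estimate is off, lidar and radar agree.
print(cross_validate_ranges({"camera": 31.0, "lidar": 24.9, "radar": 25.2}))
# Hypothetical fog case: the camera gives no estimate, the others still agree.
print(cross_validate_ranges({"camera": None, "lidar": 24.9, "radar": 25.2}))
```

With three independent modalities, one degraded sensor can be outvoted; with a single modality there is nothing to check it against, which is the redundancy argument in miniature.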

Scaling and Expansion as the Next Milestones

  • Vincent notes that 2025 marks the 30th anniversary of the first transcontinental autonomous drive (1995, over 99% autonomy at 60+ mph), which at the time made people think the problem was nearly solved—yet it took 30 years to reach commercial deployment.
    • Waymo now has both technology validation (the system works) and user validation (people genuinely love the experience), so the remaining challenge is purely one of scaling and expansion.
    • The next major milestones will be geographic expansion: Waymo has begun data collection in Tokyo, its first international market and first left-side-of-the-road driving environment.
    • Entering a new city involves less model retraining and more evaluation and community trust-building: convincing regulators and local communities that the system is robust to local variations, setting up operational depots, and partnering with companies like Uber to accelerate deployment.

Robotics: Where It Stands Compared to Autonomous Driving

  • In robotics, the field is still chasing the nominal use case: getting a generalist robot to reliably perform basic tasks like picking up objects or making coffee. Autonomous driving has already solved its nominal case and moved on to the long tail.
    • Vincent does not consider the “1995 moment” to have arrived yet for general-purpose robotics, though he expects a convincing proof point within the next couple of years.
    • Current robots can generalize across visual inputs and object positions but struggle to generalize motion and skills: most demos show a robot doing one specific thing well, not adapting to truly novel tasks.
    • A commercially successful robot does not need to be general; a single-task robot that does one thing cheaply and dexterously could be a viable business, but the broader vision of a household robot that makes coffee, tidies rooms, and picks up clothes still requires fundamental breakthroughs.

How LLMs and VLMs Have Transformed Robotics

  • The biggest surprise was how quickly common sense knowledge from LLMs—knowing that a cup goes on a table, that a microwave is in the kitchen—could be turned into actionable plans for robots, even though language is imprecise.
    • This led to the insight that robot actions are just another language: not expressed in words but in body movements, and the same multimodal large model machinery can be applied to both.
    • Large multimodal models also solve a key perception bottleneck: robots can recognize entities (like a picture of Taylor Swift) they were never explicitly trained on, because that knowledge is embedded in the pretrained model.
    • The remaining bottleneck is acquiring motion and actuation data: the physical skills themselves, where the best data collection method is still an open question.

Approaches to Building General Robotics Models

  • Vincent expects the field to converge on a paradigm similar to LLMs: build a generalized backbone model that can be easily retargeted to specific tasks through prompting, fine-tuning, or test-time adaptation.
    • He segments current approaches into two camps:
      • Hardware-first: building the most capable humanoid robot, then trusting that intelligence can be layered on top.
      • Software-first: building a general intelligence model first, then retargeting it to whatever hardware platform is available.
    • His work on RT-X gave him confidence in the software-first approach because the bottleneck in robotics is data acquisition speed and scale, and putting an expensive, unreliable robot on the critical path of data collection severely limits progress.
    • Simulation has worked well for locomotion and navigation but poorly for manipulation, because the sim-to-reality gap for contact physics, diversity of experience, and realistic rendering is extremely costly to close; real-world data collection has been the faster path to date.

Data Acquisition and the Path to Causality

  • Vincent advocates for more research into human-robot interaction (HRI) for data acquisition, arguing it is the biggest bottleneck in robot learning and an underexplored research area.
    • Current strategies include kinesthetic teaching, teleoperation with gloves, and simulation, but he is most excited about third-person imitation: learning from watching videos of humans, which remains unsolved.
    • Learning from observation requires solving the same causality problem as world models: inferring that “if I do this, that happens” from passive observation.
    • He is cautiously optimistic that causality may emerge from proper data engineering and curation rather than requiring new architectures, noting that chain-of-thought reasoning in LLMs emerged from data and prompting rather than architectural changes.

Key Unanswered Questions for the Next Few Years

  • Motion generalization: Can robots generalize in the space of actions the way they have generalized in the space of perception?
  • Specialization vs. generality: Are there fundamental differences between robotics and other areas of AI that will require entirely new techniques, or will existing methods (like diffusion models for motion generation) continue to transfer?
  • Scaling laws: Early evidence suggests scaling laws apply to autonomous driving models with the same functional form as LLMs but different constants; whether these hold or hit a ceiling in robotics remains to be seen.
  • World models: Can current multimodal model architectures become good enough world models, or is a new architectural leap needed?
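
"Same functional form, different constants" can be made concrete with a power law, loss = a * compute^(-b), which is a straight line in log-log space. The sketch below is illustrative only: the data are synthetic, the two "domains" and their constants are invented, and the simplified form omits the irreducible-loss term that many published scaling-law fits include.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = (
        sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
        / sum((u - mean_x) ** 2 for u in log_x)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

compute = [1e6, 1e7, 1e8, 1e9]

# Synthetic curves with made-up constants for two hypothetical domains.
domain_a = [5.0 * c ** -0.30 for c in compute]
domain_b = [2.0 * c ** -0.10 for c in compute]

print(fit_power_law(compute, domain_a))  # recovers roughly (5.0, 0.30)
print(fit_power_law(compute, domain_b))  # recovers roughly (2.0, 0.10)
```

Both curves are fit by the same procedure and differ only in the recovered constants. A ceiling would show up as measured points bending away from the fitted line at larger compute.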

Reasoning, Test-Time Compute, and Broader Implications

  • Vincent was personally struck by the power of reasoning capabilities in modern models: he used Gemini’s deep research to resolve a physics question that had blocked a science fiction story he had been developing for 10 years, getting a complete answer in minutes.
    • He sees test-time compute and chain-of-thought reasoning as broadly applicable beyond math and coding, to any problem where generating a solution is hard but verifying it is relatively easy—including autonomous driving, where verifying that a plan meets safety constraints is easier than generating the plan.
    • He frames this as RL done right: bootstrap with large models and supervised learning, then use reinforcement learning as fine-tuning for expert-level reasoning, rather than trying to learn everything from scratch with RL.
    • He believes the most under-discussed implication of AI progress is its impact on education: the narrative focuses on cheating, but the real story is that interactive AI agents are powerful, engaging learning tools that could transform how people acquire knowledge.
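
The "hard to generate, easy to verify" pattern can be sketched as a sample-then-check loop. Everything here is a hypothetical stand-in: the Plan fields, the two constraints and their thresholds, and the random proposer that plays the role of a learned planner. The point is only that explicit rules outside the model filter its proposals before one is selected.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    speed_mps: float   # planned speed in meters per second
    min_gap_m: float   # closest approach to any other road user, in meters

# Hypothetical hard constraints, expressed as explicit rules outside the model.
SPEED_LIMIT_MPS = 13.4
MIN_GAP_M = 2.0

def is_safe(plan):
    """Cheap verification: check a plan against every hard rule."""
    return plan.speed_mps <= SPEED_LIMIT_MPS and plan.min_gap_m >= MIN_GAP_M

def propose_plans(n, rng):
    """Stand-in for a learned generator: samples n candidate plans."""
    return [Plan(rng.uniform(5.0, 20.0), rng.uniform(0.5, 5.0)) for _ in range(n)]

def best_verified_plan(candidates):
    """Keep only plans that pass verification, then prefer the fastest one."""
    safe = [p for p in candidates if is_safe(p)]
    return max(safe, key=lambda p: p.speed_mps) if safe else None

rng = random.Random(0)
plan = best_verified_plan(propose_plans(200, rng))
print(plan)
```

Spending more test-time compute here means sampling more candidates, which raises the chance that at least one good plan survives the check without making the check itself any harder.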

Quickfire and Personal Reflections

  • On hype: humanoid robotics is both overhyped (current capabilities don’t justify the investment levels) and underhyped (we can’t afford not to make them work, because failure would cause a “humanoid winter” that hurts all of robotics).
  • On model evaluation: he relies on leaderboard metrics rather than vibes, and tests models on real functional tasks rather than hypothetical prompts.
  • On the future of driving: he hopes for a future where people look back on human driving as crazy given the accident rates, though he acknowledges this may not happen in his lifetime.
  • On home robots: the bar for a mobile manipulator in the home is extremely high because any damage (like a nick on a wall) would be unacceptable; fixed workstation robots or near-home applications like last-meter delivery will arrive much sooner.
  • On under-discussed AI applications: he is excited about AI-designed plant-based cheese, citing a startup that uses AI to explore the design space for sustainable, non-animal dairy products, as an example of AI applied to an everyday product with massive potential impact.
    • He notes their blue cheese is already indistinguishable from cow-based cheese and is served in top restaurants.