Jeff Dean (Google’s Chief Scientist) and Noam Shazeer (co-inventor of the Transformer, Mixture of Experts, and other foundational LLM techniques) reflect on 25 years at Google, from early search infrastructure to leading the Gemini project at Google DeepMind. They discuss how hardware-software co-design, scaling laws, and algorithmic breakthroughs have driven progress, and share their vision for the future: modular “organic” models, inference-time scaling, automated chip design, and the possibility of rapid capability gains through feedback loops between AI and hardware improvement.
How they joined Google and how the company evolved
Jeff joined in 1999, when Google had about 25 people; Noam joined in 2000 after seeing a crayon chart of exponentially growing search queries at a job fair and deciding the company would be successful.
Early on, everyone knew everyone and what they were working on. As Google grew, that became impossible — first losing track of names, then projects — but maintaining a broad network lets you find the right person with “one level of indirection.”
Noam’s original plan was to make money at a startup, then work on AI independently. Instead, Google turned out to be the ideal place to do AI research, given its ambition to “organize the world’s information,” which inherently requires increasingly advanced AI.
Moore’s Law, hardware specialization, and the arithmetic-vs-data-movement tradeoff
Two decades ago, general-purpose CPU improvements gave you ~2x every 18 months for free. Recently, CPU scaling has slowed (fabrication improvements now take ~3 years instead of 2).
The rise of specialized ML accelerators (TPUs, ML-focused GPUs) has more than compensated: arithmetic is now extremely cheap, while moving data is comparatively expensive.
Deep learning took off because it can be expressed as matrix multiplications — N³ operations on N² bytes of data — which maps perfectly to hardware optimized for dense linear algebra.
If the tradeoff had been reversed (cheap data movement, expensive arithmetic), AI would likely have looked more like large lookup tables or memory-based systems rather than compute-heavy neural nets.
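The "N³ operations on N² bytes" point can be made concrete with a rough arithmetic-intensity calculation. This is a minimal sketch, not a hardware model: it assumes 2·N³ floating-point operations, 16-bit elements, and that each of the three matrices crosses the memory interface exactly once. The function name is hypothetical.

```python
def matmul_arithmetic_intensity(n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte moved for an (n x n) @ (n x n) matrix multiply.

    Counts 2 * n^3 operations (one multiply and one add per inner-product
    term) and assumes the two inputs and one output are each moved once.
    """
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_element
    return flops / bytes_moved


for n in (128, 1024, 8192):
    # Intensity grows linearly with n: bigger matrices do more math per byte.
    print(n, round(matmul_arithmetic_intensity(n), 1))
```

Under these assumptions the ratio is n/3, so doubling the matrix size doubles the arithmetic done per byte moved, which is why hardware with cheap arithmetic and expensive data movement favors large dense matrix multiplies.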
Early neural net work: 1990 through 2007
Jeff’s 1990 undergrad thesis implemented data parallelism and model parallelism for backpropagation on a 32-processor Hypercube. He was excited by the abstraction but naive about scale — real progress required ~1 million times more compute than was available then.
In 2007, Jeff worked with Franz Och’s machine translation team, which had won a DARPA contest but took 12 hours per sentence. Jeff built an in-memory compressed N-gram data structure (five-grams over 2 trillion tokens) that reduced translation time to ~100 milliseconds.
Noam built a spelling correction system in 2001 that ran in-memory on a single machine and could handle heavily butchered queries (e.g., “scrumbled uggs Bundict” → “scrambled eggs benedict”).
At the time, neither saw N-gram models as a path to AGI. The field was more excited about Bayesian networks. But Noam recognized language modeling as “the best problem in the world”: simple to state (predict the next word), infinite self-supervised training data, and AI-complete if solved well.
“Holy shit” moments and the scaling hypothesis
Early in the Google Brain era (~2012), the team trained an unsupervised model on 10 million YouTube frames across 2,000 machines (16,000 CPU cores). It spontaneously developed a cat-detecting neuron — never told what a cat was, but learned from seeing enough examples.
That model also achieved a 60% relative improvement on ImageNet’s 20,000-category challenge, advancing state of the art. It was ~50x larger than any previously trained neural net, and the results convinced Jeff that scaling was the right strategy.
Google’s mission: from information retrieval to information creation
Google’s mission to “organize the world’s information” is broader than retrieval — it includes creating new information from user guidance (e.g., drafting a letter to your vet, summarizing a video, synthesizing answers from 100 web pages).
Multimodal capabilities (text, audio, video, lidar, genomic data) expand what “information” means. Jeff sees this as a trillion-dollar opportunity, but the real value is when these systems can do things — write code, solve problems — not just retrieve.
A key frontier is universal language access: making any content available and usable in any of the world’s thousands of languages.
Long context: merging search with in-context learning
Google Search has the entire internet in its index but does shallow matching. Language models have limited context windows but can “think” deeply about what they see (in-context learning).
Current models handle millions of tokens of context (hundreds of PDFs, 50 research papers, hours of video). The goal is to let models attend to trillions of tokens — the entire internet, all your personal emails/documents/photos.
The challenge: naive attention is quadratic. You can’t scale it to trillions of tokens without algorithmic approximations.
Model parameters are memory-efficient for memorizing facts (~1 fact per parameter), but context tokens are expensive (kilobytes to megabytes per token across all layers’ keys and values). Innovation is needed to minimize this and find better ways to access relevant information.
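A rough sketch of the two costs mentioned above: the per-token memory of cached keys and values, and the quadratic growth of naive attention. The model configuration below (64 layers, 8 key/value heads, head dimension 128, 16-bit values) is hypothetical and chosen only for illustration; it does not describe any real model.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of cached keys and values that one context token occupies."""
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def causal_attention_scores(context_tokens: int) -> int:
    """Query-key scores naive causal attention computes, per head per layer."""
    # Token i attends to tokens 1..i, so the total is n * (n + 1) / 2.
    return context_tokens * (context_tokens + 1) // 2


per_token = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
print(per_token)                       # 262144 bytes, i.e. 256 KiB per token
print(causal_attention_scores(1_000))  # 500500
print(causal_attention_scores(10_000)) # 50005000, about 100x for 10x the tokens
```

With this made-up configuration, a million-token context would need roughly 262 GB of cached keys and values, which is the kind of cost that motivates approximations at larger scales.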
Internal coding models and developer productivity
Google has further trained Gemini on its internal monorepo for internal developer use. Sundar Pichai has said 25% of characters checked into Google’s codebase are now AI-generated (with human oversight).
Jeff imagines a near-future workflow: a researcher says “explore this idea from this paper but make it convolutional,” the system generates experimental code, the researcher reviews it and says “run that.” This could be plausible within 1–2 years.
External experiments already show models implementing a full SQL database in C from a paragraph-long prompt — parser, tokenizer, query planner, on-disk storage — demonstrating a major productivity boost.
Autonomous software engineers and managing parallel work
Future coding models may run for 10+ minutes, pause to ask clarifying questions (“video or just images?”), then continue. Managing workflows with many background AI tasks will require new interfaces — analogous to moving from 1930s ticket trading to modern electronic trading.
Jeff draws a parallel between parallelizing human researchers and parallelizing machines. The coordination challenges are similar.
If the global AI research community grows 10x or more (from ~10,000 to ~100,000+ researchers), and each researcher can run experiments faster, you could see breakthroughs on the scale of the Transformer happening much more frequently, potentially daily.
Algorithmic progress and the feedback loop with hardware
Model improvements generation over generation come not just from more compute but from major algorithmic advances — architecture changes, training data mix improvements, and better techniques that make models better per flop.
If AI systems can automate the exploration of architectural ideas (vetting 1,000 ideas at small scale, scaling up promising ones), the pace of algorithmic progress could accelerate dramatically.
The bottleneck is that the largest-scale experiments are still N=1: you put brilliant people in a room and stare at the results. More hardware helps, but doesn’t eliminate this.
Jeff is excited about using AI to dramatically speed up chip design. Current process: ~18 months to design a chip (150 people), then ~4 months to fab. If design could be automated to weeks, the fab time (3–5 months at leading nodes) becomes the dominant bottleneck — and that’s already comparable to training run lengths.
This creates a feedback loop: better AI → better/faster chip design → more compute → better AI. Combined with algorithmic progress, this could compress the timeline between model generations from years to months.
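The chip-cycle arithmetic above can be worked through directly. This is only an illustrative sketch: the 18-month design and 4-month fab figures come from the discussion, while treating "weeks" of automated design as about 1 month is an assumption made here.

```python
def cycle_summary(design_months: float, fab_months: float) -> tuple:
    """Return (total cycle length in months, fraction of the cycle spent in fab)."""
    total = design_months + fab_months
    return total, fab_months / total


today = cycle_summary(design_months=18, fab_months=4)
# Assumption for illustration: automated design takes about 1 month.
automated = cycle_summary(design_months=1, fab_months=4)

print(today[0], round(today[1], 2))          # 22 0.18
print(automated[0], round(automated[1], 2))  # 5 0.8
```

Under that assumption the cycle shrinks from 22 months to 5, and fabrication goes from under a fifth of the cycle to most of it.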
Inference-time scaling
At ~10⁻¹⁸ dollars per operation, even a large language model doing a trillion operations per token costs about 10⁻⁶ dollars per token, or roughly 1 million tokens per dollar. A paperback book, by comparison, delivers ~10,000 tokens per dollar. Customer support, software engineers, doctors, and lawyers are 10,000–1,000,000x more expensive than the model per token equivalent.
This means there’s enormous headroom to spend more compute at inference time to get smarter answers. “Think harder” at inference is a major near-term source of capability gains.
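The headroom argument is plain arithmetic on the figures quoted above. The sketch below uses exact fractions to avoid floating-point noise; the variable and function names are hypothetical.

```python
from fractions import Fraction

OPS_PER_TOKEN = 10**12                # "a trillion operations per token"
DOLLARS_PER_OP = Fraction(1, 10**18)  # ~10^-18 dollars per operation
PAPERBACK_TOKENS_PER_DOLLAR = 10**4   # paperback figure quoted above

dollars_per_token = OPS_PER_TOKEN * DOLLARS_PER_OP
llm_tokens_per_dollar = 1 / dollars_per_token


def human_tokens_per_dollar(times_more_expensive: int) -> Fraction:
    """Tokens per dollar for a source costing N times more per token than the model."""
    return llm_tokens_per_dollar / times_more_expensive


print(int(llm_tokens_per_dollar))                                # 1000000
print(int(llm_tokens_per_dollar / PAPERBACK_TOKENS_PER_DOLLAR))  # 100
print(int(human_tokens_per_dollar(10_000)))                      # 100
print(int(human_tokens_per_dollar(1_000_000)))                   # 1
```

On these numbers the model is about 100x cheaper per token than a paperback, and the human professionals at the two ends of the quoted range work out to roughly 100 and 1 tokens per dollar.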
Techniques include: search/exploration of multiple solution paths, iterative information gathering, and “drafter” models (a small model proposes 4 tokens, the big model verifies them in parallel, advancing past agreed-upon tokens — turning sequential decode into parallel verification).
Inference compute can be latency-sensitive (user waiting) or asynchronous (background tasks like Deep Research, which produces an 8-page report with 50 citations in ~2 minutes). The asynchronous category will grow and requires new UI patterns for managing multiple background tasks that may need user input mid-execution.
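The "drafter" technique mentioned above can be sketched as greedy speculative decoding. This is a toy illustration under stated assumptions, not Google's implementation: both "models" are hypothetical deterministic functions from a token list to the next token, and the parallel verification pass is simulated with a loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[int]:
    """Return the tokens gained from one draft-then-verify round."""
    # 1. The small drafter proposes k tokens sequentially (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The big model checks every proposed position. A real system would
    #    score all positions in one batched forward pass; here it is a loop.
    accepted: List[int] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if expected != proposed[i]:
            accepted.append(expected)  # keep the big model's token and stop
            return accepted
        accepted.append(proposed[i])

    # 3. All k drafts agreed, so the same pass also yields one extra token.
    accepted.append(target(prefix + proposed))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             num_tokens: int, k: int = 4) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < num_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + num_tokens]


def greedy_decode(prefix: List[int], target: NextToken, num_tokens: int) -> List[int]:
    """Reference: ordinary one-token-at-a-time decoding with the big model."""
    out = list(prefix)
    for _ in range(num_tokens):
        out.append(target(out))
    return out


# Hypothetical toy models: the drafter agrees with the target most of the time.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 7 + 3) % 11


def toy_draft(tokens: List[int]) -> int:
    guess = toy_target(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 11


print(generate([1], toy_draft, toy_target, 20) == greedy_decode([1], toy_target, 20))
```

The output matches plain greedy decoding with the big model; the gain is that each round can advance several tokens for one verification pass whenever the drafter guesses correctly.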
Multi-data center training and debugging at scale
Google already trains across multiple metro areas with high-bandwidth connections. As long as gradient accumulation and parameter sync happen within one step time (a few seconds for large models), the added latency (~50ms) doesn’t matter — only bandwidth does.
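A back-of-the-envelope sketch of why bandwidth, not latency, is the constraint. All inputs are hypothetical (a 100-billion-parameter model, 2-byte gradients, a 3-second step), and the calculation ignores sharding, compression, and overlap of communication with compute, so it is an upper-bound style illustration only.

```python
def required_bandwidth_gbps(num_params: float, bytes_per_param: int,
                            step_seconds: float) -> float:
    """Gigabits per second needed to ship one full gradient copy within one step."""
    bits = num_params * bytes_per_param * 8
    return bits / step_seconds / 1e9


def latency_fraction(latency_seconds: float, step_seconds: float) -> float:
    """Share of one step time consumed by a single one-way network delay."""
    return latency_seconds / step_seconds


# Hypothetical numbers, chosen only to show the orders of magnitude.
print(round(required_bandwidth_gbps(100e9, 2, 3.0), 1))  # 533.3 Gbit/s
print(round(100 * latency_fraction(0.050, 3.0), 1))      # 1.7 percent of a step
```

With these made-up inputs, a 50 ms delay is under 2% of a step, while moving the gradients on time needs hundreds of gigabits per second.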
Early Brain work used asynchronous training on CPUs (each copy computes locally, sends gradients to a central server). This worked but made experiments non-reproducible. Moving to synchronous training on TPU pods was a major quality-of-life improvement.
At very large scale, some asynchrony may return because it enables scaling, but the goal is to make it debuggable (e.g., by logging the sequence of operations for replay).
Debugging at scale: small-scale experiments (1–2 hours) let researchers vet ideas quickly. But ~50% of improvements that work individually don’t stack well together when combined — unexpected interactions between components are common. Managing codebase complexity while integrating improvements is an ongoing challenge.
Fast takeoff, alignment, and safeguards
Jeff and Noam see a plausible scenario where feedback loops between AI-improved algorithms and AI-improved hardware lead to rapid capability gains — going from “pretty good ML researcher” to “superhuman intelligence” in a few generations (each generation currently ~2 years, potentially compressing to months).
The key capability jump: from reliably solving 5–10 step problems 80% of the time, to solving 100–1,000 step problems 90% of the time.
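One way to read the capability jump above is in terms of per-step reliability. The sketch below assumes every step succeeds independently with the same probability, which is a simplification introduced here rather than a claim from the discussion.

```python
def per_step_success(overall: float, steps: int) -> float:
    """Per-step success rate needed for `steps` independent steps to all succeed
    with probability `overall`."""
    return overall ** (1 / steps)


today = per_step_success(0.80, 10)       # about 0.9779
future = per_step_success(0.90, 1_000)   # about 0.999895

print(round(today, 4), round(future, 6))
# Ratio of per-step error rates: roughly 200x fewer mistakes per step.
print(round((1 - today) / (1 - future)))
```

Under the independence assumption, going from 10-step problems at 80% to 1,000-step problems at 90% means cutting the per-step error rate from about 2% to about 0.01%.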
Jeff’s position is “somewhere in the middle” between “AI will overwhelm us” and “don’t worry at all.” He advocates for shaping AI deployment — steering it toward beneficial areas (education, healthcare) and away from harmful uses (misinformation, automated hacking) through both policy and technical safeguards.
On alignment: analyzing text is easier than generating it, so language models can be used to analyze and flag problematic output. Exposing capabilities through APIs provides a control layer for understanding and bounding use.
Before letting Gemini N write AI research code, Jeff wants human oversight of the exploration process — the AI explores and produces results, but humans decide what gets incorporated. He’s against full self-improvement without human review.
Noam notes that if you can distill a capable model into a smaller, efficient form, you can control what gets deployed. Distillation is a key tool for safety and accessibility.
Fun times at Google and the micro-kitchen culture
Jeff’s most nostalgic period: the first ~5 years, when he was one of a handful of people on search/crawling/indexing, traffic was growing exponentially, and you knew everyone. Equally exciting now: working on the Gemini team and seeing model capabilities advance dramatically.
The Gradient Canopy building (formerly Charleston East) has a micro-kitchen area with ~50 desks, an espresso machine, and constant foot traffic. It creates spontaneous face-to-face idea exchanges. There are ~120 Gemini-related chat rooms for distributed team members worldwide.
World compute demand in 2030
Inference compute demand will grow from multiple factors: (1) each request becoming 10–1,000x more computationally intensive via inference scaling, (2) global adoption going from ~10–20% of computer users to ~100%, (3) models getting larger.
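The factors above multiply, which a short Fermi-style sketch makes explicit. The ranges for per-request intensity and adoption come from the text; no number is given for model growth, so it is left at 1x here as a placeholder assumption.

```python
def demand_multiplier(per_request: float, adoption_now: float,
                      adoption_future: float = 1.0,
                      model_growth: float = 1.0) -> float:
    """Combined growth in inference compute from the three factors listed above."""
    return per_request * (adoption_future / adoption_now) * model_growth


low = demand_multiplier(per_request=10, adoption_now=0.20)
high = demand_multiplier(per_request=1_000, adoption_now=0.10)

print(round(low), round(high))  # 50 10000
```

Even before accounting for larger models, the quoted ranges imply something like a 50x to 10,000x increase in inference compute demand.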
Jeff’s Fermi estimate: if every person had a personal AI assistant (in glasses, earpiece, etc.) that got smarter the more you spent on compute, and AI made engineers 10–1,000,000x more productive, people would spend significant fractions of income on it. World GDP itself could grow 100x due to artificial engineers solving energy, carbon, and resource constraints.
The sun outputs ~10²⁶ watts. If even a fraction of that were directed toward AI compute, the scale would be astronomical.
Google is investing heavily in capital expenditure (data centers, custom hardware) to meet this demand, though Jeff won’t comment on specific future plans.
Modular, organic models and continual learning
Noam envisions moving beyond the regular structure of current Mixture of Experts (where all experts are the same size and merge back quickly) to a more organic architecture:
Different specialized portions of the model, good at different things, with variable computational costs (factors of 100–1,000x difference).
Paths that branch and don’t merge for many layers (e.g., math reasoning stays separate from image understanding).
Modules that can be developed independently by different teams — e.g., a team focused on Southeast Asian languages trains a module that plugs into the larger model.
This enables continual learning: upgrade one module without retraining everything.
Jeff draws inspiration from biological brains: dense local connections (within a chip), fewer connections to nearby chips, even fewer across pods, and minimal communication across metro areas — each level sending only the most salient representations.
The system would adapt its connectivity to the hardware organically, rather than being hand-specified.
You could still distill from this “organic blob” into efficient, regular models for specific tasks whenever needed (daily, hourly, etc.).
Implications of the modular “blob” paradigm
Instead of spinning up N copies of a model for N engineers, you activate different sub-patterns within one giant model: 10 engineers’ worth of output versus 100 engineers’ worth is a different activation pattern, not more instances.
Compute per inference could vary by factors of 10,000–1,000,000 depending on problem difficulty.
This requires infrastructure that can hold the entire model in memory across a pod/cluster (Google’s TPU pods are well-suited). Companies without data center-scale infrastructure would struggle to train such models.
Experts could vary in how often they’re called — frequently-used experts get replicated for load balancing; rarely-used ones (e.g., Tahitian dance) could page out to slower memory.
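The replicate-or-page-out idea can be sketched as a simple placement policy. This is a hypothetical policy written for illustration, not a description of any production system: the function name, the threshold, the replica budget, and every expert name except the Tahitian dance example are made up.

```python
from typing import Dict


def plan_expert_placement(call_counts: Dict[str, int],
                          replica_budget: int,
                          page_out_below: float) -> Dict[str, int]:
    """Map each expert to a replica count; 0 means paged out to slower memory."""
    total = sum(call_counts.values())
    plan: Dict[str, int] = {}
    for name, count in call_counts.items():
        if total == 0 or count / total < page_out_below:
            plan[name] = 0  # rarely used: keep it in slower memory
        else:
            # Replicas roughly proportional to traffic, with at least one copy.
            plan[name] = max(1, round(count * replica_budget / total))
    return plan


calls = {"python_code": 700, "french": 250, "chess": 49, "tahitian_dance": 1}
print(plan_expert_placement(calls, replica_budget=20, page_out_below=0.005))
# {'python_code': 14, 'french': 5, 'chess': 1, 'tahitian_dance': 0}
```

Heavily used experts get more copies to spread load, while an expert that handles a tiny share of requests stays out of fast memory until it is needed.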
Google could have one massive base model with different modules for different products (Search, Gmail, YouTube) and different access levels (internal-only modules trained on sensitive data, company-specific modules for cloud customers).
What’s missing: distillation and training objectives
Distillation needs to be much faster. The ideal: instantly distill from a giant blob onto a phone. Current techniques are too slow.
Current training (next-token prediction) may not extract maximal value from every token. Noam suggests:
Making the model work harder on certain tokens (e.g., do more computation when it reaches “the answer is”).
Hiding parts of the input and forcing inference from partial information (analogous to vision techniques like dropout, masking, and data augmentation).
Multiple passes over data with noise/dropout to prevent overfitting — humans have seen ~1 billion tokens and are very capable, so there’s room to extract more from existing data.
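The "hide part of the input" idea resembles masked-token training as used in BERT-style models, so the sketch below uses that well-known technique as a stand-in; it is not Noam's specific proposal. The mask symbol, probability, and function name are assumptions for illustration.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"  # hypothetical placeholder symbol


def mask_tokens(tokens: List[str], mask_prob: float,
                rng: random.Random) -> Tuple[List[str], List[Optional[str]]]:
    """Hide a random subset of tokens; the hidden tokens become prediction targets."""
    inputs: List[str] = []
    targets: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)   # the model must infer this token
        else:
            inputs.append(tok)
            targets.append(None)  # no loss at visible positions
    return inputs, targets


sentence = "the answer is forty two".split()
# A different seed per pass gives a different view of the same data, which is
# one way repeated epochs can avoid showing the model identical examples.
for epoch in range(3):
    print(mask_tokens(sentence, mask_prob=0.3, rng=random.Random(epoch)))
```

Each pass over the same sentence hides a different subset of tokens, so the model has to reconstruct them from partial information instead of seeing an identical example again.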
Humans learn differently from LLMs: they take actions and observe consequences, do thought experiments, and explore without external data (e.g., chess self-play, Einstein’s thought experiments). Models that can act and observe results during learning would be far more sample-efficient.
Noam’s Gato-like vision: a model that observes, takes actions, and learns from the corresponding results.
Open research: pros, cons, and Google’s evolving approach
Noam co-authored the Transformer paper, published in 2017; it created hundreds of billions in market value across the industry. In retrospect, he’s glad it got out: the pie isn’t fixed, and the overall growth in GDP, health, and wealth from AI benefits everyone.
Google’s publication strategy has evolved: some things are published immediately, some are productized first and published later (e.g., Pixel camera techniques like Night Sight), and some critical advances are kept internal.
Jeff values conferences like NeurIPS (15,000 attendees) for the free exchange of ideas, and Google continues to publish many papers there.
Noam notes that Google had many insights early (Meena, an internal chatbot Googlers used during the pandemic; seq2seq; BERT; Transformers) but was slower to release a public chatbot because of concerns about hallucination/factuality and safety. In retrospect, they underestimated how useful people would find chatbots for tasks you wouldn’t ask a search engine (drafting, summarizing, brainstorming). The path they took — waiting until the models were quite capable — wasn’t bad, and Gemini is now a top model.
Career longevity and breadth
Jeff’s approach: pay attention to evolving research landscapes, talk to colleagues, dive into new areas (chip design, healthcare), and work with small groups of people with complementary expertise. Their expertise rubs off on you, expanding your tool belt.
Noam emphasizes humility: being willing to drop your own idea as soon as you see something better. He also notes the importance of incentive structures: Google Brain used a bottom-up “UBI” chip allocation system (everyone got credits and could pool them), which encouraged flexibility. Gemini has been more top-down, which encourages collaboration but can create incentives to overstate how well your project is working. The right balance of top-down and bottom-up incentivizes both collaboration and flexibility.