Lukasz Kaiser, co-author of the Transformer paper (“Attention Is All You Need”) and former researcher at Google and OpenAI, discusses the current state and future of AI research — from whether Transformers are enough to reach true generalization, to the coding agent revolution, to the open vs. closed source landscape, to why he believes this is still the most exciting time to be an ML researcher.
Generalization: Is Reasoning Enough?
The central open question is whether current Transformer-based models with reasoning and RL can achieve human-like generalization, or whether a fundamentally different approach is needed.
Models today are incredibly capable — they solve research-level math, write production code, and reason about hard problems — but they learn very differently from humans.
The “exhaust all options” problem: LLMs tend to learn concepts only after seeing massive amounts of surface-level data, whereas humans form concepts from very few examples and can make creative leaps.
There is a growing “whiff in the air” among researchers that something beyond Transformers could generalize better. Every time people try to pin it down, Transformers catch up, yet the case for alternatives has strengthened even as Transformers have improved.
Physical world as a key test case: Self-driving cars still struggle with novel situations like construction zones despite millions of miles of data, suggesting current architectures hit a generalization wall in embodied settings where data doesn’t scale the way text does.
What Comes After Transformers?
It’s unclear whether the next breakthrough will come from tweaking architecture, data, loss functions, optimization, or all of them together.
Attention will likely survive in some form, but recurrence may also return — reasoning already brings a form of recurrence since the same weights generate each new token.
Small post-Transformer models like TRM and HRM perform well on puzzles like Sudoku and ARC-AGI by adding recurrence and architectural tweaks, but it’s unclear whether these ideas scale to language.
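The remark that reasoning already brings a form of recurrence can be illustrated with a toy decoding loop. This is only a sketch: `next_token` is a hypothetical stand-in for a model's forward pass, and the "weights" are three made-up integers, not anything from the conversation. The point is that the same fixed weights are applied at every step, with each output fed back in as input.

```python
def next_token(weights, context):
    """Toy stand-in for a forward pass: a fixed function of the context.

    Hypothetical rule: weighted sum of the last two tokens, modulo a small
    vocabulary size. A real model would run attention over the whole context.
    """
    a, b, vocab_size = weights
    return (a * context[-1] + b * context[-2]) % vocab_size


def generate(weights, prompt, steps):
    """Autoregressive loop: the same weights are reused at every step."""
    context = list(prompt)
    for _ in range(steps):
        # The token just produced becomes part of the input to the next step,
        # which is what makes the process recurrent in effect.
        context.append(next_token(weights, context))
    return context


WEIGHTS = (1, 1, 10)  # never updated during generation
tokens = generate(WEIGHTS, [0, 1], 6)
print(tokens)  # [0, 1, 1, 2, 3, 5, 8, 3]
```

Longer reasoning traces correspond to more iterations of this loop, so more computation is spent with no change to the weights.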
The agent revolution: Coding agents like Codex have fundamentally changed how Kaiser works — he estimates a 5–10x productivity increase, and more importantly, the ability to work on multiple things in parallel while maintaining mental clarity about the big picture.
He no longer reads most of the code agents produce, but maintains full conceptual control over losses, batch structure, and what the model is actually doing — which he says makes him sharper, not less sharp.
Are We Close to an AI Research Intern?
Models feel close to an intern level but still require careful supervision — they can add unrequested losses or make trivial tweaks when given open-ended goals like “make a better model.”
The long-context problem is being solved with hacks that work: Rather than architectural solutions, agents use tools like RAG, file systems, and compaction (summarizing context to stay within limits), trained with RL. Kaiser calls these “band-aids” but acknowledges they work remarkably well.
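Compaction, as described above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular agent implements it: `count_tokens` uses a whitespace word count instead of a real tokenizer, and `summarize` is a hypothetical placeholder that truncates each message where a real agent would call a model.

```python
def count_tokens(text):
    """Crude token count: whitespace-separated words (assumption)."""
    return len(text.split())


def summarize(messages):
    """Hypothetical placeholder summarizer: keep the first 5 words of each message."""
    return "SUMMARY: " + " | ".join(" ".join(m.split()[:5]) for m in messages)


def compact(history, budget, keep_recent=2):
    """Replace older messages with one summary when the history exceeds the budget.

    This is a single pass; it does not guarantee the result fits the budget.
    """
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent


history = ["a b c d e f g h", "i j k l m n o p", "q r s", "t u"]
compacted = compact(history, budget=20)
print(compacted)
# ['SUMMARY: a b c d e | i j k l m', 'q r s', 't u']
```

The most recent messages are kept verbatim because they usually matter most for the next step; only older context is lossy.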
The jump in coding agent capability around Christmas 2025 was significant but hard to attribute to any single change — it involved a combination of better harnesses, post-training improvements, and new pre-trained models.
The meta-research problem: For an agent to become a true researcher, it would need to learn from weeks-long research cycles, but current RL requires running every rollout to completion, which makes such long-horizon training impractical. Humans, by contrast, manage to do research over years without having completed thousands of prior research projects.
RL Beyond Verifiable Domains
Progress on non-verifiable tasks (law, medicine, creative work) is real but relies on finding proxy verifiers — rhyme schemes for poetry, rubrics for legal reasoning, human preference clicks for image beauty.
Verifiability is a spectrum, not a binary: Even math isn’t fully verifiable unless formalized in something like Lean; coding is more verifiable but front-end coding is less so.
The risk of the current RL approach is that models can nail every measurable benchmark while still lacking taste or deeper understanding — you can verify many things and still have “no taste.”
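A toy proxy verifier makes the “no taste” risk concrete. The sketch below scores a poem purely on whether its line endings follow a rhyme scheme. Everything in it is an illustrative assumption: the function names are hypothetical, and matching the last two letters is a crude substitute for real phonetic rhyme detection.

```python
def ending(line, n=2):
    """Last n letters of the final word, lowercased: a crude rhyme key."""
    last_word = line.strip().split()[-1].lower()
    letters = "".join(ch for ch in last_word if ch.isalpha())
    return letters[-n:]


def rhyme_reward(poem, scheme="AABB"):
    """Fraction of line pairs sharing a scheme letter whose endings match."""
    lines = [l for l in poem.splitlines() if l.strip()]
    if len(lines) != len(scheme):
        return 0.0
    pairs = [
        (i, j)
        for i in range(len(scheme))
        for j in range(i + 1, len(scheme))
        if scheme[i] == scheme[j]
    ]
    if not pairs:
        return 0.0
    hits = sum(ending(lines[i]) == ending(lines[j]) for i, j in pairs)
    return hits / len(pairs)


doggerel = "cat cat cat cat\nhat hat hat hat\ndog dog dog dog\nlog log log log"
print(rhyme_reward(doggerel))  # 1.0
```

The degenerate poem earns a perfect score, which is the failure mode in miniature: the verifier measures the rhyme scheme and nothing else.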
Generalization across domains does exist but is jagged: Models trained on math RL show improvement in unrelated areas like law, but they fail to generalize in seemingly obvious ways (e.g., from one branch of math to another), revealing a fundamentally different and alien form of generalization.
Application Companies and the Model Landscape
Bigger pre-trained models make everything easier — fewer sharp edges, better RL outcomes, easier fine-tuning — and this trend has continued despite predictions that small models would dominate.
Hardware accessibility is transformative: A single consumer GPU (RTX 5090) delivers roughly the compute of five machines used in the original Transformer research. This means individual researchers or small labs can now run meaningful experiments, including potentially simulating years of human-like learning in days.
The ability to run brain-scale compute on a desktop (hundreds to thousands of dollars) could enable radical research ideas to be tested without needing a major lab — though scaling up still requires lab-level resources.
Multimodal Models: Still Missing Something
Current multimodal approaches still predict pixels or patches autoregressively, which feels fundamentally wrong compared to how humans process sensory input in parallel at every moment.
The sequential bottleneck: Transformers split images into patches and process them sequentially, making it impossible to absorb a high-resolution image every millisecond the way biological systems do.
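A quick calculation shows how fast the patch count grows with resolution. The sketch assumes ViT-style square patches of 16×16 pixels, a common choice that is not stated in the conversation; `num_patches` is a hypothetical helper.

```python
def num_patches(height, width, patch=16):
    """Number of non-overlapping square patches covering an image."""
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    return (height // patch) * (width // patch)


for side in (224, 1024, 4096):
    print(side, num_patches(side, side))
# 224 196
# 1024 4096
# 4096 65536
```

Under these assumptions a single 4096×4096 frame already yields tens of thousands of tokens, which gives a sense of why absorbing such an image every millisecond is out of reach for a token-by-token approach.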
Multi-stream architectures (like those from Thinking Labs) that process multiple modalities in parallel feel like an obvious and potentially transformative tweak, but no major lab has fully committed to this direction yet.
OpenAI’s Bet on Reasoning
OpenAI’s decision to pivot to reasoning models — at a time when they were less chatty, slower, and harder to use than pure chat models — was a defining and brave bet that paid off.
The company had to maintain two parallel model lines (chat and reasoning) during the transition, which was operationally difficult, but the commitment to reasoning gave them a lasting quality advantage in RL that some larger labs still struggle to match.
The challenge of scale: As labs grow (OpenAI has grown ~20x), it becomes harder to take wild bets because there’s more to lose and more processes in place. Kaiser hopes labs retain the ability to make bold moves.
The AI Coding Wars
Anthropic succeeded in coding because they made a deliberate strategic bet to focus on it when they couldn’t compete with ChatGPT — a decision that positioned them ahead when coding agents became the defining AI capability of 2026.
OpenAI caught up quickly because it already had coding capability, even though coding had not been its focus. This illustrates the tension between nailing today’s winning product and keeping bets open for the next wave.
The real question is expansion beyond coding: Coding agents are powerful but require a learning curve that limits adoption. The bigger challenge is making this power accessible to people in other occupations — accountants, lawyers, doctors — who may not want to learn to work with a tool called “Codex.”
Google’s strategy of keeping many areas open (rather than focusing) means they can catch up quickly when something works, but they may not be the first to nail any particular domain.
Open Source vs. Closed Source
Distillation works but distilled models are never quite as good as the originals — Kaiser notes that Gemini 2.5 Flash doesn’t feel on par with 2.5 Pro for his use cases.
There are strong incentives for both open and closed models to persist: companies and countries want sovereign AI capabilities (not depending on a single provider that could have an outage), while labs have incentives to stay ahead to justify pricing.
The gap will likely persist but not become catastrophic — open models will be good enough for many use cases, while closed models maintain an edge on the hardest problems.
Quickfire
Biggest mind change: Kaiser went from barely using AI to using coding agents constantly and no longer using a traditional editor — he now just tells agents to change code directly.
Safety concerns: Haven’t changed much — he remains in the “not too worried but don’t be complacent” camp, focusing on concrete risks like systems hacking infrastructure rather than existential scenarios.
On Andrej Karpathy joining Anthropic for AGI research: Kaiser is skeptical that even AI-assisted research will quickly crack post-Transformer architectures, because the space of ideas is vast and most are wrong; efficient search doesn’t guarantee finding the right answer.
Why he hasn’t started a company: He loves technical work and has been happy at Google and OpenAI, though he acknowledges the privilege of being able to focus on research.
Final Thought
This is still the most exciting time to be an ML researcher: powerful GPUs under your desk, coding agents that push them to their limit, and a “whiff” of potentially transformative new ideas in the air.
Kaiser encourages researchers to pursue wild ideas even if they fail — the space of wrong things is where breakthroughs come from, and models are still bad at learning from totally wrong directions, which is exactly where humans excel.