Expert AI Researcher Reacts to o1 and Shares What's Next in Reasoning and Post-Training

Unsupervised Learning 57min 6 min #20

Summary

  • Douwe Kiela is a foundational figure in modern AI: he co-authored the first paper on retrieval-augmented generation (RAG), spent five years at Facebook AI Research (FAIR), led research at Hugging Face, is an adjunct professor at Stanford, and is now CEO and co-founder of Contextual AI—a company that has raised nearly $100 million to build customized, enterprise-grade language model systems.
    • This episode covers his reaction to OpenAI’s o1 model, the shift from model-centric to systems-centric AI, advances in post-training and alignment (including DPO, KTO, APO, and CLAIR), the realities of enterprise deployment, the future of small and open-source models like OLMoE, and emerging trends like multi-agent systems and synthetic data.

OpenAI’s o1 and the Rise of Test-Time Compute

  • OpenAI’s o1 represents a major step in reasoning by compressing Chain of Thought (CoT) into the model via reinforcement learning (RL), effectively turning the model into a more complex system that “thinks” during inference.
    • This approach trades latency for accuracy: it is powerful for math, law, and complex reasoning, but slower and not universally better than older models.
    • Douwe sees this as validation of a broader industry trend: moving beyond next-token prediction toward integrated systems that combine retrieval, reasoning, and generation.
    • He notes that while the ideas behind o1 aren’t new (CoT and RL have existed for years), OpenAI deserves credit for exceptional execution and integration.

Contextual AI’s Core Thesis: Systems Over Models, Specialization Over AGI

  • Contextual AI was founded on the belief that enterprises don’t need generalist AGI—they need specialized, reliable, end-to-end systems tailored to high-value, knowledge-intensive tasks.
    • Most enterprise AI failures stem from treating the LLM as a standalone component rather than part of a larger system that includes extraction, retrieval, reranking, generation, alignment, and evaluation.
    • Contextual takes a “vertical slice” approach: tightly integrating all components (retrieval, generation, post-training) and co-optimizing them for specific use cases—contrasting with the common “Frankenstein RAG” approach of bolting together off-the-shelf tools.
    • This enables higher ROI, better performance, and an easier path to production, especially in regulated or high-stakes domains (e.g., finance, HR) where generalist models pose compliance risks.
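
The component chain described above (retrieval, reranking, generation) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Contextual AI's implementation: the bag-of-words embed function, the term-coverage reranker, and the placeholder generate function are all hypothetical stand-ins for a learned encoder, a learned reranker, and an LLM call.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (hypothetical stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=3):
    """First stage: cheap similarity search over the whole corpus."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def rerank(query, candidates, k=1):
    """Second stage: rescore the shortlist (here: fraction of query terms covered)."""
    q_terms = set(query.lower().split())

    def coverage(doc):
        return len(q_terms & set(doc.lower().split())) / max(len(q_terms), 1)

    return sorted(candidates, key=coverage, reverse=True)[:k]


def generate(query, context):
    """Placeholder for the LLM call: just formats a grounded prompt."""
    return f"Question: {query}\nContext: {' | '.join(context)}"


def answer(query, docs):
    """End-to-end chain: retrieve, rerank, then generate from the top context."""
    return generate(query, rerank(query, retrieve(query, docs)))
```

In a system of the kind the summary describes, these stages would be trained and evaluated together rather than assembled from independent off-the-shelf parts; the sketch only shows how the stages hand data to one another.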

The Evolution and Limits of RAG

  • Douwe co-authored the first RAG paper while at FAIR, inspired by grounding language in external knowledge (like Wikipedia) using early vector search (FAISS).
    • The original vision was more ambitious than what ended up in the paper; Contextual is now building that fuller vision.
    • RAG became popular because it solved real grounding problems, but most implementations fail at scale due to poor extraction, naive retrieval, and lack of end-to-end optimization.
    • He jokes they should’ve named it something catchier—ideally with “contextual” in it.

Enterprise AI: Demos vs. Production Reality

  • Many enterprise AI efforts stall at the demo stage because they’re built on small, curated datasets (e.g., 20 PDFs) and don’t generalize to real-world scale (e.g., 10,000+ documents).
    • Real deployment requires solving boring but critical problems: robust document extraction, scalable retrieval, security, compliance, and operational monitoring.
    • Contextual focuses exclusively on production deployments, working closely with customers to define success metrics and keep humans in the loop—especially for high-risk use cases.
    • Common successful applications include code generation, customer support, and internal search—but even these require careful scoping and integration.

Advances in Alignment and Post-Training

  • Post-training—especially alignment—is where much of the “magic” happens in making models useful.
    • RLHF (Reinforcement Learning from Human Feedback) was key to ChatGPT’s success but has drawbacks: it requires expensive reward models and manual preference annotation.
    • DPO (Direct Preference Optimization) removes the need for a reward model by optimizing directly on preference pairs.
    • KTO (Kahneman-Tversky Optimization), developed by Douwe’s team, goes further: it works with binary feedback (thumbs up/down) without needing paired comparisons—making alignment cheaper and more scalable.
    • CLAIR (Contrastive Learning from AI Revisions) improves signal quality by comparing flawed outputs to their corrected versions, reducing ambiguity in preferences.
    • APO (Anchored Preference Optimization) accounts for how the quality of the model being trained compares with the quality of the preference data, so the model is not pushed toward examples that are worse than what it already produces.
    • Together, these methods enable powerful alignment with less human annotation and more control over model behavior.
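
To make the DPO point above concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The function and argument names are assumptions chosen for illustration; the inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model, and beta is the usual temperature hyperparameter.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    The implicit reward of a response is beta times the log-ratio between
    the policy and the reference model; no separate reward model is trained.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Loss shrinks as the policy widens the margin in favor of the chosen response.
    return -log_sigmoid(chosen_reward - rejected_reward)
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls below that as the policy shifts probability toward the chosen response. KTO differs in that it scores each response against a single thumbs-up or thumbs-down label instead of a pair, which this sketch does not cover.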

Small Models, Open Source, and On-Device AI

  • Contextual collaborated with the Allen Institute to release OLMoE, a high-quality open-source Mixture-of-Experts (MoE) model.
    • Motivation: lack of strong open-source MoE models for research and edge deployment.
    • Trend toward smaller, efficient models that can run on-device—especially when combined with techniques like GRIT (Generative Representational Instruction Tuning), which allows the same model weights to serve as both retriever and generator, saving compute.
    • Future vision: combine OLMoE + GRIT to build powerful, private, on-device RAG systems (e.g., on your phone).

Evaluation: The Missing Standard in Enterprise AI

  • Evaluation is one of the most underdeveloped areas in enterprise AI.
    • Most companies rely on ad hoc spreadsheets with too few examples and high variance—leading to unreliable assessments of model quality and risk.
    • There’s no standardized, principled framework for evaluating end-to-end AI systems in enterprise contexts.
    • Contextual is working on evaluation tools designed for modern “API-first” AI developers—focusing on usability over statistical rigor—so non-ML experts can reliably test and trust their systems.
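
The concern about too few examples and high variance can be illustrated with a confidence interval on measured accuracy. The sketch below uses the Wilson score interval, a standard textbook method chosen here as an assumption; it is not described in the episode and is not Contextual's evaluation tooling.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Same observed pass rate (80%), very different certainty.
small_eval = wilson_interval(16, 20)       # a 20-row spreadsheet
large_eval = wilson_interval(1600, 2000)   # a 2,000-example test set
```

The interval for the 20-example evaluation is several times wider than the one for 2,000 examples, which is why two systems that look different on a small spreadsheet may not be distinguishable in practice.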

Changing Minds: What Surprised Douwe Recently

  • Synthetic data works better than expected—despite flawed claims that models collapse when trained on their own outputs.
  • Agentic workflows are more viable than anticipated—even if “agent” remains loosely defined.
  • Test-time compute (e.g., Chain of Thought) is not a gimmick—it’s a powerful, practical tool that significantly boosts reasoning, especially when distilled back into models via RL.
  • He admits he was too dismissive of CoT early on, underestimating its real-world impact due to academic bias toward “elegant” ML contributions.

Data: We’re Not Running Out—But Quality Matters

  • Claims that we’re “running out of tokens” are overblown.
    • Society produces vast amounts of data daily; the issue is quality, not quantity.
    • Lower-quality data can still be useful—but requires more of it and smarter algorithms.
    • Multimodal data (especially video) is vastly underused: training on cat videos teaches “cat” far better than text alone.
    • Synthetic data, when done right, is powerful—especially when paired with advanced alignment methods like KTO and APO.

Reasoning: Beyond Math Puzzles

  • Models have had reasoning capabilities for a while; o1 didn’t invent them.
    • Real progress will come from better data (e.g., step-by-step reasoning from domain experts) and systems that combine reasoning with retrieval and verification.
    • Douwe is excited about meta-reasoning: e.g., self-referential tasks like “How many words are in this sentence?”—which require models to reason about their own structure.
    • His paper “I Am a Strange Dataset” explores this, inspired by Douglas Hofstadter’s theory that consciousness arises from self-referential loops.

Multi-Agent Systems: The Next Frontier

  • While still early, multi-agent systems are already emerging in practice (e.g., synthetic data pipelines, tool-using agents).
    • Douwe’s early FAIR work on multi-agent communication showed that network topology affects emergent “dialects”—a metaphor for organizational culture.
    • Future systems will likely involve humans and AI agents collaborating in shared workflows—with AI agents deciding when (and when not) to involve humans.

Hugging Face and Meta’s Open-Source Strategy

  • Hugging Face has become the central hub for publishing models, the role GitHub plays for code and an opportunity GitHub itself missed.
    • Its success depends on the open-source ecosystem (including Meta’s Llama), but it has carved out a unique, beloved position.
    • Meta’s decision to open-source PyTorch, React, and Llama reflects Mark Zuckerberg’s strategic vision: control the platform layer to avoid dependency on rivals.
    • This has also rehabilitated Meta’s public image and boosted its AI hiring.

Building Contextual: Pragmatism Over Hype

  • Contextual has raised significant capital but avoids overfunding relative to needs—preparing for a potential AI hype downturn.
    • They don’t train base models (relying on Meta and others), staying capital-efficient.
    • Current product is “white-glove”: high-touch, customized deployments (like selling Tesla Roadsters with engineers included).
    • Long-term goal: build the “Model S assembly line”—a turnkey, scalable platform for enterprise AI.
    • Infrastructure (e.g., reliable GPU clusters) is harder than expected—even top labs like Meta face constant hardware failures.

Quickfire Round

  • Overhyped & underhyped: Agents—they don’t work reliably yet (overhyped), but show real promise (underhyped).
  • Biggest surprise in building Contextual: How hard it is to maintain a stable, high-performance research cluster.
  • Open vs. closed models: The future is a mix—but the sweet spot is in the middle: open-source models enhanced with strong post-training to balance cost, performance, and control.
  • Most exciting non-Contextual AI company: AI-driven personalized entertainment (e.g., Suno, video generation)—moving faster than expected toward infinite, personalized media.
  • Most interesting company to run AI at: A traditional enterprise like JPMorgan—where AI can transform deeply entrenched workflows if deployed wisely.
  • If not building Contextual: He’d still focus on transforming work through AI—but might explore entertainment applications.
  • Where to learn more: contextual.ai, and Douwe on Twitter (he’s the only one with his name).