OpenAI's Head of Product: How the Best Teams Build, Ship and Scale AI Products — Unsupervised Learning

Olivier Godement, Head of Product for Enterprise at OpenAI, discusses the latest model releases, enterprise adoption patterns, the state of AI agents, and where the industry is heading.

OpenAI shipped GPT-5.1 and GPT-5.1 Codex, designed to preserve the intelligence of GPT-5 while dramatically reducing latency by compressing thinking tokens.
Codex has seen rapid adoption, with OpenAI engineers reportedly pushing ~70% more code because of it; it is the most dogfooded model internally.
A surprising area of growth is scientific research: scientists are using GPT-5.1 to aggregate literature, test hypotheses, and accelerate discovery.
- In one example, GPT-5 Pro reproduced the math of a newly released physics paper in ~30 minutes, work that took the original physicist weeks.

Building reliable agents is hard: beyond the model, teams need strong harnesses (tool integration, evaluation frameworks, human-in-the-loop flywheels).
Strong automation is already happening in coding and customer support, with meaningful enterprise deployments (e.g., T-Mobile).
Life sciences and pharma (e.g., Amgen) are on the cusp: the admin/regulatory work of drug development—generating complex documents, managing reviews—is a strong LLM use case once versioning, auditing, and permissions are solved.
Hedge funds, investment firms, and banks are also adopting AI for real-time data aggregation and analysis.

OpenAI works directly with large enterprises on the most complex, high-stakes problems, but the mission is also to enable the ecosystem.
Apps in ChatGPT, announced at DevDay, is a major vector for enabling that ecosystem: enterprises want employees to do more inside ChatGPT, and startups can build specific features on top of ChatGPT’s adoption, memory, and connectors.
ChatGPT is becoming the “first place you check in the morning” (e.g., Pulse preparing your day), with deeper workflows still happening in specialized tools.
Even if model capabilities froze today, there would be many years of enterprise adoption work left.

The most impactful frontier is continuous learning: agents that update their weights based on human feedback in real time, rather than relying on static prompting.
This would unlock coding, customer support, finance, and more—agents would improve on the job like a human intern getting better over time.

The dominant categories with strong product-market fit are coding, customer support, finance, and healthcare/life sciences.
Voice is also emerging as a major category.
The strategy is to go deeper in these proven domains rather than constantly expanding into new ones.

Scaffolding is still fairly bespoke across industries; there is no standard agent architecture yet.
The industry is converging on:
- Code as the interface: giving agents access to shells, scripts, and tools, since models are improving fastest at code.
- MCP (Model Context Protocol) as a standard for data/API connections.
- Better evaluation through trace generation and analysis.
Standardizing agent architecture would make it easier for customers to compare models and adopt new ones.
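The harness ideas above (tools exposed as plain callables, plus a trace that can be analyzed afterwards) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenAI's or MCP's actual design: `Harness`, `TraceEvent`, and `summarize` are hypothetical names, and the model that would choose which tool to call is left out.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class TraceEvent:
    """One recorded tool call (hypothetical trace schema)."""
    tool: str
    args: Dict[str, Any]
    ok: bool
    output: str
    seconds: float


@dataclass
class Harness:
    """Hypothetical harness: a tool registry plus a trace of every call."""
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    trace: List[TraceEvent] = field(default_factory=list)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def call(self, name: str, /, **args: Any) -> str:
        """Run one tool call and record it in the trace, even when it fails."""
        start = time.perf_counter()
        try:
            output, ok = str(self.tools[name](**args)), True
        except Exception as exc:  # unknown tool or an error inside the tool
            output, ok = f"{type(exc).__name__}: {exc}", False
        elapsed = time.perf_counter() - start
        self.trace.append(TraceEvent(name, args, ok, output, elapsed))
        return output


def summarize(trace: List[TraceEvent]) -> Dict[str, Any]:
    """Aggregate a trace into the counts an evaluation pass might start from."""
    failures = [event for event in trace if not event.ok]
    return {
        "calls": len(trace),
        "failures": len(failures),
        "failed_tools": sorted({event.tool for event in failures}),
    }
```

In a real agent the model would pick the tool name and arguments; here a caller would invoke `harness.call("add", a=2, b=3)` directly and then pass `harness.trace` to `summarize` to see which tools failed.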

The cost of GPT-4-level queries has dropped 1–2 orders of magnitude in ~3 years through model compression, better hardware, and improved networking.
Cost still blocks some use cases (e.g., personalized content on every homepage); driving cost down further is core to OpenAI’s mission.
For high-leverage use cases like coding, the economics already work; for others, further cost reduction will unlock massive latent demand.
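To put the "1–2 orders of magnitude in ~3 years" figure in yearly terms, the short calculation below converts a total drop into an equivalent per-year factor. It assumes a constant rate of decline over exactly three years, which is a simplification for illustration and not something stated in the interview; `annual_cost_reduction` is a hypothetical helper.

```python
def annual_cost_reduction(orders_of_magnitude: float, years: float) -> float:
    """Yearly factor by which cost falls, assuming a constant rate of decline.

    A total drop of 10**orders_of_magnitude spread evenly over `years`
    corresponds to a per-year factor of 10**(orders_of_magnitude / years).
    """
    if years <= 0:
        raise ValueError("years must be positive")
    return 10 ** (orders_of_magnitude / years)


low = annual_cost_reduction(1, 3)   # one order of magnitude in three years
high = annual_cost_reduction(2, 3)  # two orders of magnitude in three years
print(f"{low:.2f}x to {high:.2f}x cheaper per year")
```

Under that assumption, the quoted range works out to costs falling by roughly 2x to 4.6x each year.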

Reinforcement fine-tuning (RFT) is not yet widely adopted; most enterprises are still catching up to frontier base model capabilities.
Early innovators (e.g., an accounting software firm) have used RFT with small sets of high-quality examples to improve model performance 20–30% on gold-standard evals, crossing the threshold from non-viable to viable.
RFT remains heavyweight (it requires building environments and graders, then waiting hours or days for results), so it will likely stay a tool for frontier-pushing teams rather than the mass market.
Most enterprises will rely on base model improvements for the bulk of their automation.
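Part of what makes RFT heavyweight is the grader: a function that turns each model output into a numeric reward. The sketch below shows one plausible shape for such a grader, giving partial credit per matching field. It is an illustration only and does not reflect OpenAI's grader API; the `grade` function and the bookkeeping-style fields in the usage note are hypothetical.

```python
import json
from typing import Any, Dict


def grade(model_output: str, reference: Dict[str, Any]) -> float:
    """Return the fraction of reference fields reproduced exactly (0.0 to 1.0)."""
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no reward
    if not isinstance(predicted, dict) or not reference:
        return 0.0  # wrong top-level type, or nothing to compare against
    matches = sum(
        1
        for key, value in reference.items()
        if key in predicted and predicted[key] == value
    )
    return matches / len(reference)
```

For a reference of `{"account": "4000", "amount": 125.5, "currency": "USD"}`, an output that gets the account and amount right but the currency wrong would score two thirds. A production grader would also need to handle formatting tolerance and domain-specific equivalence, which is where much of the effort goes.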

Model evaluation falls into three buckets: (1) capabilities and behavior, (2) cost and latency, (3) “vibes” / social media perception.
Industry benchmarks (e.g., GDPval for economic tasks, τ-bench for services) are emerging, but many startups still rely on qualitative testing and “taste.”
Cost is expected to continue dropping significantly over the next year.

The era of hot-swapping models by changing an API parameter is largely over for non-trivial use cases; each model has distinct idiosyncrasies in instruction-following, tool signatures, and long-context handling.
Top teams take time to battle-test new models, come with strong taste but an open mind, and give specific feedback to OpenAI so post-training can be adjusted.
The industry needs more predictable release cadences with clear changelogs, similar to traditional software versioning.

GPT-4o’s voice capabilities were a breakthrough in expressiveness and tone, but the voice Turing test has not yet been crossed—interruptions and cadence still feel off.
The goal is a level of naturalness and expressiveness at which users are comfortable being served by AI rather than a human.
Multilingual capabilities are critical for customer support, where staffing agents for every language is impractical.

Codex’s success comes from a small, focused team iterating on model, harness, integration, and data.
The next unlock for AI in software engineering is collaboration: models need to handle communication, scoping, architectural decisions, and on-call work, not just code generation.
Enterprise adoption of agentic coding tools is reaching critical mass; 2025 is expected to be “the year of coding in the enterprise.”
Model providers are moving from being pure inference API providers to offering model + harness + reference design as a package.

Data infrastructure first: clean data, proper APIs, authentication, logging, and permissions are prerequisites; without them, even the best agent is useless.
Rigorous evaluation: build golden sets, document SOPs, and create evals; most enterprise knowledge lives in people’s brains, not written docs, so finding the right experts is key.
Change management: take time to explain how the technology works to teams and customers; the technology is new to everyone.
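The "golden set" idea can be made concrete with a very small evaluation loop: a list of expert-approved cases and a function that reports how many the agent gets right. This is a minimal sketch under stated assumptions. `GoldenCase`, `pass_rate`, the example questions, and the stub agent are all hypothetical, and exact string matching stands in for whatever comparison a real deployment would need.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str  # answer agreed on by a domain expert


def pass_rate(agent: Callable[[str], str], cases: List[GoldenCase]) -> float:
    """Fraction of golden cases the agent answers correctly (case-insensitive)."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1
        for case in cases
        if agent(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return passed / len(cases)


# Hypothetical golden set and a stub agent used only to exercise the loop.
golden_set = [
    GoldenCase("Refund window for annual plans?", "30 days"),
    GoldenCase("Who approves vendor contracts?", "Procurement"),
    GoldenCase("Escalation path for outages?", "On-call lead"),
]
canned_answers = {
    "Refund window for annual plans?": "30 days",
    "Who approves vendor contracts?": "procurement",
}


def stub_agent(prompt: str) -> str:
    return canned_answers.get(prompt, "unknown")


rate = pass_rate(stub_agent, golden_set)
print(f"pass rate: {rate:.0%}")
```

The hard part in practice is the one the text points to: getting the right experts to write down the expected answers, since the loop itself is trivial once the golden set exists.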

Underhyped: science and drug discovery. Even a 5% acceleration in the rate of discovery has enormous compounding implications for the economy and technology.
Changed his mind on: “the model is everything.” Harnesses are becoming equally important, and high-quality data infrastructure is critical for enterprise adoption to scale.

OpenAI has shipped features and tools that didn’t find product-market fit, invested in audio technologies that didn’t pan out, and defined “agent” APIs in 2023, too early for them to catch on. A healthy amount of experimentation and failure underlies the visible successes.
