Anthropic's First PM: Opus 4.5, Rethinking Model Scaffolding & Safety as a Competitive Advantage — Unsupervised Learning

Diane, head of product for research at Anthropic, discusses the launch of Opus 4.5, a major model release that delivers significant gains in coding, computer use, long-running agent tasks, and enterprise productivity, while being more cost-efficient than prior Opus models. The conversation covers how Anthropic plans and builds models, how they prioritize capabilities, the evolving role of scaffolding and evals, and why Anthropic’s focus on safety and alignment is itself a competitive advantage in producing higher-quality, more independent-thinking models.

How Anthropic Builds New Models

Anthropic follows a long-range roadmap focused on core capability improvements: better instruction following, coding, memory, and practical task performance across domains like Excel and PowerPoint.
- Each new generation of Claude is the vehicle for delivering these improvements, and product decisions (pricing, positioning, packaging) are made alongside the research.
The process starts with envisioning what users will want, then translating that into concrete evaluations (evals) and training investments (data, RL).
- Some directions are planned; others emerge from user and builder discovery, such as Claude becoming strong at Excel work after a small investment that resonated heavily with financial services customers.
Opus 4.5 was designed from the start to be more efficient, enabling Anthropic to offer Opus-level intelligence at a lower cost via the “effort parameter,” which lets users trade off price versus reasoning depth.
- The team emphasizes that per-token price is a misleading measure of total task cost: smaller or cheaper models may use more tokens or fail entirely, which can make a more capable model like Opus 4.5 cheaper end to end.
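The per-token versus per-task distinction can be made concrete with a little arithmetic. The sketch below is illustrative only: the model names, prices, token counts, and success rates are invented for the example and are not Anthropic's figures, and it assumes failed attempts are simply retried until one succeeds.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical cost profile; every number used below is made up."""
    name: str
    price_per_million_tokens: float  # USD per 1M tokens (assumed)
    tokens_per_attempt: int          # average tokens consumed per attempt (assumed)
    success_rate: float              # chance that one attempt completes the task (assumed)


def expected_cost_per_success(m: ModelProfile) -> float:
    """Expected spend per completed task, assuming independent retries until success."""
    if not 0 < m.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = m.price_per_million_tokens * m.tokens_per_attempt / 1_000_000
    # With independent retries, the expected number of attempts is 1 / success_rate.
    return cost_per_attempt / m.success_rate


# Illustrative profiles: the "small" model is cheaper per token but uses
# more tokens per attempt and fails more often.
small = ModelProfile("small-model", 3.0, 400_000, 0.4)
large = ModelProfile("large-model", 10.0, 150_000, 0.9)

for profile in (small, large):
    print(f"{profile.name}: ${expected_cost_per_success(profile):.2f} per completed task")
```

Under these made-up numbers the model with the higher per-token price ends up cheaper per completed task, which is the shape of the argument in the episode. Real comparisons depend on measured token usage and success rates for the workload in question.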

What Opus 4.5 Enables

Opus 4.5 shows a notable jump in complex, iterative, and long-running agent coding tasks, not just one-shot generation.
- Early customers like Shortcut (a spreadsheet agent) reported ~20% accuracy gains without changing their harnesses.
Computer use has matured from an experimental feature to something closer to an end-to-end agent that can operate in web browsers, handle calendar rescheduling, and perform more open-ended tasks.
- Improved vision and reasoning in Opus 4.5 make browser-use interactions significantly better.
The model is also a strong “thinker,” not just a writer: it can spontaneously generate alternative strategies and ideas (e.g., around pricing and positioning), rather than just refining human-provided options.
Early testing highlights improvements in 3D game generation and other visually intuitive tasks, which help people quickly grasp what the model can and cannot do.

Product-Market Fit and the Enterprise Agent Debate

Agentic coding is the clearest area of product-market fit, with strong enterprise demand for tools like Claude Code.
- Synchronous agents beyond coding are emerging, but the industry has not yet figured out the right harnesses and product features for many use cases (e.g., web monitoring, personal agents).
Diane pushes back on pessimistic takes (e.g., from Andrej Karpathy and Ilya Sutskever) that enterprise agents are far off or that current paradigms are hitting a wall.
- She argues that progress is jagged, not linear, and that customers like Rakuten and Lovable report real productivity gains.
- Internally, every generation of Claude transforms how Anthropic employees work, which gives her confidence that transformative AI is already underway.
She believes Opus 4.5’s combination of higher intelligence, better context quality, and improved memory will enable proactive, long-running agents that monitor, maintain, and improve systems over time, rather than just responding to chat prompts.

Long-Running Agents and Evaluation Challenges

The next frontier is long-running intelligence: agents that take open-ended responsibility (e.g., maintaining a website, managing a portfolio) with minimal handholding.
- Current evals like SWE-bench are saturated and do not capture the quality of judgment, efficiency, or long-horizon decision-making.
Diane highlights the need for more open-ended evals, citing Vending Bench (where Claude runs a vending machine business) as an early example.
- The goal is to measure not just whether a task is completed, but how much effort or time it takes and whether the model remembers and reasons over long horizons.
She also notes the importance of “model taste”: a hands-on, continuously honed intuition for what models can do, how to push them, and how to build effective scaffolding around them.
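One way to picture an eval that records effort as well as completion is to keep a few fields per run and summarize them together. This is a minimal sketch under stated assumptions: the `RunResult` fields and the `summarize` helper are hypothetical and do not reflect how Vending Bench or any Anthropic eval is actually scored.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class RunResult:
    """One long-horizon agent run (hypothetical fields)."""
    completed: bool            # did the agent finish the task?
    tokens_used: int           # effort measured in tokens
    wall_clock_minutes: float  # effort measured in elapsed time


def summarize(runs: List[RunResult]) -> Dict[str, Optional[float]]:
    """Report the completion rate alongside the effort spent on successful runs."""
    if not runs:
        raise ValueError("need at least one run")
    successes = [r for r in runs if r.completed]
    return {
        "completion_rate": len(successes) / len(runs),
        # Effort is averaged over successful runs only; None if nothing succeeded.
        "mean_tokens_per_success": (
            mean([r.tokens_used for r in successes]) if successes else None
        ),
        "mean_minutes_per_success": (
            mean([r.wall_clock_minutes for r in successes]) if successes else None
        ),
    }


runs = [
    RunResult(completed=True, tokens_used=1000, wall_clock_minutes=10.0),
    RunResult(completed=False, tokens_used=5000, wall_clock_minutes=60.0),
    RunResult(completed=True, tokens_used=3000, wall_clock_minutes=30.0),
]
print(summarize(runs))
```

Two agents with the same completion rate can then be told apart by how many tokens or minutes they needed. Judgment quality and long-horizon memory are harder to reduce to a number and are not captured by this sketch.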

The Evolution of Scaffolding

Scaffolding (harnesses, tools, and rules around models) has shifted from “training wheels” (e.g., long lists of do’s and don’ts) to “intelligence augmentations.”
- The best scaffolds now give models generic toolsets, multi-agent orchestration, and lightweight structure that maximizes autonomy.
Anthropic expects some scaffolding to be obsoleted by future models, so they favor thinner, more adaptable harnesses.
- User demands also evolve: as models get better, users give them more complex tasks, which in turn requires updating scaffolds and product experiences.
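A thin harness in the sense described above might amount to little more than a registry of generic tools plus a dispatcher, with no task-specific rules baked in. The sketch below is an assumption-laden illustration: `ThinHarness`, its methods, and the two toy tools are invented for this example and are not Anthropic's scaffolding or any real SDK.

```python
from typing import Callable, Dict

# A tool here is just a function from a text argument to a text result.
ToolFn = Callable[[str], str]


class ThinHarness:
    """Hypothetical minimal scaffold: generic tools, no task-specific rules."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, name: str, fn: ToolFn) -> None:
        """Add a tool under a name the model can refer to."""
        self._tools[name] = fn

    def call(self, name: str, argument: str) -> str:
        """Dispatch a tool request; unknown tools yield an error string, not an exception."""
        tool = self._tools.get(name)
        if tool is None:
            # Returning the error as text lets the caller pass it back to the model.
            return f"error: unknown tool '{name}'; available: {sorted(self._tools)}"
        return tool(argument)


harness = ThinHarness()
harness.register("echo", lambda text: text)
harness.register("word_count", lambda text: str(len(text.split())))

print(harness.call("word_count", "thin harnesses age better"))
print(harness.call("search", "anything"))
```

Because the harness holds no instructions about how to do a task, swapping in a more capable model requires changing only the tool set, which is one reading of why thinner scaffolds are expected to age better.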
Diane encourages builders to maintain ambitious prototypes and run regular hackathons to discover what is newly possible with each model generation, rather than waiting for obvious use cases to emerge.

Anthropic’s Culture and Key Decisions

Diane describes Anthropic as exceptionally talent-dense, mission-driven, and authentic, with leaders who “walk the walk” on safety and product quality.
- Her role has evolved from hands-on, early-stage work (setting up A/B tests, emailing customers) to coaching a larger team of PMs embedded with researchers.
Two pivotal decisions stand out:
- Focusing early on agentic coding instead of embeddings and RAG, which were the most common user requests in 2023. This was a “user-centric, not user-led” bet on a bigger opportunity.
- Shipping computer use as a beta API despite known limitations, to showcase a new form factor and learn from real-world use, while investing heavily in safety.
A fun highlight: her team built “Golden Gate Claude” (an interpretability demo) from model to UI in less than a day, which went viral internally and externally, showing the company’s ability to move fast and creatively.

Safety as a Competitive Advantage

Diane argues that safety and alignment are not just constraints but actively improve the quality of intelligence.
- A well-aligned model is more of an independent thinker, less prone to sycophancy (telling users what they want to hear), and more likely to push back or offer better alternatives.
- She gives an example where Opus 4.5 proposed a third pricing strategy that was better than the two she had come up with, something a more sycophantic model would not have done.
She believes the industry under-discusses how safety investments can lead to higher-value, more breakthrough-oriented AI.

Looking Ahead

Diane’s timelines for transformative AI have moved up this year based on what she’s seeing with Opus 4.5 and related models.
- She sees the building blocks for long-running, transformative AI as closer than many think, with the main challenge being product and scaffolding design to express that intelligence.
She encourages builders to:
- Maintain ambitious prototypes and test them with every new model release.
- Invest in product experiences that take advantage of new intelligence, rather than just swapping in new models.
- Develop “model taste” through hands-on experimentation and creative problem-solving.
For more information, she points to Anthropic’s blog and website, noting that the company prefers to let the work speak for itself rather than engage in hype or cryptic pre-launch marketing.
