- Uber has undergone a major shift from using AI primarily for customer-facing systems (like matching and pricing) to embedding agentic AI deeply into the engineering lifecycle, with measurable productivity gains, new tooling, and significant organizational and cost challenges.
- This shift was driven by CEO Dara Khosrowshahi declaring AI one of Uber’s six strategic priorities, framing it as a move from a “human + early AI-powered company” to a “generative/agentic AI-powered company.”
- The core philosophy is augmentation, not replacement: freeing engineers from repetitive toil (library migrations, bug fixes, dead code cleanup, documentation) so they can focus on creative, business-growing work.
- The technology evolution—from pair programming tools like GitHub Copilot (10–15% velocity bump) to asynchronous peer programming with agents that run for hours—enabled a paradigm shift where developers act as tech leads directing multiple background agents simultaneously.
- As soon as agentic workflows were made available, 70% of tasks submitted were toil work, because the accuracy on well-defined tasks (migrations, upgrades) was much higher than on ambiguous feature development, creating a virtuous cycle of adoption.
Uber’s Agentic Platform Architecture
- Uber built its agentic platform on top of existing internal infrastructure rather than relying solely on external vendors, enabling speed, security, and deep organizational integration.
- Michelangelo platform: Uber’s long-standing ML platform provides a model gateway (proxying frontier models from OpenAI, Anthropic, etc.), inference/training infrastructure, and has evolved to support agentic APIs.
- Context sources: Agents are connected to source code, engineering docs, Jira tickets, Slack, and other internal systems to give them organizational memory.
- MCP (Model Context Protocol) gateway: A cross-functional tiger team built a central MCP gateway that proxies both external and internal MCPs with consistent authorization, telemetry, and logging. A registry and sandbox let developers discover and test MCPs safely.
- Agent builder infrastructure: SDKs and no-code solutions let teams across Uber build agents that are discoverable through a registry and deployable consistently across dev pods, local laptops, the Minion background agent platform, or production.
- AIFX CLI: A command-line tool that serves as the primary interface for engineers to provision agent clients, install and configure MCPs, manage settings, and connect to background task infrastructure.
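The gateway idea above (one proxy in front of many MCP servers, with uniform authorization, telemetry, and a discoverable registry) can be sketched roughly as follows. This is a minimal illustration only, not Uber's implementation: `McpGateway`, `GatewayCall`, the per-server allow-list, and the plain-callable handlers standing in for real MCP servers are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class GatewayCall:
    """One telemetry record per proxied call (hypothetical shape)."""
    server: str
    user: str
    allowed: bool
    duration_s: float


@dataclass
class McpGateway:
    # Hypothetical registry: server name -> callable standing in for an MCP server.
    registry: Dict[str, Callable[[dict], dict]] = field(default_factory=dict)
    # Hypothetical allow-list: server name -> users permitted to call it.
    acl: Dict[str, Set[str]] = field(default_factory=dict)
    telemetry: List[GatewayCall] = field(default_factory=list)

    def register(self, name: str, handler: Callable[[dict], dict],
                 allowed_users: Iterable[str]) -> None:
        self.registry[name] = handler
        self.acl[name] = set(allowed_users)

    def discover(self) -> List[str]:
        """List registered servers so developers can find them."""
        return sorted(self.registry)

    def call(self, user: str, server: str, request: dict) -> dict:
        start = time.monotonic()
        allowed = server in self.registry and user in self.acl.get(server, set())
        try:
            if not allowed:
                raise PermissionError(f"{user} may not call {server}")
            return self.registry[server](request)
        finally:
            # Every call is recorded, whether it was allowed or rejected.
            self.telemetry.append(
                GatewayCall(server, user, allowed, time.monotonic() - start)
            )
```

Because every call passes through `call`, authorization and logging behave the same way for internal and external servers, which is the property the notes attribute to the central gateway.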
Minion: Uber’s Background Agent Platform
- Minion is Uber’s internally built background agent platform that runs asynchronously on Uber’s own CI infrastructure, addressing limitations of vendor-hosted background agents.
- It runs on Uber’s CI platform with monorepos pre-checked out, handles network access to internal services, and connects to MCP servers through AIFX.
- It is accessible through a web interface, Slack, GitHub PRs, the CLI, and APIs—allowing integration into other workflows.
- It provides smart defaults per monorepo (well-written prompt templates, codemod setup, context) to increase task success rates compared to running agents locally without that scaffolding.
- A prompt improver feature analyzes user prompts before execution and suggests improvements, flagging low-quality prompts with a visual indicator to reduce wasted runs.
- Demo example: A developer pasted a user-reported crash error into Minion, selected a template, and chose Claude Code as the agent. Seven minutes later, a Slack notification arrived with a completed PR, including a test plan and co-authorship by the Minion bot, which dramatically reduced context-switching.
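The prompt improver described above can be pictured as a pre-flight check on the prompt. The notes do not say how the analysis works, so the sketch below substitutes a few fixed heuristics purely for illustration; `review_prompt`, its rules, and its threshold are assumptions, not Minion's behavior.

```python
import re
from typing import List, Tuple


def review_prompt(prompt: str) -> Tuple[bool, List[str]]:
    """Return (flag_as_low_quality, suggestions) for a task prompt.

    Hypothetical heuristics: length, a file or path reference, and a
    stated way to verify the change.
    """
    suggestions: List[str] = []
    if len(prompt.split()) < 8:
        suggestions.append("Describe the task in more detail.")
    if not re.search(r"\w+\.\w+|/", prompt):
        suggestions.append("Name the files or directories involved.")
    if not re.search(r"\b(test|verify|validate)\w*", prompt, re.IGNORECASE):
        suggestions.append("Say how the change should be verified.")
    # Flag the prompt when two or more checks fail (arbitrary threshold).
    return len(suggestions) >= 2, suggestions
```

A prompt such as "fix the bug" fails all three checks and would be flagged before any agent time is spent on it.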
Managing the Review Burden: Code Inbox and U Review
- As agent-generated code volume surged, code review became a bottleneck. Uber built two internal tools to address different parts of the problem.
- Code Inbox: A unified, noise-reducing inbox for PRs requiring review.
- Uses smart assignment based on code ownership, compliance requirements, reviewer history, time zone, and calendar availability.
- Enforces strict SLOs for review times with automatic reassignment and escalation.
- Batches Slack notifications, respects focus time and holidays, and integrates into existing team Slack workflows.
- Analyzes change risk (surface area, blast radius, service criticality) so reviewers can apply appropriate scrutiny.
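As a rough illustration of the smart-assignment idea, the sketch below gates candidates on code ownership and then prefers reviewers who are available now, breaking ties by review history. The `Reviewer` fields and the ordering of the criteria are assumptions; the notes list the signals but not how they are weighed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reviewer:
    name: str
    owns_code: bool             # code ownership / compliance gate
    past_reviews_in_area: int   # reviewer history
    in_working_hours: bool      # derived from time zone
    calendar_free: bool         # derived from calendar availability


def pick_reviewer(candidates: List[Reviewer]) -> Optional[Reviewer]:
    """Pick one reviewer, or None if nobody is eligible."""
    eligible = [r for r in candidates if r.owns_code]
    if not eligible:
        return None

    def score(r: Reviewer):
        # Tuples compare left to right: availability first, then history.
        available_now = r.in_working_hours and r.calendar_free
        return (available_now, r.in_working_hours, r.past_reviews_in_area)

    return max(eligible, key=score)
```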
- U Review: An AI-powered code review pipeline that generates high-quality review comments.
- A preprocessor runs plugins including defect bots, best practice checkers, and MCP-connected organizational knowledge.
- An API layer lets external review tools (CodeRabbit, Graphite, etc.) plug in alongside internal bots to minimize duplicate comments.
- A review grader filters out low-value “nit” comments, surfacing only high-confidence, actionable feedback.
- Each layer uses a different model, selected by evaluation performance. The system matured over most of 2025 and now produces higher-quality comments at a higher rate while developers continue to act on a large share of them, which indicates the comments are useful rather than noise.
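The pipeline shape described above (plugins produce comments, duplicates across tools are collapsed, and a grader drops nits and low-confidence feedback) might look like the following. This is a sketch under assumptions: the `Comment` fields, the `Plugin` signature, the duplicate key, and the 0.8 threshold are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class Comment:
    file: str
    line: int
    text: str
    source: str        # which bot or external tool produced it
    confidence: float  # hypothetical grader score in [0, 1]
    is_nit: bool = False


# A plugin takes a diff and yields candidate comments (hypothetical signature).
Plugin = Callable[[str], Iterable[Comment]]


def run_review(diff: str, plugins: List[Plugin],
               min_confidence: float = 0.8) -> List[Comment]:
    seen = set()
    kept: List[Comment] = []
    for plugin in plugins:
        for comment in plugin(diff):
            # Collapse duplicates raised by different tools on the same line.
            key = (comment.file, comment.line, comment.text.strip().lower())
            if key in seen:
                continue
            seen.add(key)
            # Grader step: drop nits and low-confidence comments.
            if comment.is_nit or comment.confidence < min_confidence:
                continue
            kept.append(comment)
    return kept
```

Only the first tool to raise a given issue gets its comment posted, so plugin order doubles as a priority order in this sketch.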
Test Generation: Autocover
- Autocover is Uber’s custom test generation agent built on the internal LangFX SDK (based on LangChain), producing significantly higher-quality tests than generic agents.
- Approximately 5,000 Autocover-generated tests are merged per month across the company, at roughly 3x the quality of tests from generic agents.
- Includes a critic engine that validates test quality, which was separated into an independent test validator that developers can use for both AI-generated and human-generated tests.
- This addresses concerns about bad-quality “change detector” tests inflating coverage metrics without providing real confidence.
Large-Scale Code Migration: Auto Migrate and Shepherd
- Inspired by how companies like Google and Meta handle large-scale changes, Uber launched a program called Auto Migrate with four components: problem identification, code transformation, validation, and campaign management.
- Code transformation can use deterministic tools like OpenRewrite or AI agents like Minion.
- Validation relies on CI, unit tests, and sometimes staging/production signals rather than solely human review.
- Shepherd is the campaign management platform that orchestrates large migrations.
- A web UI and YAML-based configuration let migration authors define prompts or scripts.
- Shepherd auto-generates PRs, refreshes them on a defined cadence, notifies the right code owners, and integrates with Code Inbox for review queuing.
- Demo 1 (deterministic): Shepherd used OpenRewrite to generate PRs upgrading Java services to Java 21, correctly scoping each PR to the relevant code owners.
- Demo 2 (agent-based): Performance analysis tools identified issues, Minion generated diffs, and Shepherd managed the resulting PRs with standardized verification and review instructions.
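The owner-scoped PR generation described above can be sketched as a grouping step: each changed file is mapped to its code owner, and one PR is planned per owner. The path-prefix ownership map and the names `owner_for` and `plan_prs` are assumptions for illustration; the notes do not specify how Shepherd resolves ownership.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def owner_for(path: str, owners: Dict[str, str]) -> str:
    """Resolve a file to an owner; the longest matching path prefix wins."""
    matches = [prefix for prefix in owners if path.startswith(prefix)]
    return owners[max(matches, key=len)] if matches else "unowned"


def plan_prs(changed_files: Iterable[str],
             owners: Dict[str, str]) -> Dict[str, List[str]]:
    """Group changed files into one planned PR per code owner."""
    prs: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(changed_files):
        prs[owner_for(path, owners)].append(path)
    return dict(prs)
```

Keeping each PR within a single owner's code means a reviewer only sees changes they are responsible for, which is what makes a migration spanning many services reviewable.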
Non-Technical Challenges
- Technology churn vs. organizational commitment: The AI landscape shifts rapidly (new models, new vendors), but Uber’s engineering investments require committing dozens of people for months. Mitigation strategies include building abstraction layers so models and tools can be swapped without rewriting everything, and cultivating a culture that accepts today’s custom tool may be replaced by a better vendor solution tomorrow.
- Example: If Cursor builds test coverage natively, Uber’s Autocover may become obsolete—and that’s acceptable if the outcome is delivered.
- Legacy infrastructure and adoption friction: Integrating 10–15 years of heterogeneous code (sophisticated and archaic) into AI-accessible endpoints like MCPs is technically difficult. Even when the technology works powerfully (e.g., VPs landing code for the first time in years during a 24-minute demo), adoption has been slower than expected because it requires developers to fundamentally change how they work.
- Top-down mandates had limited success; the more effective approach has been peer-driven adoption—sharing wins through key promoter engineers, because engineers trust other engineers more than leadership directives.
Measurement and Cost
- Activity metrics are strong but don’t prove business impact: Developer NPS and self-reported productivity are at all-time highs. The gap between casual and power users (20+ days/month) has widened dramatically since the Minion platform and strong models (Sonnet, Opus) launched. However, these are activity metrics, not business outcomes.
- The CFO has asked for revenue impact, not diff counts. This remains an unsolved measurement problem.
- Uber’s current approach is to instrument the feature development pipeline (design → experiment launch) to measure how AI accelerates time-to-production.
- Costs have increased 6x since 2024, moving from self-fundable to requiring CFO approval. GPU and memory costs are the primary drivers.
- Uber is optimizing by routing tasks to appropriate models: higher-reasoning models for planning, lower-cost models for execution, with the infrastructure making these choices transparently to reduce developer friction.
- New tool introductions (JetBrains AI, Warp) add further cost complexity and require ongoing evaluation and adjustment.
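The model-routing approach above (higher-reasoning models for planning, lower-cost models for execution, chosen by the infrastructure rather than the developer) can be sketched as a small lookup with a cheap default. The phase names and model labels here are placeholders, not real model identifiers, and `ModelRouter` is a hypothetical class.

```python
from dataclasses import dataclass, field
from typing import Dict


def _default_routes() -> Dict[str, str]:
    # Placeholder labels; substitute whatever the model gateway exposes.
    return {"planning": "high-reasoning-model", "execution": "low-cost-model"}


@dataclass
class ModelRouter:
    routes: Dict[str, str] = field(default_factory=_default_routes)
    default_phase: str = "execution"

    def model_for(self, phase: str) -> str:
        """Pick a model for a task phase; unknown phases get the cheap default."""
        return self.routes.get(phase, self.routes[self.default_phase])

    def swap(self, phase: str, model: str) -> None:
        """Replace the model behind a phase without touching callers."""
        self.routes[phase] = model
```

Callers ask for a phase rather than a model, so a model can be swapped in one place when vendors or prices change.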