Ramp: Lessons from Building a New AI Product - The Pragmatic Summit

The Pragmatic Engineer 36min 5 min #72

Summary

  • Ramp’s AI-first transformation: Ramp is a finance platform with 50,000+ customers that automates expense management, accounting, and financial workflows using AI agents. The company has shifted from building many narrow AI tools to a single agent architecture with many skills, centered around a conversational UX called OmniChat. The policy agent — one of Ramp’s most popular AI products — is a case study in how to build, iterate, and deploy AI agents that reason over receipts, transaction data, and company expense policies to autonomously approve or reject expenses.

The policy agent: what it does and why it matters

  • The problem it solves: Finance teams at large companies review hundreds or thousands of receipts daily, manually checking whether each transaction complies with company expense policy. This is tedious, error-prone, and doesn’t scale.
  • How it works: The policy agent uses LLMs to reason over receipt images and transaction data — extracting details like number of guests, merchant, amount, and timing — then compares those against the company’s written expense policy to approve, reject, or flag transactions.
    • Example: It correctly identified 8 guests on a receipt, verified the per-person amount was under the $80 cap, and approved a team welcome dinner.
    • Example: It rejected a $3 bakery charge because it wasn’t an overtime purchase and didn’t occur on a weekend.
  • The key insight: Instead of hard-coding deterministic rules for every policy variation, Ramp treats the company’s natural-language expense policy document itself as the ruleset — “English is the new programming language” — and lets the LLM reason over it.
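
A minimal sketch of the "policy document as ruleset" idea, not Ramp's implementation: the written policy and the extracted transaction facts go into one prompt, and the model's reply is parsed into a decision. The names `Transaction`, `build_policy_prompt`, `review`, and the `call_llm` callable are hypothetical, and the policy text, merchant, and dollar total are made-up illustration values (only the 8 guests and the $80 cap come from the example above).

```python
from dataclasses import dataclass
from typing import Callable, Tuple

VALID_DECISIONS = {"APPROVE", "REJECT", "FLAG"}


@dataclass(frozen=True)
class Transaction:
    """Facts assumed to be already extracted from the receipt (hypothetical shape)."""
    merchant: str
    amount_usd: float
    guests: int
    memo: str


def build_policy_prompt(policy_text: str, txn: Transaction) -> str:
    """Place the natural-language policy and the transaction facts in one prompt."""
    return (
        "You review expenses against the company policy below.\n"
        f"POLICY:\n{policy_text}\n\n"
        "TRANSACTION:\n"
        f"- merchant: {txn.merchant}\n"
        f"- amount_usd: {txn.amount_usd:.2f}\n"
        f"- guests: {txn.guests}\n"
        f"- memo: {txn.memo}\n\n"
        "Reply with APPROVE, REJECT, or FLAG, then a one-sentence reason."
    )


def review(policy_text: str, txn: Transaction,
           call_llm: Callable[[str], str]) -> Tuple[str, str]:
    """Ask the model for a decision; fall back to FLAG if the reply is unparseable."""
    reply = call_llm(build_policy_prompt(policy_text, txn)).strip()
    first_word, _, reason = reply.partition(" ")
    decision = first_word.upper().rstrip(":,.")
    if decision not in VALID_DECISIONS:
        return "FLAG", f"unparseable model reply: {reply!r}"
    return decision, reason.strip()


# Illustration only: a canned reply stands in for a real model call.
def fake_llm(prompt: str) -> str:
    return "APPROVE 8 guests at $75.00 per person is under the $80 cap."


policy = "Team meals are reimbursable up to $80 per person."
dinner = Transaction("Example Trattoria", 600.00, 8, "team welcome dinner")
decision, reason = review(policy, dinner, fake_llm)
```

Changing the company's rules in this sketch means editing `policy`, not the code. The fallback to FLAG is a design assumption: an unreadable reply is routed to a human instead of being guessed at.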

How they built it: start small, iterate fast

  • Started with a constrained problem: Rather than trying to automate all finance reviews at once, the team started with the simplest possible case: deciding whether a coffee purchase should be approved or rejected. These are low-risk transactions of only a few dollars each.
  • Progressive complexity: The system evolved through clear stages:
    1. Simple pipeline: retrieve context → pass through a series of well-defined LLM calls → output approve/reject with reasoning.
    2. Conditional prompting: classify expense type (travel, meal, entertainment) first, then retrieve relevant context and apply specialized logic.
    3. Full agentic workflow: complex tools that can read across the entire Ramp platform, shared across all agents via an internal toolbox, with the ability to write decisions and auto-approve expenses in a loop.
  • The tradeoff: As systems grow more capable and autonomous, they become less traceable and explainable — a “bigger black box.” The team accepts this but invests heavily in auditability from day one.
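
Stage 2 (conditional prompting) can be sketched roughly as follows. This is an illustration under stated assumptions, not Ramp's code: `classify_expense`, `build_review_prompt`, and the `call_llm` callable are hypothetical names, and the per-type instruction strings are invented placeholders. Only the three expense types come from the list above.

```python
from typing import Callable

# Invented placeholder instructions, one per expense type named in the talk.
SPECIALIZED_INSTRUCTIONS = {
    "travel": "Check the booking details and trip dates against the travel policy.",
    "meal": "Check the guest count and the per-person amount.",
    "entertainment": "Check the business purpose and the attendee list.",
}
GENERAL_INSTRUCTIONS = "Apply the general policy and flag anything unclear."


def classify_expense(description: str, call_llm: Callable[[str], str]) -> str:
    """First LLM call: map the expense to a known type, or 'unknown'."""
    prompt = (
        f"Classify this expense as one of {sorted(SPECIALIZED_INSTRUCTIONS)}. "
        f"Reply with the label only.\nExpense: {description}"
    )
    label = call_llm(prompt).strip().lower()
    return label if label in SPECIALIZED_INSTRUCTIONS else "unknown"


def build_review_prompt(description: str, call_llm: Callable[[str], str]) -> str:
    """Second step: choose type-specific instructions for the review call."""
    expense_type = classify_expense(description, call_llm)
    instructions = SPECIALIZED_INSTRUCTIONS.get(expense_type, GENERAL_INSTRUCTIONS)
    return f"Expense type: {expense_type}\n{instructions}\nExpense: {description}"


# Illustration only: canned classifier replies stand in for a real model.
meal_prompt = build_review_prompt("Dinner for 8 guests", lambda p: " Meal\n")
other_prompt = build_review_prompt("Office plant", lambda p: "gardening")
```

Routing on a classification keeps each prompt short and specific, and unrecognized labels fall through to the general instructions rather than raising an error.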

Ground truth labeling and evals

  • Weekly cross-functional labeling sessions: Engineers, PMs, and finance experts met weekly to label 100 data points, creating a ground truth dataset. This had two major benefits:
    • A reliable test set to measure agent performance.
    • Shared understanding across functions of what “correct” means, reducing miscommunication.
  • Built their own labeling tool: After finding third-party tools too specific or too general, the team built an internal tool using Streamlit in a single session. It’s low-maintenance, deploys in seconds, and non-engineers can modify it themselves.
  • Evals strategy:
    • Started with just 5 test cases, grew over time — don’t let perfectionism block early progress.
    • Made evals easy to run (anyone can run a command) and easy to interpret.
    • Integrated into CI so regressions are caught before merge.
    • Online evals track real-world health metrics, like the rate of “unsure” decisions (agent lacked enough information).
    • Evals enable confident model upgrades — when a new model drops (e.g., Opus 4.6, GPT-5.3), the team can benchmark quickly and switch with a config change.
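
A small sketch of what an offline eval run and a CI gate could look like, assuming a labeled ground truth set and an agent exposed as a plain function. `LabeledCase`, `run_evals`, `passes_gate`, the toy agent, and the five cases are all hypothetical; the talk does not describe Ramp's harness at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LabeledCase:
    case_id: str
    expense: str
    expected: str  # label agreed on in a ground truth labeling session


def run_evals(cases: List[LabeledCase], agent: Callable[[str], str]) -> Dict:
    """Compare agent decisions with labels; report accuracy and the unsure rate."""
    if not cases:
        raise ValueError("need at least one labeled case")
    failures = []
    unsure = 0
    for case in cases:
        got = agent(case.expense)
        if got == "UNSURE":
            unsure += 1
        if got != case.expected:
            failures.append((case.case_id, case.expected, got))
    total = len(cases)
    return {
        "accuracy": (total - len(failures)) / total,
        "unsure_rate": unsure / total,
        "failures": failures,
    }


def passes_gate(report: Dict, min_accuracy: float) -> bool:
    """CI check: block the merge when accuracy drops below the threshold."""
    return report["accuracy"] >= min_accuracy


# Illustration only: a keyword rule stands in for the real agent.
def toy_agent(expense: str) -> str:
    if "coffee" in expense:
        return "APPROVE"
    if "flight" in expense:
        return "REJECT"
    return "UNSURE"


cases = [
    LabeledCase("c1", "coffee $4", "APPROVE"),
    LabeledCase("c2", "coffee $5", "APPROVE"),
    LabeledCase("c3", "first-class flight", "REJECT"),
    LabeledCase("c4", "team dinner, 8 guests", "APPROVE"),
    LabeledCase("c5", "unlabeled charge", "UNSURE"),
]
report = run_evals(cases, toy_agent)
```

Because the agent is passed in as a function, benchmarking a new model is a matter of calling `run_evals` again with a different `agent` and comparing the two reports. The failure list keeps the output easy to interpret: each entry names the case, the expected label, and what the agent returned.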

Building trust with customers

  • Started with suggestions, not auto-approvals: Even with large enterprise design partners, the team initially only provided recommendations, not autonomous actions.
  • Autonomy slider: Customers who gained trust over time could opt into auto-approvals for transactions under a threshold (e.g., anything under $200).
  • In-product feedback loops: Users can update their expense policy document directly and see changes reflected immediately. This feedback loop builds trust and lets customers personalize the agent’s behavior.
  • Cultural lesson: AI products cannot be “one-shotted.” Teams must align early that day-one imperfection is acceptable, and iteration is the path to quality.
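
The suggestion-first default and the autonomy slider can be sketched as a single routing function. This is an illustration with hypothetical names (`AutonomySettings`, `route_decision`); it assumes that only approvals are ever automated and that "under" means strictly below the threshold, neither of which the talk states explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AutonomySettings:
    # None means the customer has not opted in: suggestions only.
    auto_approve_under_usd: Optional[float] = None


def route_decision(agent_decision: str, amount_usd: float,
                   settings: AutonomySettings) -> str:
    """Return AUTO_APPROVE only for opted-in, under-threshold approvals."""
    limit = settings.auto_approve_under_usd
    if agent_decision == "APPROVE" and limit is not None and amount_usd < limit:
        return "AUTO_APPROVE"
    # Everything else is shown to a human reviewer as a recommendation.
    return "SUGGEST"


suggestions_only = AutonomySettings()
opted_in = AutonomySettings(auto_approve_under_usd=200.0)
```

With `suggestions_only`, every decision is a recommendation; with `opted_in`, an approved $45 expense is handled automatically while a $200 expense or any rejection still goes to a person.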

Infrastructure: the applied AI surface

  • Centralized LLM proxy service: Ramp built an internal service (similar to LiteLLM) that abstracts away model provider differences, giving product teams three key capabilities:
    • Structured output and consistent SDKs across model providers (GPT, Claude, Gemini, etc.).
    • Batch processing and workflow handling for evals and bulk data analysis, with rate limit management built in.
    • Cost tracing across teams and products, enabling Pareto analysis of model performance vs. cost.
  • One-line model switching: When a new frontier model is released, a single config change propagates across all downstream SDKs. Teams don’t need to update dozens of call sites.
  • Internal tool catalog: Hundreds of reusable tools (get policy snippet, per diem rate, recent transactions, etc.) built alongside product teams. These can be used across agents and in new product prototypes without rebuilding from scratch.
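
One way to picture one-line model switching and per-team cost tracing is a thin proxy that resolves an alias to a concrete model at call time. The sketch below is an assumption-laden illustration, not Ramp's service or LiteLLM's API: `LLMProxy`, the alias table, the model names, the prices, and the fake backends are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

# A backend takes a prompt and returns (completion text, tokens used).
Backend = Callable[[str], Tuple[str, int]]


class LLMProxy:
    """Resolves model aliases at call time and records spend per team."""

    def __init__(self, aliases: Dict[str, str], backends: Dict[str, Backend],
                 usd_per_1k_tokens: Dict[str, float]) -> None:
        self.aliases = aliases
        self.backends = backends
        self.usd_per_1k_tokens = usd_per_1k_tokens
        self.cost_by_team: Dict[str, float] = defaultdict(float)

    def complete(self, alias: str, prompt: str, team: str) -> str:
        model = self.aliases[alias]  # looked up on every call
        text, tokens = self.backends[model](prompt)
        self.cost_by_team[team] += tokens / 1000 * self.usd_per_1k_tokens[model]
        return text


# Illustration only: fake backends with fixed token counts and made-up prices.
aliases = {"default": "provider-a/model-v1"}
proxy = LLMProxy(
    aliases=aliases,
    backends={
        "provider-a/model-v1": lambda prompt: (f"v1:{prompt}", 500),
        "provider-a/model-v2": lambda prompt: (f"v2:{prompt}", 500),
    },
    usd_per_1k_tokens={"provider-a/model-v1": 2.0, "provider-a/model-v2": 4.0},
)

before = proxy.complete("default", "hello", team="policy-agent")
aliases["default"] = "provider-a/model-v2"  # the one-line switch
after = proxy.complete("default", "hello", team="policy-agent")
```

Call sites only ever name the alias, so the switch happens in one place. Recording cost at the same choke point is what makes a per-team or per-product cost breakdown possible without instrumenting each caller.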

Ramp Inspect: internal background coding agent

  • The problem it addresses: Engineers and cross-functional teams are fragmented across Datadog, Slack, Notion, incident.io, production databases, and tribal knowledge. AI coding tools like Codex or Claude Code don’t have access to this context.
  • What Ramp Inspect does: A background coding agent that integrates with all internal context sources — logs, databases, Slack threads, Notion docs, CI — and can autonomously fix bugs, make copy changes, write queries, and open PRs.
    • Spins up fast, isolated Modal code sandboxes with the same environment as Ramp’s production setup.
    • Multiplayer-first design: non-engineers (PMs, design, legal, marketing, CX) can pair with it, give feedback, and level up their prompting skills.
    • Can be triggered via Kanban UI, API, or Slack thread (ingesting full conversation context).
    • Full VS Code environment with VNC, Chrome DevTools, and MCP for full-stack work.
    • Access to 150,000+ tests; can detect and patch CI failures before notifying the user.
  • Impact: Over 50% of PRs merged to production now go through Ramp Inspect. Usage spans engineering, product, design, risk, legal, corporate finance, marketing, and CX.
  • Open sourced: The blueprint is published at builders.ramp.com, and an open-source implementation is available on GitHub as “open inspect.”

Culture shift: what separates great teams in the AI era

  • The real bottleneck was never coding speed: A Harvard study on AI and engineering hiring trends suggests the impact isn’t just about junior vs. senior — it’s about the qualities that distinguish high-impact teams.
  • Team A (high-impact): Cares about impact, handles ambiguous problems, understands product/business/data, adopts new tools, finds creative solutions, obsesses over user experience.
  • Team B (low-impact): Debates libraries, adds process when things feel chaotic, complains about headcount, bikesheds details, builds before understanding the problem, focuses on performative code quality.
  • What becomes more important with AI:
    • Figuring out what to build and understanding users deeply.
    • Selling ideas to skeptical stakeholders.
    • Making good design decisions with incomplete information.
    • Maintaining momentum through the long middle of a project — where most AI hype glosses over the hard work of achieving product-market fit.
  • The risk: AI lets you build the wrong thing faster and create bigger messes. Judgment, context, and scar tissue matter more, not less.

Where this leads

  • Software is perpetually unfinished — unlike factory work or farming, there’s always more to build. AI creates extra capacity that gets redirected, not used to shrink teams.
  • Four expected outcomes:
    1. Companies chase opportunities they couldn’t previously afford to pursue.
    2. Companies enter adjacent markets, stitching together more customer value.
    3. Companies rebuild systems that were too expensive to touch before.
    4. The bar for “good enough” rises — more mind-blowing user experiences become the expectation.