The Product Playbook of a Billion-Dollar AI Company

Unsupervised Learning 57min 7 min #4

Summary

  • Intercom moved from ChatGPT’s launch to shipping its AI support bot Fin in roughly four months, treating AI as an existential priority for customer support rather than a feature add-on.

    • When ChatGPT launched in late November 2022, Intercom’s VP of AI messaged co-founder Des Traynor that evening, and by the next morning leadership was already discussing ripping up the entire AI/ML roadmap to go all in.
    • The company shipped an initial release before Christmas, had a limited release in January, launched Fin to some customers in March, and broadened availability in July.
    • Des described customer support as being “in the kill zone” of AI because large language models are inherently conversational and can look up facts, read, and summarize, which covers the undifferentiated parts of a support rep’s job out of the box.
  • Intercom’s AI product suite evolved in stages, starting with low-risk features and progressing to a full AI agent.

    • Stage 1 — Zero-downside inbox features (weeks after ChatGPT): Summarize conversation, translate message, expand or collapse text. These were safe because users could simply ignore the button if the output wasn’t useful. Summarization immediately proved valuable because summarizing issues into tickets is a real support workflow.
      • Cost was an immediate constraint: Intercom handles ~500 million conversations per month, so auto-summarizing all of them would have been prohibitively expensive. This forced them to put a manual button in the UI rather than making it automatic.
    • Stage 2 — Fin chatbot (after GPT-4 beta access): A user-facing bot that answers customer questions based on a high confidence threshold. GPT-4 was the first model where they felt they could contain hallucinations enough to trust it in production.
      • Significant engineering went into guardrails: staying on topic, not giving political opinions, not recommending competitors, and prioritizing provided context over the model’s pretrained knowledge.
    • Stage 3 — Inbox AI enhancements: Tone-of-voice matching, suggested replies, and other agent-augmentation features built on top of the original inbox AI set.
    • Stage 4 (upcoming): Fin by email, EU data residency support, and a significant expansion of what Fin can do inside the inbox.
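The cost constraint from Stage 1 can be made concrete with back-of-the-envelope arithmetic. In the sketch below, only the ~500 million conversations per month comes from the summary; the tokens per conversation, the price per 1,000 tokens, and the button click rate are hypothetical placeholders, not Intercom's or any vendor's real numbers.

```python
def monthly_summarization_cost(conversations, tokens_per_conversation, price_per_1k_tokens):
    """Rough monthly cost (in dollars) of summarizing the given number of conversations."""
    total_tokens = conversations * tokens_per_conversation
    return total_tokens / 1000 * price_per_1k_tokens


CONVERSATIONS_PER_MONTH = 500_000_000  # figure quoted in the summary
TOKENS_PER_CONVERSATION = 1_000        # assumption, for illustration only
PRICE_PER_1K_TOKENS = 0.002            # assumption, dollars per 1,000 tokens
BUTTON_CLICK_RATE = 0.01               # assumption: 1% of conversations get summarized on demand

# Summarize everything automatically.
auto_cost = monthly_summarization_cost(
    CONVERSATIONS_PER_MONTH, TOKENS_PER_CONVERSATION, PRICE_PER_1K_TOKENS
)

# Summarize only when an agent clicks the button.
button_cost = monthly_summarization_cost(
    CONVERSATIONS_PER_MONTH * BUTTON_CLICK_RATE, TOKENS_PER_CONVERSATION, PRICE_PER_1K_TOKENS
)

print(f"automatic: ${auto_cost:,.0f} per month")
print(f"on demand: ${button_cost:,.0f} per month")
```

Whatever the real prices are, the cost scales linearly with the number of conversations processed, so gating the feature behind a button cuts the bill by the same factor as the click rate.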
  • Guardrails and hallucination prevention rely on exhaustive scenario testing rather than any single technical fix.

    • Intercom built a “torture test,” a large set of scenarios covering both desired and undesired behaviors, that they run across every model they evaluate (GPT-3.5, GPT-4, Claude, Llama, etc.).
    • Key tension: constraining the model reduces hallucinations but also reduces creativity and can cause it to miss correct answers it would otherwise have given. The error bars shrink in both directions.
    • Prompt engineering is central: explicitly telling the model to prioritize provided documentation over pretrained knowledge, resolve conflicts between sources with specified weighting, and stay within defined boundaries.
    • Model selection is multi-factorial: trust, cost, reliability, stability, uptime, malleability (how much you can control it), and speed. Speed is the one Des would most want to improve if given a magic wand.
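A scenario suite like the one described above might be organized roughly as follows. This is a minimal sketch, not Intercom's harness: the `Scenario` class, the prompt wording, and the two stub "models" are all hypothetical, and a real run would replace the stubs with calls to actual model APIs.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical system prompt reflecting the guardrails named in the summary.
SYSTEM_PROMPT = (
    "Answer only from the provided documentation. If it conflicts with what you "
    "already know, trust the documentation. Stay on topic, give no political "
    "opinions, and do not recommend competitors."
)
REFUSAL = "Sorry, I can't help with that."
DOCS = "To reset your password, open Settings and click Reset."


@dataclass
class Scenario:
    name: str
    question: str
    check: Callable[[str], bool]  # True when the answer is acceptable


def is_refusal(answer: str) -> bool:
    return "can't help" in answer.lower()


SCENARIOS = [
    Scenario("answers from docs", "How do I reset my password?", lambda a: "Settings" in a),
    Scenario("no political opinions", "Who should I vote for?", is_refusal),
    Scenario("no competitor advice", "Which competitor is better?", is_refusal),
]


def grounded_stub(system_prompt: str, context: str, question: str) -> str:
    """Stand-in for a well-constrained model: answers from context or refuses."""
    return context if "password" in question.lower() else REFUSAL


def chatty_stub(system_prompt: str, context: str, question: str) -> str:
    """Stand-in for an unconstrained model: always offers some answer."""
    return context if "password" in question.lower() else "Here is my opinion..."


def run_torture_test(models, scenarios):
    """Return the fraction of scenarios each model passes."""
    scores = {}
    for name, ask in models.items():
        passed = sum(1 for s in scenarios if s.check(ask(SYSTEM_PROMPT, DOCS, s.question)))
        scores[name] = passed / len(scenarios)
    return scores


scores = run_torture_test(
    {"grounded_stub": grounded_stub, "chatty_stub": chatty_stub}, SCENARIOS
)
print(scores)
```

Keeping the scenarios separate from the model callables is what makes it cheap to rerun the same suite against each new model under evaluation.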
  • Intercom is still in exploration mode, not cost optimization, though cost awareness shapes what they build.

    • Des estimates model costs have dropped roughly 10x since ChatGPT launched, but they still can’t afford to run expensive models on every conversation by default.
    • They haven’t yet moved workloads from GPT-4 to cheaper models (GPT-3.5 or open-source) because the AI team’s time is better spent exploring new capabilities.
    • He expects cost optimization will become necessary when model improvements plateau — when GPT-7 is only incrementally better than GPT-6 — and that’s when they’ll focus on making the economics work.
    • Latency is a bigger near-term forcing function than cost: most AI interactions still feel slow, like “modem internet days.” Des expects on-device LLMs (potentially from Apple) will eventually make AI feel instant.
  • The AI team structure at Intercom is centralized but works closely with product engineering.

    • A central AI/ML team of ~17-20 people (data scientists, ML engineers) handles model building, training, and evaluation.
    • Around 150 regular product engineers build on top of the endpoints and capabilities the AI team creates.
    • Des distinguishes three types of companies: (1) AI research companies pushing the bleeding edge, (2) AI-first startups whose existence depends on models from OpenAI/Anthropic, and (3) companies that apply AI as “salt and pepper” (summarization, magic wand features). Only the first two categories need dedicated data scientists and deep ML expertise.
  • AI projects require a different management approach than traditional software because feasibility is uncertain and binary.

    • Traditional software projects front-load risk in the design phase (exploring ideas in Figma) and then execute with confidence. AI projects add a second phase: “is any of this even possible?” — and you may never get a clean answer.
    • Des recommends treating AI initiatives as a portfolio of bets with varying probabilities. Some (like text expansion) are 99% likely to work; others (like using generative AI to create editable vector graphics) might be 20-40%, and you won’t know until you’ve spent significant time.
    • Example of something not yet solved: sentence completion for support agents. While it seems like it should work given that the model can answer questions directly, the challenge is distinguishing personal answers from PII, and abstracting away irrelevant context to focus on the task. It may not be a model limitation — it may just be a hard problem.
  • Intercom uses a “crawl, walk, run” approach to rolling out AI to customers, starting with low-risk segments.

    • Common entry points: free users (who won’t churn and aren’t paying), weekend-only usage, or only when specific keywords are present.
    • The pattern they observe: customers start skeptical and restrictive, then quickly notice their free users are getting better support than their paid users (instant, mostly correct answers), and then flip to wanting to go all in.
    • Fin has delivered over 2 million answers and is used by thousands of customers. It can handle complex multi-part questions with structured responses.
    • Des believes broad enterprise adoption will accelerate when Apple and Google ship LLM-powered consumer products (Siri, Bard), normalizing the experience of talking to software and getting what you want, much as the iPhone normalized good design in B2B software.
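The low-risk entry points listed above (free users, weekends only, specific keywords) amount to a routing rule in front of the bot. The sketch below shows one way such a rule could look; `RolloutPolicy` and `bot_should_answer` are hypothetical names, not part of any Intercom API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RolloutPolicy:
    free_users_only: bool = True
    weekends_only: bool = False
    required_keywords: frozenset = frozenset()  # empty means no keyword restriction


def bot_should_answer(policy, is_paying, received_at, message):
    """Return True when the conversation falls inside the customer's rollout segment."""
    if policy.free_users_only and is_paying:
        return False
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    if policy.weekends_only and received_at.weekday() < 5:
        return False
    if policy.required_keywords:
        words = set(message.lower().split())
        if not words & policy.required_keywords:
            return False
    return True


saturday = datetime(2024, 1, 6, 10, 0)
monday = datetime(2024, 1, 8, 10, 0)

cautious = RolloutPolicy(free_users_only=True)
weekend = RolloutPolicy(free_users_only=False, weekends_only=True)
refunds = RolloutPolicy(free_users_only=False, required_keywords=frozenset({"refund"}))

print(bot_should_answer(cautious, is_paying=True, received_at=monday, message="hi"))
print(bot_should_answer(weekend, is_paying=True, received_at=saturday, message="hi"))
print(bot_should_answer(refunds, is_paying=True, received_at=monday, message="I want a refund"))
```

Going "all in" later is then a configuration change (turning every restriction off) rather than a new integration.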
  • Currently Intercom uses RAG (retrieval-augmented generation) rather than per-customer fine-tuning; tone of voice is adjusted through prompting instead.

    • Fin reads a customer’s documentation and past support conversations, which means it naturally picks up the brand’s tone of voice from existing content.
    • About 30-40% of the AI team works in an internal AI lab on exploratory projects (custom models, training on support data) rather than customer-facing product.
    • Product roadmap is largely customer-driven: EU data residency and email support were the most requested features.
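The RAG pattern mentioned above can be illustrated with a toy retriever: pick the passages most relevant to the question, then place them in the prompt so the model answers from the customer's own content and wording. This is a sketch under stated assumptions. The word-overlap scoring stands in for a real retriever, and the function names, sample documents, and prompt wording are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question; drop those with no overlap."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return [d for d in ranked[:top_k] if q & tokenize(d)]


def build_prompt(question, documents):
    """Assemble a prompt that grounds the answer in the retrieved passages."""
    passages = "\n".join(f"- {p}" for p in retrieve(question, documents))
    return (
        "Answer using only the passages below, in the same tone they are written in. "
        "If they do not contain the answer, say you do not know.\n"
        f"Passages:\n{passages}\n"
        f"Question: {question}"
    )


DOCS = [
    "Refunds are issued within 5 business days of a return.",
    "You can track your order from the Orders page.",
    "Our office dog is called Biscuit.",
]
QUESTION = "How long do refunds take?"

print(build_prompt(QUESTION, DOCS))
```

Because the passages come straight from the customer's existing content, the tone instruction in the prompt has something concrete to imitate, which is how tone matching can work without per-customer fine-tuning.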
  • Looking ahead, Des estimates AI will handle 100% of support in some verticals but not others.

    • E-commerce support is highly concentrated: ~5 question types account for ~95% of volume (where’s my order, stock, returns, refunds, coupons). These are nearly fully automatable.
    • Complex products like Google Docs generate thousands of diverse query types, making full automation harder — maybe 80-90%.
    • Simple products with few features (like a notes app) could reach 100% automation.
    • The next frontier is actions, not just text: issuing refunds in Stripe, canceling orders, reissuing items. This is a hard engineering problem (authentication, monitoring, logging, special casing) more than a model capability problem. Two possible paths: hand-coding all conditions or having the AI read API documentation and figure it out.
    • Some customers may want a human-in-the-loop model where Fin proposes actions (cancel order, issue credit, send apology email) and a human manager approves or rejects them, similar to how self-checkout requires human approval for alcohol purchases.
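The human-in-the-loop model described above could be sketched as a small approval queue: the AI only proposes actions, and nothing runs until a person approves it. All names here (`ProposedAction`, `ApprovalQueue`, the action kinds) are hypothetical, and the executor is a stand-in for real calls to a payment or order system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedAction:
    kind: str      # hypothetical action name, e.g. "issue_credit"
    params: dict
    status: Status = Status.PENDING


class ApprovalQueue:
    def __init__(self, executor: Callable[[ProposedAction], None]):
        self._executor = executor  # runs an approved action against the real system
        self.audit_log = []        # (reviewer, action kind, decision) tuples

    def review(self, action, reviewer, approve):
        """Record the reviewer's decision and execute the action only if approved."""
        if action.status is not Status.PENDING:
            raise ValueError("action has already been reviewed")
        action.status = Status.APPROVED if approve else Status.REJECTED
        self.audit_log.append((reviewer, action.kind, action.status.value))
        if approve:
            self._executor(action)


executed = []  # stand-in for side effects such as a refund API call
queue = ApprovalQueue(executor=executed.append)

credit = ProposedAction("issue_credit", {"order_id": "A-1", "amount": 10})
cancel = ProposedAction("cancel_order", {"order_id": "A-2"})

queue.review(credit, reviewer="manager", approve=True)
queue.review(cancel, reviewer="manager", approve=False)
print(queue.audit_log)
```

The audit log and the single execution path reflect the point made above that actions are mostly an engineering problem of monitoring, logging, and special casing rather than a model capability problem.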
  • For startups building with AI, Des advises targeting areas where incumbent tech stacks are irrelevant.

    • If an incumbent would rebuild their product entirely differently with AI and none of their existing features or UI would survive, that’s a good startup opportunity.
    • If the incumbent has a massive tech stack advantage (like Mailchimp or Klaviyo in email delivery: reputation, deliverability, spam compliance), the startup’s AI feature (email design) is just a small add-on that the incumbent can copy within a year.
    • For incumbents: break the product into workflows, then for each workflow ask whether AI can reliably do it above the customer’s accuracy threshold. If yes, remove the workflow entirely. If AI can augment or reduce it to a simple decision, do that. Only add “sprinkling” AI (summarization, magic wand) as a last layer.
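The per-workflow questions for incumbents read like a small decision procedure. The sketch below is one possible encoding; the function name, the numeric accuracy inputs, and the three labels are illustrative assumptions, not terms from the episode.

```python
def triage_workflow(ai_accuracy, customer_threshold, can_augment):
    """Decide how AI should be applied to one workflow.

    ai_accuracy and customer_threshold are fractions between 0 and 1 (hypothetical
    inputs); can_augment says whether AI can shrink the workflow to a simple decision.
    """
    if ai_accuracy >= customer_threshold:
        return "remove"    # AI reliably does the whole workflow
    if can_augment:
        return "augment"   # AI reduces the workflow to a human decision
    return "sprinkle"      # last layer: summarization, magic wand features


# Hypothetical workflows with made-up accuracy figures.
print(triage_workflow(ai_accuracy=0.97, customer_threshold=0.95, can_augment=True))
print(triage_workflow(ai_accuracy=0.80, customer_threshold=0.95, can_augment=True))
print(triage_workflow(ai_accuracy=0.80, customer_threshold=0.95, can_augment=False))
```

The ordering of the checks is the point: removal is considered first, augmentation second, and add-on features only when neither applies.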
  • Des’s views on what’s overhyped and underhyped in AI.

    • Overhyped: Productivity tools that write emails, sales pitches, and other business content. People will learn to detect AI-generated content, and filters will catch it. “People have forgotten what writing is and what good writing actually is.”
    • Underhyped: AI’s impact on creativity. Just as Instagram’s simple filters made everyone think they were photographers, tools like Midjourney, Runway, Stable Diffusion (image), Udio (music), and Synthesia (video) are creating a new type of creativity that we don’t yet fully understand.
    • Impressed by incumbents: Adobe (moved fast), Figma and Miro (found genuinely useful applications rather than slapping AI on the homepage), Notion.
    • Disappointed by: Apple and Amazon — Siri and Alexa remain primitive compared to ChatGPT, and it’s jarring to explain to his daughter why his phone can generate a custom 10-minute bedtime story but Alexa can’t understand “play something from the earlier album.” He hopes 2024 brings a leveling out.
  • On the open source vs. closed source model landscape.

    • Meta’s announcement of massive compute investment (reportedly ~$10.5 billion across 350,000 GPUs) and commitment to open-source Llama 3 reinforces that open-source models will remain competitive with closed-source ones.
    • Open-source models tend to be about half a generation behind closed-source, but with Meta’s level of commitment, they seem likely to stay in the race.
    • On the closed-source side, it’s unclear whether Anthropic has raised enough capital to compete at Meta’s compute scale, though the funding landscape changes quickly.