The world of voice AI, with Mati Staniszewski of ElevenLabs

Stripe's Cheeky Pint 1h 6 min #12

Summary

  • Mati Staniszewski is co-founder of ElevenLabs, an AI audio research and product company valued at $11 billion that has become the leading platform for voice models — spanning text-to-speech, speech-to-text, voice agents, and music. He walks through the technical foundations of how voice models work, the company’s business model and explosive growth, the state of conversational AI, and how ElevenLabs is building an AI-native organization from the ground up.

How voice models work

  • Early approaches tried to mechanically replicate the human vocal tract using analog machines, then progressed to digital signal representations of speech developed at Bell Labs.
  • A key intermediate approach involved stitching together phonemes — the smallest units of human sound — from a library, using probabilistic models to select the most likely next phoneme in sequence.
  • Modern voice models borrow architectural ideas from transformers and diffusion models, predicting the next token in phoneme space while incorporating contextual text to guide how something is spoken.
  • Three core innovations made current models possible:
    • Predicting the next phoneme reliably and quickly, which wasn’t feasible before.
    • Encoding context — understanding whether a sentence is happy, sad, dialect-heavy, etc., and carrying that across the full utterance rather than generating each fragment in isolation.
    • Moving away from hard-coded parameters (e.g., “British accent,” “enthusiastic speaker”) to letting the model deduce voice characteristics, emotion, and style from data in an open-ended way.
  • Phoneme-level vs. text-level operation: Models operate on both simultaneously — phonemes for the audio output, text for understanding sentence construction — which is especially important for real-time streaming use cases like voice agents.
  • Mel spectrograms are a visual representation of speech that shows how energy is distributed across pitch (frequency on the perceptual mel scale) over time; they serve as an intermediate representation between text and the final waveform output.
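
To make the mel spectrogram point more concrete, here is a minimal Python sketch of the mel scale itself. It assumes the widely used HTK-style conversion formula; the episode does not say which formula, band count, or frequency range ElevenLabs uses, so the function names and the 0 to 8000 Hz range below are illustrative only.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(low_hz: float, high_hz: float, n_bands: int) -> list[float]:
    """Return n_bands center frequencies (Hz) spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    # Hypothetical range and band count, chosen only for illustration.
    centers = mel_band_centers(0.0, 8000.0, 8)
    print([round(c) for c in centers])
```

Because the bands are evenly spaced in mels rather than in Hz, they sit close together at low frequencies and spread out at high ones, roughly matching how human hearing resolves pitch. A mel spectrogram records the energy in each such band for every short time frame.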

What made ElevenLabs’ voices sound more human

  • Beyond architecture, the key differentiator was data quality. Most available audio data lacks proper annotation for who is speaking when, and crucially, how they are speaking (emotion, accent, style).
  • ElevenLabs built an internal team of specialized data labelers trained specifically in voice and audio description, combining semi-automatic and manual annotation techniques.
  • This investment spawned unexpected products: their speech-to-text model was originally built internally to annotate training data, then released as a customer-facing product.
  • Deploying models in production and having customers interact with them generated annotation data that further refined the models over time.

ElevenLabs business model

  • ElevenLabs is a research and product deployment company that builds foundational audio models and a platform for businesses to transform communication with customers and employees.
  • Core product areas:
    • Text-to-speech: Generating speech from text in 100+ languages.
    • Speech-to-speech: Conversational models for voice agents.
    • Speech-to-text: Transcription models supporting 100+ languages.
    • Creative tools: Audiobook narration, marketing voiceovers, brand-aligned narration.
    • Music and other audio domains.
  • The platform layer connects models to knowledge bases, telephony, integrations, and provides evaluation, monitoring, and safeguards for agents in production.
  • Horizontal vs. vertical strategy: ElevenLabs positions itself as a horizontal platform for businesses building voice into their workflows, while expecting vertical-specific application companies to form on top of their infrastructure.
  • Deployment gap: Despite capable models existing for roughly three years, consumer products (car voice control, PDF reading, phone assistants) have been slow to adopt the latest technology. Real-time voice interaction connected to contextual understanding only became viable recently.

The conversational Turing test

  • Text-based LLMs have effectively passed the Turing test, but voice LLMs have not — the orchestration of speech-to-text, turn-taking, LLM reasoning, and text-to-speech into a seamless conversational experience remains unsolved.
  • Key challenges:
    • Knowing when a user has finished speaking vs. pausing mid-sentence.
    • Deciding when to respond immediately vs. waiting vs. asking clarifying questions.
    • Handling tool calls and database lookups gracefully within a voice conversation — what the system should say while fetching information.
  • Voice agents already pass the Turing test in narrow domains like customer support calls, but broader interactive experiences (e.g., gaming with human-like dialogue) remain far off.
  • Speech-to-speech models, which map input speech directly to output speech without a text intermediate, offer lower latency but are less reliable, less controllable, and harder to debug. ElevenLabs is betting on the cascaded approach (speech-to-text → LLM → text-to-speech) for enterprise use because it provides visibility, reliability, and easier integration.
  • Speech-to-speech may find its niche in companion-style applications where latency matters more than accuracy, and hallucinations may even be acceptable or desirable.
  • An interesting behavioral observation: when ElevenLabs replaced a written signup form with a voice agent, people were more willing to complete it and provided richer, more open-ended information — suggesting voice interaction changes how people engage with systems.
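
The cascaded approach described above can be sketched in a few lines of Python. This is a toy illustration, not ElevenLabs' implementation: the three stage functions are hypothetical stand-ins (no real speech or LLM API is called), and `TurnTrace` and `run_cascaded_turn` are names invented for this example. The point is that each stage hands off plain text, which is what makes the pipeline inspectable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnTrace:
    """Intermediate results captured at each stage of one conversational turn."""
    transcript: str = ""
    reply_text: str = ""
    audio_out: bytes = b""


def run_cascaded_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> TurnTrace:
    """Run speech-to-text, then the LLM, then text-to-speech, keeping every hand-off."""
    trace = TurnTrace()
    trace.transcript = transcribe(audio_in)               # stage 1: speech-to-text
    trace.reply_text = generate_reply(trace.transcript)   # stage 2: LLM reasoning
    trace.audio_out = synthesize(trace.reply_text)        # stage 3: text-to-speech
    return trace


# Hypothetical stand-in stages so the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate_reply(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    trace = run_cascaded_turn(
        b"what are your opening hours",
        fake_transcribe,
        fake_generate_reply,
        fake_synthesize,
    )
    print(trace.transcript)
    print(trace.reply_text)
```

Since the transcript and the reply text are both available after every turn, a failure can be traced to a specific stage. A direct speech-to-speech model has no such text hand-off to log, which is one way to read the "harder to debug" trade-off.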

Personalized transcription and voice enhancement

  • ElevenLabs is developing person-specific transcription — fine-tuning on a specific speaker’s voice to achieve superhuman accuracy — expected to ship within weeks at the time of recording.
  • This is especially valuable in healthcare (operating room commands), home devices (listening to one person among family noise), and any setting where knowing the speaker improves accuracy.
  • On the generation side, voice-to-voice transformation (analogous to “touch up my appearance” on Zoom) is now possible: de-accenting, de-mumbling, adjusting enunciation, slowing down speech, or adding dramatic pauses.
  • The V3 model introduced expressive control — users can now specify how something is spoken (slower, more dramatic, etc.), which previously was left entirely to the model’s discretion.

Second-order effects of ubiquitous voice AI

  • Breaking language barriers: Growing up in Poland, Mati experienced poor dubbing where one voice actor played every role. AI dubbing now makes high-quality localization affordable across all languages, not just English-dominant ones.
  • Universal translation: The long-term vision is real-time cross-language conversation — a Babel fish from The Hitchhiker’s Guide to the Galaxy — where you speak your language and the other person hears theirs.
  • Restoring lost voices: People who lost their voice to ALS, throat cancer, or other conditions can have their voice recreated. Notable examples include a Neuralink patient speaking with their own voice again, and a woman who had lost her voice before her wedding and was able to speak her vows.
  • Voice agents in the wild: A developer built the “Guinndex” using ElevenLabs, calling pubs across Ireland to collect pint prices — demonstrating both proactive and reactive voice agent use cases.

Economics of voice models

  • Voice models are significantly smaller and cheaper than LLMs and image/video models — typically a few billion to tens of billions of parameters vs. hundreds of billions for leading LLMs.
  • Pricing model: Charged per text token (text-to-speech) or per minute (voice agents and transcription), with annual enterprise agreements offering volume discounts.
  • New model pricing strategy: Newer, more expensive-to-run models are offered at competitive (sometimes subsidized) prices to maximize distribution, let customers discover new capabilities, and generate feedback — even though early reliability may not yet meet enterprise standards.
  • Cascaded vs. fused model size: In the cascaded approach, voice models will likely stay small for speed and reliability. In fused speech-to-speech models combining LLM and voice capabilities, sizes will grow into the tens or hundreds of billions.
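
One way to see why parameter count drives cost is a back-of-the-envelope estimate of the memory needed just to hold the weights. The Python sketch below assumes 16-bit weights (2 bytes per parameter) and picks illustrative parameter counts from the ranges mentioned above; neither the precision nor the exact counts come from the episode, and the estimate ignores activations, caches, and serving overhead.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes).

    bytes_per_param=2 assumes 16-bit weights, which is an assumption of this sketch.
    """
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Illustrative sizes only: "a few billion", "tens of billions", "hundreds of billions".
    for label, n_params in [
        ("few-billion voice model", 3e9),
        ("tens-of-billions voice model", 30e9),
        ("hundreds-of-billions LLM", 300e9),
    ]:
        print(f"{label}: ~{weight_memory_gb(n_params):.0f} GB of weights")
```

Under these assumptions, a model in the few-billion range fits on a single commodity accelerator, while one in the hundreds of billions has to be split across several devices, which is consistent with the claim that voice models are cheaper to run.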

Revenue and growth

  • $350 million ARR at the end of 2025, with one quarter alone adding $100 million in net new ARR, putting the company on track for roughly $450+ million ARR.
  • Over 50% of revenue is now sales-led enterprise, with strong land-and-expand dynamics: customers start in one department (e.g., marketing) and expand to customer support, then across the entire organization.
  • Example: Deutsche Telekom started with marketing content (Magenta podcast generation), expanded to customer support, and now has an agent handling calls across their entire network.
  • Self-serve motion drives awareness and distribution, while high-touch deployed engineering works alongside the largest enterprises to customize implementations.
  • Why self-serve matters: It gives immediate feedback on technology quality, reflects the belief that the best product should be available to everyone, and lets developers and SMBs help discover future product trajectories.
  • Pay-as-you-go billing is being launched for all self-serve users, addressing a common consumer frustration with AI products that hit usage limits without offering an easy overage payment option.

Designing an AI-native organization

  • Small, flat teams: Teams of roughly 10 people, with spans of control of 15+ direct reports — far exceeding the traditional ~8. Both co-founders have 15+ direct reports each.
  • Technical resources embedded in every non-technical team: Ops, talent, finance, and go-to-market teams each have a technical lead or technically skilled person who automates workflows and upskills the rest of the team.
  • Practical examples:
    • Talent: LLM-ifying recruiting data to make pipelines explorable, scraping and analyzing candidate profiles automatically.
    • Go-to-market: AI-generated pre-reads before meetings, customized pitch decks pre-populated with relevant numbers, AI SDR voice agents for lead capture.
    • Culture: A voice agent that prospective candidates can speak with to learn about company culture and prep for interviews.
  • Hiring philosophy: Filter for agency above all — the drive to explore, take ownership, and act regardless of seniority. High-agency people are the primary beneficiaries of AI advances; low-agency people risk being left behind.
  • Culture as the scaling mechanism: The founders’ proudest achievement is that culture, not any individual or product, is what now builds the company as it scales to 470+ people.
  • Validation from Ukraine: The Ukrainian government’s DIIA platform (citizen services app) uses a similar model — every ministry has its own technical resources building agentic versions of their work, assembled by a central digital transformation team — confirming that embedding technical capability across teams is a broadly effective pattern.