A Deep Dive into the Future of Voice in AI — Unsupervised Learning

Russ d’Sa, co-founder and CEO of LiveKit, discusses the future of voice AI, how LiveKit powers applications like ChatGPT Voice, and why voice and multimodal interfaces will increasingly replace keyboards and mice. LiveKit provides the real-time audio/video infrastructure that connects users’ devices to AI agents, and d’Sa argues that as AI models become more human-like, our interfaces with them will naturally shift toward the same modalities humans use with each other: speech, vision, and touch.

LiveKit acts as the “nervous system” for AI, transporting sensory input (audio, video) from a user’s device to an AI “brain” (like GPT-4o) and returning the output to the user in real time.
- A LiveKit SDK on the user’s device captures microphone and camera input.
- Audio is transmitted through LiveKit’s global edge network mesh to an agent running on the backend.
- In traditional voice mode, audio goes through speech-to-text, then to an LLM, then text-to-speech on the way back.
- With GPT-4o’s realtime API, audio embeddings go directly into the model and speech comes back out, bypassing the text conversion step entirely.
The nervous system analogy: foundational model companies (OpenAI, Anthropic, Google) are building the “brain”; LiveKit builds the infrastructure that connects that brain to the senses — microphones, cameras, speakers — and carries signals back and forth.
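The two paths described above can be contrasted in a small sketch. Every function here is a hypothetical stand-in that only tags strings; none of it is LiveKit's SDK or OpenAI's API. It is meant only to show where the text hop disappears in the speech-native path.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Audio:
    """Placeholder for an audio chunk; `label` stands in for the samples."""
    label: str


# Hypothetical stages (assumptions for illustration, not real APIs).
def speech_to_text(audio: Audio) -> str:
    return f"transcript({audio.label})"


def llm(prompt: str) -> str:
    return f"reply({prompt})"


def text_to_speech(text: str) -> Audio:
    return Audio(f"speech({text})")


def speech_to_speech_model(audio: Audio) -> Audio:
    return Audio(f"speech_reply({audio.label})")


def cascaded_agent(audio: Audio) -> Audio:
    """Traditional voice mode: three hops, with text in the middle."""
    return text_to_speech(llm(speech_to_text(audio)))


def realtime_agent(audio: Audio) -> Audio:
    """Speech-native mode: audio in, audio out, no intermediate text."""
    return speech_to_speech_model(audio)


def transport(mic_audio: Audio, agent: Callable[[Audio], Audio]) -> Audio:
    """The 'nervous system' role: carry input to the agent and output back."""
    return agent(mic_audio)


if __name__ == "__main__":
    print(transport(Audio("hello"), cascaded_agent).label)
    print(transport(Audio("hello"), realtime_agent).label)
```

The transport function is deliberately indifferent to which agent sits behind it, which mirrors the division of labor in the analogy: the infrastructure moves signals, and the model decides what to say.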

D’Sa’s favorite personal use case is learning during commutes: he puts in one AirPod and asks Advanced Voice Mode to teach him about frontier tech topics (quantum computing, how lightning works, etc.).
He describes it as having a judgment-free tutor with the world’s knowledge, where you can ask any “dumb question” without embarrassment.

The office of the future is uncertain — it could be a desk at home, or being “always in the office” via phone and agents. The nature of work itself will change drastically.
Creative tools will become voice-driven and multimodal, with the human acting as the orchestrator or “maestro” while the AI handles the mechanical work. He points to Tony Stark’s interactions with J.A.R.V.I.S. as a model for this.
Copilots and agents will coexist as a hybrid, mirroring how humans work with each other: sometimes pairing closely, sometimes delegating autonomous tasks.
Text will persist alongside voice and vision — for example, when ordering at a new restaurant, you wouldn’t want the AI to read the entire menu aloud; a hybrid of voice plus on-the-fly generated UI (buttons to tap) makes more sense.
The “thin client” dream may be realized through chat interfaces — a single, familiar UI (text box + message stream) that incorporates voice, generated UI, and text, rather than thousands of separate apps with different designs.
Current AI interfaces are still mode-based (voice vs. text vs. code editor), but the future will blend modalities fluidly, like pair programming where you look at a screen, type, ask questions, and let the other person take the keyboard — all mixed together in real time.

Consumer-facing voice interfaces from OpenAI, Gemini Live, Character.AI, and Perplexity are pushing the envelope — tutoring, therapy, information lookup — but these are still “emergent” at relatively small scale.
Telephony is the near-term high-penetration opportunity — billions of calls happen monthly, and AI is rapidly entering spaces like customer support, insurance eligibility lookups, and any IVR/phone-tree system. Companies like Sierra and Parloa are already disrupting this space.
Latency is no longer the main blocker — it has dropped from ~4 seconds in early 2023 to ~320 milliseconds with GPT-4o and LiveKit, which is near the ~300ms human conversational threshold. In some cases (e.g., Cerebras inference at ~100ms), models respond too fast and feel unnatural.
The real challenge is systems integration — customer support AI needs to update backend systems (Salesforce, custom ticketing software), and many of these systems are bespoke and hard to integrate with. Human-in-the-loop is still necessary because models aren’t perfect yet.
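The latency figures quoted above can be put side by side with a little arithmetic. The sketch below compares a response time against the roughly 300 ms conversational threshold and computes how long a too-fast reply would need to wait to land on it. Adding such a pacing delay is an assumption made for illustration; the episode only says that very fast responses feel unnatural, not how anyone corrects for it.

```python
HUMAN_TURN_GAP_MS = 300.0  # approximate conversational threshold cited above


def overshoot_ms(latency_ms: float, target_ms: float = HUMAN_TURN_GAP_MS) -> float:
    """How much slower than the conversational threshold a reply is."""
    return max(0.0, latency_ms - target_ms)


def pacing_delay_ms(latency_ms: float, target_ms: float = HUMAN_TURN_GAP_MS) -> float:
    """Hypothetical extra wait so a reply arrives no earlier than the threshold."""
    return max(0.0, target_ms - latency_ms)


if __name__ == "__main__":
    # Figures from the text: early 2023, GPT-4o with LiveKit, very fast inference.
    for label, latency in [("early 2023", 4000), ("GPT-4o", 320), ("fast inference", 100)]:
        print(label, overshoot_ms(latency), pacing_delay_ms(latency))
```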

LiveKit built a cloud browser infrastructure (headless Chrome instances) that agents can control via Playwright — loading pages, clicking buttons, filling forms.
When the agent gets stuck, it can stream the browser as video to a human user, who can click on the video pixels to unblock it. The clicks are replayed back in the cloud, creating a shared interactive session.
This gives AI the “ability to touch” — the last major sense after sight, hearing, and speech — enabling AI to manipulate applications on a consumer device the way humans do with touch events.
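Replaying a click made on streamed video requires translating the position on the video frame into a position in the remote browser's viewport. A minimal sketch of that mapping follows. It assumes the video shows the whole viewport with no letterboxing or cropping, and the names are hypothetical; how LiveKit actually implements this is not described in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    width: int
    height: int


def video_click_to_viewport(
    x: float, y: float, video: Size, viewport: Size
) -> tuple[float, float]:
    """Scale a click on the rendered video into remote viewport coordinates."""
    if not (0 <= x <= video.width and 0 <= y <= video.height):
        raise ValueError("click lies outside the video frame")
    return (x * viewport.width / video.width, y * viewport.height / video.height)


if __name__ == "__main__":
    # A 640x360 video of a 1280x720 browser: the center maps to the center.
    print(video_click_to_viewport(320, 180, Size(640, 360), Size(1280, 720)))
```

In a Playwright-driven browser, the resulting pair could then be passed to `page.mouse.click(x, y)` to replay the click in the cloud session.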

Insurance eligibility lookups are a massive, largely invisible workload — hospitals call insurers millions of times to verify coverage, and AI systems are starting to automate these calls, including outbound AI-to-human calls.
GPT-4o’s fully multimodal training (joint text and speech embeddings, accepting and outputting any combination of modalities) was a landmark moment d’Sa had been waiting for.

Humanoid robotics illustrates the split between on-device and cloud inference clearly: planning and reasoning happen in the cloud, but reflex actions, kinematics, and real-time movement must run on-device (you can’t wait for a cloud round-trip when a car is coming).
The human analogy: no person holds all the world’s knowledge in their head — we “do inference in the cloud” by looking things up on our phones or calling experts. Local models handle what they can; cloud models handle the rest.
In an ideal, resource-unconstrained world, you’d parallel-path both — local and cloud models inferring simultaneously, with the fastest correct responder winning.
Even on-device AI sends data to the cloud for logging, legal records, training data generation, and error correction. Privacy-sensitive use cases may stay local, but even Apple Intelligence uses a secure cloud.
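The "parallel-path" idea, where local and cloud models infer at once and the fastest correct responder wins, can be sketched with `asyncio`. The two model functions and the validity check are hypothetical placeholders; a real system would need its own definition of "correct".

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def first_valid(
    candidates: list[Awaitable[str]], is_valid: Callable[[str], bool]
) -> Optional[str]:
    """Run all candidates concurrently; return the first answer that validates."""
    tasks = [asyncio.ensure_future(c) for c in candidates]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                answer = await finished
            except Exception:
                continue  # a failed responder simply drops out of the race
            if is_valid(answer):
                return answer
        return None  # nobody produced an acceptable answer
    finally:
        for task in tasks:
            task.cancel()  # stop the slower responders; no-op if already done


# Hypothetical stand-ins for an on-device model and a cloud model.
async def local_model(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return ""  # fast, but has no answer for this prompt


async def cloud_model(prompt: str) -> str:
    await asyncio.sleep(0.05)
    return f"cloud answer to {prompt!r}"


async def demo() -> Optional[str]:
    return await first_valid([local_model("q"), cloud_model("q")], is_valid=bool)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Here the local model returns first but fails validation, so the slower cloud answer wins. If the local answer had been acceptable, the cloud request would have been cancelled as soon as it arrived.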

Overhyped: Transformers. Underhyped/under-researched: Spiking neural networks — analog-style networks modeled more closely on how real neurons interact, potentially ideal for audio and video signals, though harder to train.
What he’s changed his mind on: a year ago he thought applications would have real moats within 6–18 months; now he believes models change so fast that the only viable strategy is being deeply embedded with customers and building extremely fast.
AI startup he’s most excited about (besides LiveKit): Tesla — after using its API to build a voice-controlled car demo and experiencing full self-driving from SF to Menlo Park, he called it a “marvel of technology” and “sci-fi dreams.”
If he had to start an AI application tomorrow: a video game with deeply interactive, voice-driven NPCs in an open world with dynamic, lifelike characters and infinite story permutations — he called voice AI in video games potentially underhyped.
Where to learn more: github.com/livekit (most of their work is open source), livekit.io, and x.com/livekit.

Summary