The world of voice AI, with Mati Staniszewski of ElevenLabs
TL;DR
ElevenLabs CEO Mati Staniszewski explains how modern voice AI models predict phonemes with contextual awareness rather than relying on hard-coded parameters, allowing characteristics like accents and emotions to emerge on their own. He also discusses the company's platform strategy and the deployment gap between capable models and consumer applications.
🧠 Technical Architecture (3 insights)
Phonemes function as audio tokens
Voice models predict the next phoneme (the individual sound units that make up syllables) based on previous audio context and textual input, operating much like next-token prediction in LLMs.
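To make the LLM analogy concrete, here is a minimal sketch of autoregressive next-phoneme prediction in PyTorch. The vocabulary size, dimensions, and architecture are illustrative assumptions rather than details of ElevenLabs' actual models, and text conditioning is omitted here (see the dual-processing sketch below).

```python
import torch
import torch.nn as nn

PHONEME_VOCAB = 128  # hypothetical phoneme inventory size


class PhonemeLM(nn.Module):
    """Toy autoregressive model over phoneme tokens."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(PHONEME_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, PHONEME_VOCAB)

    def forward(self, phoneme_ids):
        # Causal mask: each position attends only to earlier phonemes,
        # which is what makes this *next*-phoneme prediction.
        mask = nn.Transformer.generate_square_subsequent_mask(phoneme_ids.size(1))
        h = self.backbone(self.embed(phoneme_ids), mask=mask)
        return self.head(h)  # logits over the next phoneme at each position


model = PhonemeLM()
context = torch.randint(0, PHONEME_VOCAB, (1, 10))  # previous phoneme IDs
next_phoneme = model(context)[0, -1].argmax().item()
```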
Emergent voice characteristics replace hard-coding
Rather than relying on preset parameters for accents or emotions, as early Bell Labs systems did, the model infers qualities like Britishness, enthusiasm, or sadness emergently through its neural architecture.
Contextual prediction requires dual processing
The model processes text comprehension and audio waveform generation simultaneously, using sentence-level context to render appropriate emotional inflection and prosody.
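As a hedged sketch of what that dual processing could look like, the decoder block below attends both to the audio generated so far and to the embedded text of the full sentence, which is what would let prosody reflect context that has not yet been spoken (a trailing question mark raising final pitch, for example). All module names and shapes are assumptions for illustration, not ElevenLabs' real components.

```python
import torch
import torch.nn as nn


class DualContextDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, audio_so_far, text_embeddings):
        # 1) Attend over the audio generated so far (local acoustic context).
        h, _ = self.audio_attn(audio_so_far, audio_so_far, audio_so_far)
        # 2) Cross-attend to the whole sentence's text, letting the model
        #    "look ahead" and shape inflection before those words are spoken.
        h, _ = self.text_attn(h, text_embeddings, text_embeddings)
        return self.out(h)


decoder = DualContextDecoder()
audio_ctx = torch.randn(1, 20, 256)  # embeddings of audio frames so far
text_ctx = torch.randn(1, 12, 256)   # embeddings of the full input sentence
next_frame_features = decoder(audio_ctx, text_ctx)[:, -1]
```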
🏗️ Platform Strategy (3 insights)
Horizontal infrastructure focus
ElevenLabs builds foundational models and infrastructure for voice agents, telephony, and creative tools while avoiding domain-specific vertical applications.
Direct deployment keeps models current
Working directly with enterprises prevents intermediation risks where customers might remain stuck on outdated model versions instead of accessing weekly capability improvements.
Full-stack voice agent infrastructure
The platform combines text-to-speech (TTS), speech-to-text (STT), and conversational models with knowledge base integration, telephony connections, and safety monitoring for production deployment.
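As a rough illustration of how those pieces could fit together in a single request loop, here is a hypothetical sketch; every function below is a placeholder stand-in, not an actual ElevenLabs API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user_text: str
    agent_text: str


# Placeholder stand-ins for the real model and infrastructure calls.
def speech_to_text(audio: bytes) -> str:
    return "where is my order?"  # in production: STT transcribes the caller

def knowledge_base_search(query: str, top_k: int) -> list[str]:
    return ["Orders ship within 2 business days."]  # retrieval over company docs

def conversational_model(user_text: str, docs: list[str], history: list[Turn]) -> str:
    return f"Based on our records: {docs[0]}"  # an LLM drafts the reply

def passes_safety_monitoring(text: str) -> bool:
    return True  # guardrails screen the reply before playback

def text_to_speech(text: str) -> bytes:
    return text.encode()  # TTS renders the reply as audio


def handle_call_audio(audio_chunk: bytes, history: list[Turn]) -> bytes:
    user_text = speech_to_text(audio_chunk)           # STT
    docs = knowledge_base_search(user_text, top_k=3)  # knowledge base integration
    agent_text = conversational_model(user_text, docs, history)
    if not passes_safety_monitoring(agent_text):      # safety monitoring
        agent_text = "Let me connect you with a human agent."
    history.append(Turn(user_text, agent_text))
    return text_to_speech(agent_text)                 # audio back over telephony


history: list[Turn] = []
reply_audio = handle_call_audio(b"...caller audio...", history)
```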
📱 Deployment Gap (3 insights)
Consumer deployment lags capability
Despite three years of capable voice technology, integration into cars and phones remains rudimentary due to slow automotive adoption and OS limitations on third-party transcription engines.
ElevenReader fills distribution void
The consumer app enables PDF narration and AI audiobook creation with voices like Sir Michael Caine's, filling a gap created when traditional distributors blocked AI-generated audio content.
Enterprise adoption precedes automotive
Real-time contextual voice agents are currently deploying in enterprise settings, with in-car offline voice processing expected within 2-3 years.
📊 Data Innovation (2 insights)
Specialized audio annotation teams
ElevenLabs built dedicated teams trained specifically to annotate emotional context, accents, and prosody in audio, because generic labelers lacked the expertise to describe vocal characteristics.
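For illustration, one record from such an annotation pipeline might look like the following; the schema and field names are assumptions, not ElevenLabs' actual format.

```python
from dataclasses import dataclass


@dataclass
class AudioAnnotation:
    clip_id: str
    transcript: str
    emotion: str        # e.g. "enthusiastic", "somber"
    accent: str         # e.g. "British (RP)", "American (Southern)"
    prosody_notes: str  # free text: pacing, stress, intonation contours


example = AudioAnnotation(
    clip_id="clip_0042",
    transcript="I can't believe we won!",
    emotion="enthusiastic",
    accent="British (RP)",
    prosody_notes="fast pace, heavy stress on 'won', rising pitch at the end",
)
```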
Internal tools become commercial products
Speech-to-text models initially developed for internal data annotation evolved into commercial offerings, with production feedback loops continuously improving model accuracy.
Bottom Line
Voice AI's immediate opportunity lies not in improving core models but in closing the deployment gap by integrating existing capable technology into consumer devices and enterprise workflows through full-stack platforms.