The world of voice AI, with Mati Staniszewski of ElevenLabs
TL;DR
ElevenLabs CEO Mati Staniszewski explains how modern voice AI models predict phonemes with contextual awareness rather than relying on hard-coded parameters, which lets properties like accents and emotions emerge on their own. He also discusses the company's platform strategy and the gap between capable models and their deployment in consumer applications.
Technical Architecture (3 insights)
Phonemes function as audio tokens
Voice models predict next phonemes (deconstructed syllable sounds) based on previous audio context and textual input, operating similarly to token prediction in LLMs.
Emergent voice characteristics replace hard-coding
Rather than using preset parameters for accents or emotions like early Bell Labs systems, the model deduces Britishness, enthusiasm, or sadness emergently through neural architecture.
Contextual prediction requires dual processing
The model simultaneously processes text construction and audio waveform generation to understand sentence context and render appropriate emotional inflection and prosody.
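The next-phoneme idea above can be illustrated with a deliberately tiny sketch. It is not ElevenLabs' model: it assumes a bigram counter in place of a neural network, conditions only on the single previous phoneme rather than on text and audio context, and uses a made-up three-word corpus written in ARPAbet-style symbols. The function names (train_bigram, predict_next, generate) are hypothetical and exist only for this illustration.

```python
from collections import Counter, defaultdict

END = "</s>"  # hypothetical end-of-utterance marker for this toy example


def train_bigram(sequences):
    """Count which phoneme follows which across the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent successor of `prev`, or None if unseen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_len=10):
    """Greedily extend a phoneme sequence one prediction at a time."""
    out = [start]
    while len(out) < max_len:
        nxt = predict_next(counts, out[-1])
        if nxt is None or nxt == END:
            break
        out.append(nxt)
    return out


# Toy corpus (an assumption for illustration): "hello" twice, "yellow" once.
TOY_CORPUS = [
    ["HH", "AH", "L", "OW", END],
    ["HH", "AH", "L", "OW", END],
    ["Y", "EH", "L", "OW", END],
]

counts = train_bigram(TOY_CORPUS)
greedy = generate(counts, "HH")
print(greedy)
```

The loop in generate is the part that mirrors token prediction in LLMs: each step feeds the sequence so far back in and appends one more unit. A real voice model would replace the lookup table with a learned network and a much longer context.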
Platform Strategy (3 insights)
Horizontal infrastructure focus
ElevenLabs builds foundational models and infrastructure for voice agents, telephony, and creative tools while avoiding domain-specific vertical applications.
Direct deployment ensures model currency
Working directly with enterprises prevents intermediation risks where customers might remain stuck on outdated model versions instead of accessing weekly capability improvements.
Full-stack voice agent infrastructure
The platform combines TTS, STT, and conversational models with knowledge base integration, telephony connections, and safety monitoring for production deployment.
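As a rough sketch of how the pieces named above could fit together in one conversational turn, the example below chains speech-to-text, a safety check, a knowledge base lookup, and text-to-speech. Everything in it is an assumption made for illustration: the VoiceAgent class and its methods are hypothetical stand-ins, not the ElevenLabs API, the "audio" is just UTF-8 bytes, and telephony is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceAgent:
    """Hypothetical single-turn voice agent built from stub components."""

    knowledge_base: dict
    blocked_terms: set = field(default_factory=set)

    def transcribe(self, audio: bytes) -> str:
        # Stand-in for STT: the toy "audio" is simply UTF-8 encoded text.
        return audio.decode("utf-8")

    def is_safe(self, text: str) -> bool:
        # Stand-in for safety monitoring: reject any blocked term.
        lowered = text.lower()
        return not any(term in lowered for term in self.blocked_terms)

    def respond(self, text: str) -> str:
        # Stand-in for the conversational model: keyword lookup
        # against the knowledge base.
        lowered = text.lower()
        for key, answer in self.knowledge_base.items():
            if key in lowered:
                return answer
        return "Sorry, I don't know that yet."

    def synthesize(self, text: str) -> bytes:
        # Stand-in for TTS: encode the reply back into bytes.
        return text.encode("utf-8")

    def handle_turn(self, audio: bytes) -> bytes:
        """Run one turn: STT, safety check, response, TTS."""
        heard = self.transcribe(audio)
        if not self.is_safe(heard):
            reply = "I can't help with that."
        else:
            reply = self.respond(heard)
        return self.synthesize(reply)


agent = VoiceAgent(
    knowledge_base={"opening hours": "We are open 9 to 5."},
    blocked_terms={"password"},
)
reply_audio = agent.handle_turn(b"What are your opening hours?")
print(reply_audio.decode("utf-8"))
```

Keeping each stage behind its own method is the design point: any one stub could be swapped for a real model without changing handle_turn.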
Deployment Gap (3 insights)
Consumer deployment lags capability
Despite three years of capable voice technology, integration into cars and phones remains rudimentary due to slow automotive adoption and OS limitations on third-party transcription engines.
ElevenReader fills distribution void
The consumer app enables PDF narration and AI audiobook creation using voices like Sir Michael Caine after traditional distributors blocked AI-generated audio content.
Enterprise adoption precedes automotive
Real-time contextual voice agents are currently deploying in enterprise settings, with in-car offline voice processing expected within 2-3 years.
Data Innovation (2 insights)
Specialized audio annotation teams
ElevenLabs built dedicated teams trained specifically to annotate emotional context, accents, and prosody in audio, because generic labelers lacked the expertise to describe vocal characteristics.
Internal tools become commercial products
Speech-to-text models initially developed for internal data annotation evolved into commercial offerings, with production feedback loops continuously improving model accuracy.
Bottom Line
Voice AI's immediate opportunity lies not in improving core models but in closing the deployment gap by integrating existing capable technology into consumer devices and enterprise workflows through full-stack platforms.