The world of voice AI, with Mati Staniszewski of ElevenLabs

Podcasts | April 14, 2026 | 4.15K views | 1:00:33

TL;DR

ElevenLabs CEO Mati Staniszewski explains how modern voice AI models predict phonemes with contextual awareness rather than using hard-coded parameters, enabling emergent properties like accents and emotions, while discussing the company's platform strategy and the deployment gap between capable models and consumer applications.

🧠 Technical Architecture 3 insights

Phonemes function as audio tokens

Voice models predict the next phoneme (the individual sound units that make up syllables) from the preceding audio context and the input text, much as LLMs predict the next token.
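To make the token analogy concrete, here is a minimal sketch of autoregressive next-phoneme prediction. It is a toy bigram counter, not ElevenLabs' model: the phoneme sequences, the `train_bigram` and `generate` helpers, and the `<s>`/`</s>` markers are all assumptions made for illustration, and a real system would condition on the input text and on far more audio context than a single previous phoneme.

```python
from collections import Counter, defaultdict

# Toy training data: ARPAbet-style phoneme sequences (illustrative only).
TRAINING = [
    ["HH", "AH", "L", "OW"],  # roughly "hello"
    ["Y", "EH", "L", "OW"],   # roughly "yellow"
    ["L", "OW"],              # roughly "low"
]

START, END = "<s>", "</s>"  # hypothetical sequence boundary markers


def train_bigram(sequences):
    """Count how often each phoneme follows the previous one."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip([START] + seq, seq + [END]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, max_len=10):
    """Greedy autoregressive decoding: pick the most frequent next phoneme,
    append it, and use it as context for the following step."""
    out, prev = [], START
    for _ in range(max_len):
        candidates = counts.get(prev)
        if not candidates:
            break
        # Highest count wins; ties are broken alphabetically for determinism.
        nxt = min(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return out


model = train_bigram(TRAINING)
generated = generate(model)
print(generated)
```

The loop has the same shape as LLM decoding: predict a distribution over the next unit, choose one, append it, repeat. The difference described in the episode lies in what the units are and what the prediction is conditioned on.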

Emergent voice characteristics replace hard-coding

Rather than relying on preset parameters for accents or emotions, as early Bell Labs systems did, the model infers traits such as a British accent, enthusiasm, or sadness from context; these emerge from the neural network instead of being hard-coded.

Contextual prediction requires dual processing

The model processes the text and generates the audio waveform together, so it can use the context of the whole sentence to render appropriate emotional inflection and prosody.

🏗️ Platform Strategy 3 insights

Horizontal infrastructure focus

ElevenLabs builds foundational models and infrastructure for voice agents, telephony, and creative tools while avoiding domain-specific vertical applications.

Direct deployment ensures model currency

Working directly with enterprises avoids the risk that an intermediary leaves customers stuck on outdated model versions, cut off from weekly capability improvements.

Full-stack voice agent infrastructure

The platform combines TTS, STT, and conversational models with knowledge base integration, telephony connections, and safety monitoring for production deployment.
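The way these components fit together in a single conversational turn can be sketched as follows. This is an assumed structure for illustration, not ElevenLabs' API: the `VoiceAgent` class, its fields, and the stub functions are all hypothetical, the "audio" is just UTF-8 bytes, and telephony is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class VoiceAgent:
    """Hypothetical wiring of STT -> knowledge lookup -> reply -> safety check -> TTS."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    respond: Callable[[str, str], str]   # conversational model: (user text, context) -> reply
    synthesize: Callable[[str], bytes]   # text-to-speech stage
    knowledge_base: Dict[str, str] = field(default_factory=dict)
    blocked_terms: Tuple[str, ...] = ()

    def handle_turn(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        # Naive exact-match lookup; a real system would use retrieval.
        context = self.knowledge_base.get(text.lower().strip("?!. "), "")
        reply = self.respond(text, context)
        # Minimal stand-in for safety monitoring.
        if any(term in reply.lower() for term in self.blocked_terms):
            reply = "Sorry, I can't help with that."
        return self.synthesize(reply)


# Stub stages so the sketch runs without any external service.
agent = VoiceAgent(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda text, context: context or "I don't know.",
    synthesize=lambda text: text.encode("utf-8"),
    knowledge_base={"opening hours": "9am to 5pm"},
    blocked_terms=("password",),
)

print(agent.handle_turn(b"Opening hours?"))
```

Passing each stage in as a callable keeps the turn logic independent of any particular model, which is one way to swap in newer model versions without touching the surrounding pipeline.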

📱 Deployment Gap 3 insights

Consumer deployment lags capability

Despite three years of capable voice technology, integration into cars and phones remains rudimentary due to slow automotive adoption and OS limitations on third-party transcription engines.

ElevenReader fills distribution void

The consumer app enables PDF narration and AI audiobook creation using voices like Sir Michael Caine after traditional distributors blocked AI-generated audio content.

Enterprise adoption precedes automotive

Real-time contextual voice agents are currently being deployed in enterprise settings, with in-car offline voice processing expected within 2 to 3 years.

📊 Data Innovation 2 insights

Specialized audio annotation teams

ElevenLabs built dedicated teams trained specifically to annotate emotional context, accents, and prosody in audio, because generic labelers lacked the expertise to describe vocal characteristics.

Internal tools become commercial products

Speech-to-text models initially developed for internal data annotation evolved into commercial offerings, with production feedback loops continuously improving model accuracy.

Bottom Line

Voice AI's immediate opportunity lies not in improving core models but in closing the deployment gap by integrating existing capable technology into consumer devices and enterprise workflows through full-stack platforms.
