The world of voice AI, with Mati Staniszewski of ElevenLabs
TL;DR
ElevenLabs CEO Mati Staniszewski explains how modern voice AI models predict phonemes with contextual awareness rather than relying on hard-coded parameters, allowing characteristics like accents and emotions to emerge on their own. He also discusses the company's platform strategy and the deployment gap between capable models and consumer applications.
🧠 Technical Architecture (3 insights)
Phonemes function as audio tokens
Voice models predict the next phoneme (the individual sound units that make up syllables) based on previous audio context and textual input, operating much like next-token prediction in LLMs.
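To make the LLM analogy concrete, here is a minimal sketch of autoregressive next-phoneme prediction in PyTorch. The vocabulary size, dimensions, and architecture are illustrative assumptions rather than details of ElevenLabs' actual models, and text conditioning is omitted here (see the dual-processing sketch below).

```python
import torch
import torch.nn as nn

PHONEME_VOCAB = 128  # hypothetical phoneme inventory size


class PhonemeLM(nn.Module):
    """Toy autoregressive model over phoneme tokens."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(PHONEME_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, PHONEME_VOCAB)

    def forward(self, phoneme_ids):
        # Causal mask: each position attends only to earlier phonemes,
        # which is what makes this *next*-phoneme prediction.
        mask = nn.Transformer.generate_square_subsequent_mask(phoneme_ids.size(1))
        h = self.backbone(self.embed(phoneme_ids), mask=mask)
        return self.head(h)  # logits over the next phoneme at each position


model = PhonemeLM()
context = torch.randint(0, PHONEME_VOCAB, (1, 10))  # previous phoneme IDs
next_phoneme = model(context)[0, -1].argmax().item()
```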
Emergent voice characteristics replace hard-coding
Rather than relying on preset parameters for accents or emotions, as early Bell Labs systems did, the model infers qualities like Britishness, enthusiasm, or sadness emergently through its neural architecture.
Contextual prediction requires dual processing
The model processes text comprehension and audio waveform generation simultaneously, using sentence-level context to render appropriate emotional inflection and prosody.
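As a hedged sketch of what that dual processing could look like, the decoder block below attends both to the audio generated so far and to the embedded text of the full sentence, which is what would let prosody reflect context that has not yet been spoken (a trailing question mark raising final pitch, for example). All module names and shapes are assumptions for illustration, not ElevenLabs' real components.

```python
import torch
import torch.nn as nn


class DualContextDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, audio_so_far, text_embeddings):
        # 1) Attend over the audio generated so far (local acoustic context).
        h, _ = self.audio_attn(audio_so_far, audio_so_far, audio_so_far)
        # 2) Cross-attend to the whole sentence's text, letting the model
        #    "look ahead" and shape inflection before those words are spoken.
        h, _ = self.text_attn(h, text_embeddings, text_embeddings)
        return self.out(h)


decoder = DualContextDecoder()
audio_ctx = torch.randn(1, 20, 256)  # embeddings of audio frames so far
text_ctx = torch.randn(1, 12, 256)   # embeddings of the full input sentence
next_frame_features = decoder(audio_ctx, text_ctx)[:, -1]
```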
🏗️ Platform Strategy (3 insights)
Horizontal infrastructure focus
ElevenLabs builds foundational models and infrastructure for voice agents, telephony, and creative tools while avoiding domain-specific vertical applications.
Direct deployment keeps models current
Working directly with enterprises prevents intermediation risks where customers might remain stuck on outdated model versions instead of accessing weekly capability improvements.
Full-stack voice agent infrastructure
The platform combines text-to-speech (TTS), speech-to-text (STT), and conversational models with knowledge base integration, telephony connections, and safety monitoring for production deployment.
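As a rough illustration of how those pieces could fit together in a single request loop, here is a hypothetical sketch; every function below is a placeholder stand-in, not an actual ElevenLabs API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user_text: str
    agent_text: str


# Placeholder stand-ins for the real model and infrastructure calls.
def speech_to_text(audio: bytes) -> str:
    return "where is my order?"  # in production: STT transcribes the caller

def knowledge_base_search(query: str, top_k: int) -> list[str]:
    return ["Orders ship within 2 business days."]  # retrieval over company docs

def conversational_model(user_text: str, docs: list[str], history: list[Turn]) -> str:
    return f"Based on our records: {docs[0]}"  # an LLM drafts the reply

def passes_safety_monitoring(text: str) -> bool:
    return True  # guardrails screen the reply before playback

def text_to_speech(text: str) -> bytes:
    return text.encode()  # TTS renders the reply as audio


def handle_call_audio(audio_chunk: bytes, history: list[Turn]) -> bytes:
    user_text = speech_to_text(audio_chunk)           # STT
    docs = knowledge_base_search(user_text, top_k=3)  # knowledge base integration
    agent_text = conversational_model(user_text, docs, history)
    if not passes_safety_monitoring(agent_text):      # safety monitoring
        agent_text = "Let me connect you with a human agent."
    history.append(Turn(user_text, agent_text))
    return text_to_speech(agent_text)                 # audio back over telephony


history: list[Turn] = []
reply_audio = handle_call_audio(b"...caller audio...", history)
```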
📱 Deployment Gap (3 insights)
Consumer deployment lags capability
Despite three years of capable voice technology, integration into cars and phones remains rudimentary due to slow automotive adoption and OS limitations on third-party transcription engines.
ElevenReader fills distribution void
The consumer app enables PDF narration and AI audiobook creation with voices like Sir Michael Caine's, filling a gap created when traditional distributors blocked AI-generated audio content.
Enterprise adoption precedes automotive
Real-time contextual voice agents are currently deploying in enterprise settings, with in-car offline voice processing expected within 2-3 years.
📊 Data Innovation (2 insights)
Specialized audio annotation teams
ElevenLabs built dedicated teams trained specifically to annotate emotional context, accents, and prosody in audio, because generic labelers lacked the expertise to describe vocal characteristics.
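For illustration, one record from such an annotation pipeline might look like the following; the schema and field names are assumptions, not ElevenLabs' actual format.

```python
from dataclasses import dataclass


@dataclass
class AudioAnnotation:
    clip_id: str
    transcript: str
    emotion: str        # e.g. "enthusiastic", "somber"
    accent: str         # e.g. "British (RP)", "American (Southern)"
    prosody_notes: str  # free text: pacing, stress, intonation contours


example = AudioAnnotation(
    clip_id="clip_0042",
    transcript="I can't believe we won!",
    emotion="enthusiastic",
    accent="British (RP)",
    prosody_notes="fast pace, heavy stress on 'won', rising pitch at the end",
)
```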
Internal tools become commercial products
Speech-to-text models initially developed for internal data annotation evolved into commercial offerings, with production feedback loops continuously improving model accuracy.
Bottom Line
Voice AI's immediate opportunity lies not in improving core models but in closing the deployment gap by integrating existing capable technology into consumer devices and enterprise workflows through full-stack platforms.