Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems

Podcasts | May 04, 2026 | 273 views | 1:06:26

TL;DR

ElevenLabs CEO Mati Staniszewski explains how the company, founded to solve the 'one voice' dubbing problem he grew up with in Poland, pivoted from its original AI dubbing vision to perfecting text-to-speech, staying close to Discord communities, leveraging open-source research, and running lean along the way.

🎭 Origin and Problem Definition

The Polish single-voice dubbing inspiration

Growing up in Poland, where a single monotone narrator voiced every character in foreign films, inspired the mission to preserve emotional voice characteristics across languages.

Pivoting from dubbing to voiceover

Customer research revealed creators urgently needed simple voiceover corrections more than full dubbing, shifting early focus to text-to-speech generation.

Narrowing scope to English first

Despite multilingual ambitions, the team initially focused on perfecting emotional English text-to-speech in 2022 rather than building the full transcription, translation, and synthesis pipeline that dubbing requires.

🔬 Technical Architecture Decisions

Learning voice characteristics organically

They abandoned manually programming gender, age, and accent variables in favor of transformer models that learn to abstract these characteristics automatically.
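
A minimal sketch of the idea, assuming a PyTorch-style model (module names and sizes here are illustrative, not ElevenLabs internals): instead of exposing gender, age, or accent as hand-coded inputs, the network pools reference audio into one learned speaker vector and lets the transformer abstract those traits on its own.

```python
import torch
import torch.nn as nn

class SpeakerConditionedTTS(nn.Module):
    """Illustrative TTS skeleton: no explicit gender/age/accent inputs."""
    def __init__(self, vocab_size=256, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        # Reference-audio encoder pools mel frames into a single learned
        # speaker vector; voice characteristics are abstracted here.
        self.ref_encoder = nn.GRU(input_size=80, hidden_size=d_model, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.to_mel = nn.Linear(d_model, 80)  # predict 80-bin mel frames

    def forward(self, text_ids, ref_mels):
        # text_ids: (B, T_text) token ids; ref_mels: (B, T_ref, 80) reference audio
        _, h = self.ref_encoder(ref_mels)      # h: (1, B, d_model)
        speaker = h[-1].unsqueeze(1)           # (B, 1, d_model) learned speaker vector
        x = self.text_emb(text_ids) + speaker  # broadcast speaker conditioning
        return self.to_mel(self.backbone(x))   # (B, T_text, 80) mel prediction

model = SpeakerConditionedTTS()
mels = model(torch.randint(0, 256, (2, 16)), torch.randn(2, 50, 80))
print(mels.shape)  # torch.Size([2, 16, 80])
```

The point of the design is that nothing in the interface names an attribute: whatever correlates with gender, age, or accent must be discovered inside the learned speaker embedding during training.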

Leveraging open-source breakthroughs

Early architecture drew inspiration from James Betker's Tortoise model, which achieved human-like short-form speech and was built as a side project while Betker worked at Google.

Applying LLM context awareness

They utilized next-token prediction breakthroughs to help models understand broader textual context for appropriate emotional delivery.
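
One way to picture this, as a sketch under assumptions rather than the actual ElevenLabs architecture: frame synthesis as LLM-style next-token prediction over discrete audio codes, with the full text passage in the context window so each audio token is predicted with the surrounding sentences in view (vocabulary sizes and layer counts below are made up).

```python
import torch
import torch.nn as nn

class AudioLM(nn.Module):
    """Decoder-only LM over [text tokens | audio tokens] (illustrative)."""
    def __init__(self, text_vocab=256, audio_vocab=1024, d=128, n_heads=4, n_layers=2):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d)
        self.audio_emb = nn.Embedding(audio_vocab, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, audio_vocab)

    def forward(self, text_ids, audio_ids):
        # Put the whole text passage in context, then the audio so far.
        x = torch.cat([self.text_emb(text_ids), self.audio_emb(audio_ids)], dim=1)
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        # Next-token logits at each audio position.
        return self.head(h[:, text_ids.size(1):, :])

model = AudioLM()
logits = model(torch.randint(0, 256, (1, 32)), torch.randint(0, 1024, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 1024])
```

Because attention can reach the entire passage, a token rendered at the end of a sentence "knows" whether that sentence is a question, a shout, or an aside, which is what enables context-appropriate emotional delivery.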

🚀 Community-Driven Execution

Operating on Discord initially

The founders ran the entire company on Discord with custom bots to avoid meetings and email, creating tight feedback loops with early creator communities.
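
The talk doesn't detail the bots themselves, but a minimal discord.py sketch shows the shape of such a workflow: route community feedback straight into a log the team can triage, with no meetings or email in the loop (the channel name, log path, and token are hypothetical placeholders).

```python
import discord

intents = discord.Intents.default()
intents.message_content = True  # needed to read message text
client = discord.Client(intents=intents)

@client.event
async def on_message(message):
    if message.author.bot:
        return  # ignore the bot's own messages and other bots
    if getattr(message.channel, "name", None) == "feedback":
        # Append each piece of community feedback to a simple triage log.
        with open("feedback.log", "a", encoding="utf-8") as f:
            f.write(f"{message.created_at.isoformat()} {message.author}: {message.content}\n")
        await message.add_reaction("✅")  # acknowledge receipt publicly

client.run("YOUR_BOT_TOKEN")  # placeholder, not a real token
```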

Maximizing limited compute budgets

They trained their first checkpoints on under $100,000 in free GPU credits from programs like NVIDIA Inception, while skipping a $6,000 patent filing to preserve cash.

Staying problem-obsessed with users

Through product-led growth and community proximity, they discovered high-demand use cases like audiobook creation that weren't in the original roadmap.

Bottom Line

Solve one narrow technical problem exceptionally well while embedding deeply in your user community to discover real demand, rather than building the full vision from day one.

More from Stanford Online

Stanford CS153 Frontier Systems | Andreas Blattmann from Black Forest Labs on Visual Intelligence
1:01:14
Stanford Online

Andreas Blattmann, co-founder of Black Forest Labs and co-creator of Stable Diffusion, argues that visual intelligence represents the critical next frontier for AI, requiring a fundamental shift from text-centric unimodal models to multimodal systems trained on 'natural representations' (video, audio, physics) to unlock true reasoning, robotics capabilities, and higher intelligence.

about 12 hours ago · 9 points
Stanford's Code in Place Info Session with Mehran Sahami
55:37
Stanford Online

Stanford professors Mehran Sahami and Chris Peach present Code in Place, a free 6-week global Python program achieving 50-60% completion rates—over 10x higher than typical online courses—by pairing thousands of volunteer section leaders with small student cohorts for personalized, human-centric instruction.

about 13 hours ago · 9 points
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 9: Scaling Laws
1:17:57
Stanford Online

This lecture introduces scaling laws as predictive power-law relationships that enable practitioners to optimize language model training on small budgets and confidently extrapolate performance to million-dollar large-scale runs, while tracing these empirical patterns back to classical machine learning theory and sample complexity research from the 1990s.

4 days ago · 9 points