Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems
Stanford Online
Video Summary
The video features a conversation with Mati Staniszewski, founder and CEO of ElevenLabs, detailing the company's journey from its inception to its current status as a leader in AI-powered audio and speech technology. Initially inspired by the poor quality of voiceovers in Polish films, Mati and his co-founder aimed to revolutionize audio content by enabling natural, emotionally resonant voice generation in any language. They explain their early adoption of a product-led growth strategy, leveraging platforms like Discord to gather user feedback and iterate on their technology. The conversation highlights key breakthroughs in text-to-speech synthesis, voice replication, and AI dubbing, tracing the evolution of their models from 2022 to projections for 2026. An interesting fact is that ElevenLabs' models can now be run on-device, though a quality gap still exists compared to the cloud-based versions.
The discussion delves into the technical challenges and innovations, including the tradeoffs between cascaded and fused model architectures and the critical importance of data in training expressive and reliable AI voices. Mati touches upon the business aspects, such as achieving significant ARR growth and a pricing strategy based on customer value. The conversation also addresses the ethical implications of AI voice technology, focusing on safety, security, and combating misuse. They emphasize a collaborative approach, acknowledging contributions from other companies and the importance of open-source contributions. The video concludes with a look at the future of ElevenLabs, predicting a platform-centric approach to empower businesses and creators with advanced audio tools, and highlighting their work with individuals who have lost their voices and their contributions to supporting Ukraine.
Short Highlights
- ElevenLabs was founded to address the poor quality of voiceovers in foreign films, particularly in Poland, where a single narrator voices all characters.
- The company initially pursued a product-led growth strategy, engaging with creators and developers on platforms like Discord to refine its technology.
- Key innovations include advances in replicating voice characteristics, achieving natural and emotional speech delivery, and improving AI dubbing capabilities.
- ElevenLabs has grown rapidly, exceeding $430 million in ARR within 36 months, and operates with a decentralized, small-team structure.
- The future vision includes leading foundational research in audio, becoming a go-to platform for conversational AI, and ensuring responsible development of AI voice technology.
Key Details
Introduction to ElevenLabs and its Mission [0:07]
- ElevenLabs, initially known as "AI Coachella," is a company focused on frontier audio and speech technology.
- The company's origin traces back to a text-to-speech bot on Discord that gained significant traction for its ability to generate audio clips from text prompts.
- The host of the conversation was an early angel investor in ElevenLabs.
- ElevenLabs has grown into a widely used and trusted brand in the audio and speech AI space.
"we want to fix two things. We want to fix the research and foundational models around audio and voice, and then build product around that to bring that AI into more of an applied AI setting and fix the problems that our customers are facing."
The Genesis of ElevenLabs: An Observed Problem [0:50]
- Mati and his co-founder, who came from Palantir and Google respectively, aimed to build a company differently, being allergic to excessive meetings and traditional internal communication.
- They initially ran the company on Discord and explored using it as a base for text-to-speech bots.
- Gaming platforms like Discord are described as "petri dishes" for innovation, often solving complex infrastructure and user-experience problems that later influence broader industries.
- ElevenLabs' initial strategy was product-led growth (PLG): engaging with creators and developers to close the feedback loop and understand user needs and use cases.
"And that's still a big piece of our work today: we want to work with the community to find ways for them to contribute back to the product development."
The Inspiration: A Polish Dubbing Dilemma [5:04]
- The core problem that inspired ElevenLabs stemmed from the experience of watching foreign movies in Poland, where a single male narrator dubs all characters, regardless of gender or emotion.
- This led to a "terrible experience" with flat, monotone deliveries, forcing the audience to interpret emotions themselves.
- The founders envisioned a future where all content could be accessed in any language with incredible tonality and emotional depth.
- They left Google and Palantir to pursue this vision, initially operating between Warsaw and London.
"So that was the first piece and inspiration for us. We know the future is different. The future will be where you can access all types of content in any language with that incredible tonality, incredible emotions."
Research and Product Development: From Dubbing to Voice Correction [7:07]
- The initial focus was on solving the AI dubbing problem, which requires transcription, translation, and text-to-speech synthesis.
- Deep dives revealed that existing models were not advanced enough for high-quality dubbing but showed potential for creating a "Frankenstein version."
- User outreach to creators and studios revealed a demand for simpler voiceover corrections and the ability to replace their own voice in recordings.
- This led to a strategic shift to focus research on specific components and address more immediate user problems, such as voice correction and text-to-audio narration.
"So there are those three key models: transcription, translation, and the text-to-speech on the other side. And you need to fix all the components to make it good."
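The three-stage pipeline described above can be sketched as a simple function composition. The stage functions below are hypothetical stubs for illustration, not ElevenLabs APIs; a real system would call speech-to-text, machine-translation, and text-to-speech models in their place.

```python
# Illustrative sketch of a cascaded AI-dubbing pipeline: transcription,
# translation, then speech synthesis in the original speaker's voice.
# All three stage functions are invented stand-ins for real models.

def transcribe(audio: bytes) -> str:
    """Stage 1: speech-to-text. Stub returns a fixed transcript."""
    return "Hello, world"

def translate(text: str, target_lang: str) -> str:
    """Stage 2: machine translation. Stub handles one known phrase."""
    phrases = {("Hello, world", "pl"): "Witaj, świecie"}
    return phrases.get((text, target_lang), text)

def synthesize(text: str, voice_id: str) -> bytes:
    """Stage 3: text-to-speech. Stub emits a tagged placeholder."""
    return f"<audio voice={voice_id}>{text}</audio>".encode("utf-8")

def dub(audio: bytes, target_lang: str, voice_id: str) -> bytes:
    # Each stage can be evaluated and swapped independently, which is
    # why "you need to fix all the components to make it good".
    text = transcribe(audio)
    translated = translate(text, target_lang)
    return synthesize(translated, voice_id)

print(dub(b"...", "pl", "narrator").decode("utf-8"))
```

The point of the sketch is structural: overall dubbing quality is bounded by the weakest of the three stages, since each one consumes the previous stage's output.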
The AI Pipeline and Early 2022 Breakthroughs [9:25]
- The initial AI pipeline considered at ElevenLabs involved speech-to-text, an LLM for reasoning, and then text-to-speech.
- In 2022, with crypto and the metaverse dominating the tech discourse, ElevenLabs focused on improving the "last mile" of generation: making text sound human and emotional.
- Key challenges at the time included the inability to replicate voice characteristics accurately and to maintain consistent emotional delivery based on context.
- Innovations focused on allowing models to learn voice parameters more abstractly rather than hard-coding them, and incorporating context from LLMs to improve delivery.
"So you decided not to innovate on the transcription part or the LLM part, just the last mile, which was generation. And you said the mission there was: let's try to improve the state of the art; it had to sound natural."
Groundwork and Early Investment: Open Source and Compute [13:09]
- The initial development was funded by personal savings.
- Being close to users was crucial for quick, interactive feedback loops.
- They explored open-source models, finding inspiration in Tortoise TTS from James Betker, which produced human-like audio but was slow and unstable.
- The research also drew from papers on diffusion models and transformer architectures, combining these ideas for a novel text-to-speech and voice creation approach.
- Initial compute costs were in the tens of thousands of dollars, considered significant at the time, and they opted against expensive patent filings.
"So one thing that I recommend, if you are looking to start a side project or a company, is to look through all the accelerator and credits programs from big companies that give you free compute and free credits."
The Evolution of ElevenLabs: Models, Platforms, and Applications [17:39]
- By 2022, ElevenLabs had a clear focus on voice flexibility and generation.
- Their research expanded beyond text-to-speech to include transcription, broader AI dubbing, conversational AI, and even music generation.
- They developed a platform to help businesses and solo creators transform audience interaction through agents, support, sales, and marketing.
- A significant milestone was the AI dubbing of Javier Milei's speech in 2024, demonstrating the integration of transcription, LLM translation, and speech generation.
"So the entirety of audio and how voice models or audio models work together with other modalities, and then alongside build a platform that helps businesses, developers, or solo creators transform how they interact with their audience or their people, through agents and support and sales, with creative tools and marketing and storytelling."
The Frontier of Understanding: Beyond Transcription to Semantic Audio Comprehension [22:04]
- The discussion shifts to the future, focusing on the need for AI systems, particularly voice agents, to have deeper reasoning capabilities, understanding tone, inflection, and emotion, which pure transcription misses.
- Current systems like ChatGPT's advanced voice mode still struggle to interpret emotional states from speech.
- The challenge lies in moving beyond simple transcription to a more profound semantic understanding of audio input.
"Especially an audio agent that can understand the tone, the voice, the inflection, the accent of what's coming in, because you lose all that context when you transcribe just pure audio."
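The claim that transcription discards most of an utterance's meaning can be made concrete with a small data structure. The fields and example labels below are illustrative assumptions, not any real model's output schema:

```python
# Contrast what plain speech-to-text preserves with the richer semantic
# signal a voice agent would need. Field names and values are invented.
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str  # everything that pure transcription keeps

@dataclass
class AudioUnderstanding:
    text: str
    emotion: str          # e.g. "frustrated" -- lost by transcription
    tone: str             # e.g. "sarcastic" vs. "sincere"
    accent: str           # speaker's accent, also lost
    emphasis: list[str]   # words the speaker stressed

# The same utterance produces very different inputs for a downstream agent:
flat = Transcript(text="Great, another outage.")
rich = AudioUnderstanding(
    text="Great, another outage.",
    emotion="frustrated",
    tone="sarcastic",
    accent="British",
    emphasis=["another"],
)

# Identical words -- a text-only agent cannot distinguish the sarcastic
# reading from a sincere one.
print(flat.text == rich.text)
```

A fused or audio-native model would consume something closer to the rich representation directly, rather than reconstructing it from text.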
Cascaded vs. Fused Architectures: Navigating the AI Pipeline [23:56]
- Two primary approaches for AI audio systems are discussed: cascaded (separate models for transcription, LLM, TTS) and fused (a single, integrated model).
- The cascaded approach offers better reliability and control for enterprise use cases, while the fused approach may offer speed advantages for applications where reliability is less critical.
- ElevenLabs leans towards the cascaded approach for business applications due to its emphasis on reliability and intelligence.
- Key parameters for evaluation include quality/emotionality, reliability, and latency, with expressivity and emotionality being areas of significant recent breakthroughs.
"We think the cascaded approach is the right thing for the next few years. And if you are thinking about places where maybe the reliability isn't as essential but the speed is, we think the fused approach will be…"
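The tradeoff between the two architectures can be illustrated with back-of-the-envelope numbers. The latencies below are invented purely for the example; real figures depend entirely on the models and hardware involved:

```python
# Toy comparison of cascaded vs. fused voice-agent architectures,
# using made-up latency numbers.

CASCADED_STAGES_MS = {"speech_to_text": 150, "llm": 400, "text_to_speech": 120}
FUSED_MS = 450  # one integrated speech-to-speech model

def cascaded_latency_ms() -> int:
    # Stages run in sequence, so their latencies add up...
    return sum(CASCADED_STAGES_MS.values())

def pick_architecture(reliability_critical: bool) -> str:
    # ...but each stage can be tested, monitored, and swapped on its own,
    # which is why the talk favors cascading for enterprise use cases.
    return "cascaded" if reliability_critical else "fused"

print(cascaded_latency_ms())        # 670 with the numbers above
print(pick_architecture(True))
```

Under these assumed numbers the fused model wins on end-to-end latency, while the cascaded pipeline wins on per-stage observability and control, matching the talk's framing of the quality/reliability/latency evaluation axes.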
Collaboration and Ecosystem: Beyond Competition [30:53]
- Mati's emphasis on acknowledging other teams, like Sesame, highlights a collaborative ethos in the AI industry.
- This contrasts with a purely competitive mindset, reflecting the understanding that collective progress is crucial in a nascent field like voice AI.
- ElevenLabs and Sesame have a history of collaboration, including angel investing and open-sourcing of research.
- The speaker stresses that industry labels and categories are often artificial constructs, and that human collaboration drives frontier progress.
"you can go further together, especially in a new space like this, where a project often only seems competitive because the VCs or the business ecosystem are trying to create some nice landscape slide that says: here's audio AI and here are the logos, and here's visual AI."
Business Success and Revenue Generation: Value-Driven Growth [35:11]
- ElevenLabs has achieved remarkable revenue growth, exceeding $430 million ARR in 36 months.
- This success is attributed to providing clear value to customers, with a focus on applied AI and transforming how businesses interact with their audiences.
- The company maintains a structure of small, highly autonomous teams, emphasizing ownership and rapid decision-making.
- Pricing and packaging are determined by the value delivered to the customer, working backward from that assessment rather than from the cost of production.
"So never start from the cost start from the value and work backwards from there."
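The "start from the value, work backwards" principle can be shown as a tiny calculation. All numbers and the 10% capture rate below are invented for illustration; the point is only that price derives from customer value, with cost acting as a floor rather than a starting point:

```python
# Toy illustration of value-based pricing: charge a share of the value
# delivered to the customer, never below unit cost. All numbers invented.

def value_based_price(customer_value: float, capture_rate: float,
                      unit_cost: float) -> float:
    """Work backwards from value; cost only sets a lower bound."""
    return max(customer_value * capture_rate, unit_cost)

# If AI dubbing saves a studio $1,000 versus a human voiceover session,
# capturing 10% of that value prices the job at $100, regardless of
# whether the compute behind it cost $2 or $20.
print(value_based_price(1000.0, 0.10, 2.0))    # 100.0
print(value_based_price(1000.0, 0.10, 150.0))  # 150.0 -- cost floor applies
```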
Security, Safety, and Combating Misuse of Voice AI [42:45]
- ElevenLabs builds safety into its models from the start, including traceability, moderation, and fraud detection.
- They advocate for publicly available systems for AI detection and watermarking.
- Voice authentication in banking is seen as a security risk, with a recommendation to move away from it.
- An example of using voice agents against scammers to waste their time is shared, highlighting "counter-offensive" strategies.
"We think this is not the future, and you should step away from this and not use that as an authentication method. We think that, from the security perspective, it's the wrong approach."
Bottlenecks, Innovation, and the Future of ElevenLabs [44:32]
- Key bottlenecks include securing incredible talent, advancing architectural research, and obtaining sufficient compute.
- A major research goal is to achieve truly interactive AI that understands emotion and can pull knowledge from systems, acting as a personalized extension of the user.
- The future vision for ElevenLabs involves leading in foundational audio research, becoming a dominant platform for conversational AI, and continuing to innovate across models and applications.
- They aim to provide tools that allow businesses and creators to seamlessly build AI-powered applications.
"So maybe the non-obvious one is: every time you interact with a different service, you interact with a different experience. You will have your own preference of what's good for you and what's good for everybody else."
Real-World Impact: Restoring Voices and Supporting Ukraine [51:12]
- ElevenLabs is proud of its work with individuals who have lost their voices due to conditions like ALS or cancer, synthesizing their voices back for communication.
- They have also collaborated with the Ukrainian government, integrating voice capabilities into the Diia citizen app to provide accessible government services and information, particularly in challenging circumstances.
- This work highlights the potential for AI to serve critical humanitarian and governmental needs.
"One of the proudest pieces of work we do at ElevenLabs is actually working with people that lost their voice, and we can bring it back. So people with ALS or throat cancer. And so far we've been able to work with almost 10,000 people that lost it, and we could synthesize it back so they could communicate naturally."
Geopolitical Considerations: China and Open Source [54:46]
- The discussion touches upon the AI race between the West and China, with ElevenLabs actively working to prevent distillation attacks.
- While acknowledging strong models emerging from China, especially for specific language nuances, ElevenLabs aims to outcompete them through better service and innovation.
- A dichotomy is noted in how IP is approached, with different strategies in Western versus Chinese companies, particularly concerning Disney and Netflix IP.
- The importance of a thriving open-source ecosystem is emphasized for continued innovation and customization.
"So there are, I think, two different parts: one is how you think about the wider ecosystem participating in the model, and the other is how you bring the safety parameters around the models in tandem. And in general, of course, it's very helpful that the open-source ecosystem continues."
Studios, Economics, and the Future of Content Creation [59:32]
- Studios are hesitant to fully adopt AI voiceovers due to fear of "AI slop" and backlash, as well as unfigured-out economic models.
- ElevenLabs advocates for a "middle-to-middle" approach, where AI tools augment the creative process rather than replacing it entirely.
- A breakthrough in controllability of AI voice delivery has led to increased studio adoption in the last six months.
- The economics of AI voiceovers, including respecting IP and determining fair pricing, remain a key challenge.
"So to make it more specific: until very recently, on the speech side at least, you would give a model a text, you would rely on the model to read out the text in the way the model thought was best, and you could regenerate it."
On-Device Models and Future Platform Strategy [01:03:09]
- ElevenLabs has figured out how to bring models on-device, though quality is currently lower than the cloud versions.
- The focus is on fixing quality first before prioritizing on-device deployment.
- The future of ElevenLabs involves being the go-to platform for conversational AI, providing not just models but also the tooling businesses and creators need to integrate AI into their workflows.
- This includes enabling advanced applications like interactive support, sales, marketing, and internal training.
"So the on-device version will do text-to-speech, but you still won't have the wider transcription interactivity, how you transfer the emotions from one side to the other, or how you make it with additional reliability elements built in."
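The on-device/cloud tradeoff described above amounts to a routing decision. The quality scores and policy below are illustrative assumptions, not measurements of any real model:

```python
# Sketch of routing between a cloud and an on-device model: the on-device
# version is always available but currently lower quality. Scores invented.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    quality: float       # relative quality score, 0..1 (assumed)
    needs_network: bool

CLOUD = Model("cloud-tts", quality=0.95, needs_network=True)
ON_DEVICE = Model("device-tts", quality=0.80, needs_network=False)

def pick_model(online: bool, min_quality: float) -> Model | None:
    """Prefer the cloud model; fall back on-device when offline."""
    for model in (CLOUD, ON_DEVICE):
        if model.needs_network and not online:
            continue  # cloud model unreachable without a network
        if model.quality >= min_quality:
            return model
    return None  # no reachable model clears the quality bar

print(pick_model(online=True, min_quality=0.9).name)   # cloud-tts
print(pick_model(online=False, min_quality=0.9))       # None: gap too large
print(pick_model(online=False, min_quality=0.7).name)  # device-tts
```

The middle case is the one the talk highlights: until the quality gap closes, offline use either accepts the lower-quality model or gets nothing.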