The New Gemini Text to Speech

Prompt Engineering

Video Summary

The video explores Gemini Text-to-Speech (TTS), a new speech generation model built on the Gemini series that rivals dedicated TTS models. Its strength lies in its foundation as a large language model: users describe the desired speech effect, such as a dramatic movie-trailer voice or a whisper, in a natural language prompt. This approach eliminates the need for specialized tokens and enables multiple speakers, emotions, and accents. The Gemini TTS models are currently in preview and available via the API in both 2.5 Pro and 2.5 Flash versions, with Flash noted for its speed and closer adherence to instructions. Surprisingly, the models can also follow fine-grained directions, such as a pause within a sentence, and they support a wide range of languages, including less common ones.

Short Highlights

Gemini Text-to-Speech models, built on the Gemini series, offer high-quality speech generation comparable to dedicated TTS models.
Users can describe desired speech effects, emotions, and tones using natural language prompts, eliminating the need for specialized tokens.
The models support generating multiple speakers, various accents, and a wide array of emotions.
Both Gemini 2.5 Pro and 2.5 Flash versions are available via API, with Flash noted for speed and instruction adherence.
The models support speech output in up to 24 languages, including European, Eastern, and Asian languages.

Key Details

Gemini Text-to-Speech Overview [00:00]

Gemini Text-to-Speech models are built on the Gemini series and rival the quality of dedicated speech models.
A key advantage is their LLM foundation, allowing users to describe desired speech effects and emotions using natural language.
This enables the generation of specific effects like dramatic movie trailer voices with epic pauses.

"The beauty is that you don't need any specialized tokens for special effects anymore."

Enhanced Capabilities and Accessibility [00:50]

The models can generate multiple speakers simply by describing what each speaker should say.
They can produce a range of emotions and accents, and can even whisper and shout, making the output more dynamic.
The models are in preview and offer a good indication of Google's direction in speech and voice technology.
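
The multi-speaker idea can be sketched as a labelled transcript plus a per-speaker voice mapping. This is a rough illustration, not the video's notebook: the helper names are hypothetical, the config keys assume the dict form accepted by the `google-genai` Python SDK, and the voice names "Kore" and "Puck" are assumed to be available pre-built voices.

```python
def format_dialogue(turns: list[tuple[str, str]]) -> str:
    """Render (speaker, line) pairs as a 'Speaker: line' transcript."""
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)


def build_multi_speaker_config(voices: dict[str, str]) -> dict:
    """Map each speaker label to a pre-built voice (assumed config layout)."""
    return {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "multi_speaker_voice_config": {
                "speaker_voice_configs": [
                    {
                        "speaker": speaker,
                        "voice_config": {
                            "prebuilt_voice_config": {"voice_name": voice}
                        },
                    }
                    for speaker, voice in voices.items()
                ]
            }
        },
    }


dialogue = format_dialogue([
    ("Ben", "I've heard TTS before. They all sound robotic."),
    ("Ana", "No, no, no. This one is different."),
])
# The leading sentence describes the speakers in plain language.
multi_prompt = (
    "Read this conversation aloud. Ben is skeptical; Ana is excited:\n" + dialogue
)
multi_config = build_multi_speaker_config({"Ben": "Puck", "Ana": "Kore"})
print(multi_prompt)
```

The speaker labels in the transcript must match the labels in the voice mapping; the style description stays in ordinary prose.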

"Me, I've heard TTS before. They all sound robotic. No, no, no. This one is different."

Building with Gemini TTS API [02:43]

A quick notebook example is provided for developers to build on top of the API.
Users need a Gemini API key and the Google Gen AI SDK (version 1.16 or later).
Both Gemini 2.5 Pro and 2.5 Flash are available for TTS, with Flash often performing better due to speed and instruction adherence.

"First it's a lot faster. Second it seems to stick to the instructions more closely compared to the pro version."

Core Functionality and Natural Language Control [03:44]

Basic TTS usage of the SDK mirrors an ordinary LLM call: supply a model ID and set the response modality to audio.
Prompts can be structured with instructions for the model and the actual content of the speech.
The model can interpret and apply described effects, such as pauses, with remarkable accuracy.
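
A minimal sketch of that call shape, assuming the `google-genai` Python SDK and the preview model ID `gemini-2.5-flash-preview-tts` (the summary gives neither the exact ID nor the config keys, so treat both as assumptions). The helper names are hypothetical, and the SDK import is deferred so the prompt logic can be tried without an API key.

```python
import os

# Assumed preview model ID; check the current model list before relying on it.
TTS_MODEL_ID = "gemini-2.5-flash-preview-tts"


def build_tts_prompt(instructions: str, text: str) -> str:
    """Combine style instructions and the words to speak into one prompt."""
    return f"{instructions.strip()}\n\n{text.strip()}"


def build_audio_config() -> dict:
    """Ask for audio output instead of text (dict form of the generation config)."""
    return {"response_modalities": ["AUDIO"]}


def synthesize(prompt: str) -> bytes:
    """Send the prompt to the TTS model and return the raw audio bytes.

    Needs `pip install google-genai` and a GEMINI_API_KEY environment variable.
    """
    from google import genai  # imported lazily so the helpers above work offline

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model=TTS_MODEL_ID,
        contents=prompt,
        config=build_audio_config(),
    )
    # Assumes the audio arrives as inline data on the first part of the reply.
    return response.candidates[0].content.parts[0].inline_data.data


prompt = build_tts_prompt(
    "Read this like a dramatic movie trailer, with an epic pause before the last word:",
    "In a world of robotic voices, one model dared to sound... human.",
)
print(prompt)
```

Calling `synthesize(prompt)` would perform the actual request; the effect description lives entirely in the prompt text, with no special tokens.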

"You can naturally describe what the effects are going to be and the model will stick to it for the most part."

Voice Selection and Advanced Control [05:03]

Users can select specific pre-built voices provided by Google through configurations.
The models can adhere to specific instructions like pausing for a set duration within a sentence.
A large number of languages are supported, allowing for speech generation in various linguistic contexts.
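
Voice selection can be sketched as one more entry in the generation config. The voice name "Kore" and the config key layout are assumptions based on the `google-genai` SDK's dict form, and the output format (24 kHz, 16-bit, mono PCM) is likewise assumed rather than stated in the video. The WAV helper uses only the standard library.

```python
import os
import tempfile
import wave


def build_voice_config(voice_name: str) -> dict:
    """Generation config (dict form) requesting audio in a named pre-built voice."""
    return {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {"voice_name": voice_name}
            }
        },
    }


def save_pcm_as_wav(path: str, pcm: bytes, rate: int = 24000,
                    channels: int = 1, sample_width: int = 2) -> None:
    """Wrap raw PCM samples in a WAV container so ordinary players can open them."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample: 2 means 16-bit
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm)


config = build_voice_config("Kore")  # "Kore" is an assumed voice name

# One second of silence stands in for audio bytes returned by the API.
demo_path = os.path.join(tempfile.gettempdir(), "gemini_tts_demo.wav")
save_pcm_as_wav(demo_path, b"\x00\x00" * 24000)
```

In real use, the bytes returned by the model would replace the silent placeholder.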

"So for example, we can play this. [05:15] >> I am a very knowledgeable model, especially when using grounding, don't you think?"

Multilingual Support and Prompting Guidance [06:02]

Prompting allows control over style, tone, accent, and pace using natural language.
The models can generate speech in up to 24 different languages, including European, Eastern, and Asian languages like Arabic and Hindi.
This broad language support is noted as a significant advantage, as these languages are often omitted in other systems.

"And the good news is that it has support for not only European languages but some of the eastern and Asian languages as well including Arabic, Hindi which is pretty nice..."

Context Window and Prompting Structure [10:48]

The context window for these TTS models is 32,000 tokens, differing from the 1 million token window of the base Gemini model.
Google recommends a three-part prompting structure: an audio profile (character identity), a scene description (environment and vibe), and director's notes (performance guidance).
This structured prompting allows for the generation of well-crafted speech outputs.
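
The three-part structure can be sketched as a small template. The section labels and the class name are hypothetical; the video names the three parts but does not prescribe an exact layout, so this is one plausible arrangement rather than Google's format.

```python
from dataclasses import dataclass


@dataclass
class TTSPromptSpec:
    audio_profile: str    # who is speaking: core identity and archetype
    scene: str            # physical environment and emotional vibe
    directors_notes: str  # style, accent, and pace guidance
    transcript: str       # the words to be spoken

    def render(self) -> str:
        """Join the labelled sections into a single prompt string."""
        sections = [
            ("AUDIO PROFILE", self.audio_profile),
            ("SCENE", self.scene),
            ("DIRECTOR'S NOTES", self.directors_notes),
            ("TRANSCRIPT", self.transcript),
        ]
        return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)


spec = TTSPromptSpec(
    audio_profile="A veteran movie-trailer narrator with a deep, gravelly voice.",
    scene="A dark cinema just before the feature starts; tense and expectant.",
    directors_notes="Slow pace, low pitch, and a long pause before the final word.",
    transcript="This summer, one voice will change... everything.",
)
structured_prompt = spec.render()
print(structured_prompt)
```

Keeping the transcript in its own section makes it easier to change the performance guidance without touching the words to be spoken.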

"So, an audio profile that defines the character's core identity and archetype. Then, a scene description that establishes the physical environment and emotional vibe. and then director nodes that offer more precise performance guidance regarding style, accent and pace control."

Pricing and Batch Processing [12:21]

Pricing for the Flash version is $0.50 per million input tokens (text) and $10 per million output tokens (audio).
The Pro version costs double the Flash price; Flash is generally sufficient for most tasks.
Batch processing reduces the pricing to half of the original cost, making it more economical for bulk generation.
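
The arithmetic behind these figures can be sketched as a small estimator. It assumes the quoted rates are the Flash rates, that Pro doubles both input and output rates, and that batch processing halves the whole bill; preview pricing may change, and the token counts in the example are made up.

```python
FLASH_INPUT_USD_PER_MILLION = 0.50    # text tokens in
FLASH_OUTPUT_USD_PER_MILLION = 10.00  # audio tokens out
PRO_MULTIPLIER = 2.0                  # assumed to apply to input and output alike
BATCH_MULTIPLIER = 0.5                # assumed to apply to the total


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      pro: bool = False, batch: bool = False) -> float:
    """Estimate the cost of one TTS job from its token counts."""
    cost = (
        input_tokens * FLASH_INPUT_USD_PER_MILLION
        + output_tokens * FLASH_OUTPUT_USD_PER_MILLION
    ) / 1_000_000
    if pro:
        cost *= PRO_MULTIPLIER
    if batch:
        cost *= BATCH_MULTIPLIER
    return cost


# Hypothetical job: 2,000 text tokens in, 50,000 audio tokens out.
flash_cost = estimate_cost_usd(2_000, 50_000)
pro_cost = estimate_cost_usd(2_000, 50_000, pro=True)
flash_batch_cost = estimate_cost_usd(2_000, 50_000, batch=True)
print(f"Flash: ${flash_cost:.4f}  Pro: ${pro_cost:.4f}  Flash batch: ${flash_batch_cost:.4f}")
```

Output tokens dominate the bill at these rates, so the length of the generated audio matters far more than the length of the prompt.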

"Anyways, it's a really great option, especially for people who are thinking of building applications that are going to be powered by AI voice and speech."