Build an AI app that watches videos using Gemini

Google Cloud Tech


Video Summary

A revolutionary approach streamlines the creation of content from videos. Instead of a complex pipeline involving audio extraction, speech-to-text, OCR for slides, and summarization, a single API call to a multimodal AI model can achieve the same results. This model can process video content, including audio and visual elements, to generate outputs like blog posts and header images based on a provided URL and a well-crafted prompt.

The process is remarkably efficient. A YouTube URL is fed into an application, which then makes a single API call to the AI. The AI analyzes the video, understands its content, and generates a blog post, complete with a header image generated by a second API call based on the blog post's title. This entire operation, including the complex tasks of understanding video, audio, and generating text and images, can be accomplished with just two API calls and minimal supporting code, significantly reducing development effort.

The cost-effectiveness and flexibility of this method are also highlighted. Pricing is based on token usage, with initial free quotas and options to reduce costs by using lower resolution. The API can also process video data directly from uploads or cloud storage, not just YouTube URLs. Prompt engineering is presented as a key element, with the AI itself being used to generate initial prompts that are then refined through iteration, adding context, defining roles, and establishing layout rules for optimal output. This pattern unlocks new possibilities beyond blog posts, such as creating audio scripts or video highlight reels, all powered by the AI's multimodal capabilities and the strategic use of prompts.

Short Highlights

  • A complex pipeline for processing video content (audio extraction, speech-to-text, OCR, summarization) can be replaced by a single API call to a multimodal AI model.
  • The AI can ingest a video URL and generate a complete blog post with a header image, requiring only the URL and a text prompt.
  • The cost of processing a one-minute video is approximately half a US cent after exhausting a daily free quota, with options to reduce costs further.
  • The AI can handle video data from various sources, including direct uploads and cloud storage, in addition to YouTube URLs.
  • Prompt engineering is crucial, with initial prompts being generated by the AI and then iteratively refined by adding context, defining roles, and setting layout rules.

Key Details

Streamlining Video Content Creation with a Single API Call [0:00]

  • Traditionally, building an app to process video involved a complex pipeline: pulling audio, using speech-to-text, OCR for slides, and a summarizer.
  • A new approach replaces this entire pipeline with a single API call to Gemini 2.5.
  • This AI model can watch a video and generate content based on its analysis.

The entire pipeline is now just a single API call to Gemini 2.5.

Demonstrating the Simplified Process: A YouTube to Blog Post App [0:30]

  • An application allows users to input a YouTube URL and generate a blog post and header image.
  • The application makes a single API call to Gemini after receiving the YouTube link.
  • The generated blog post content is based on the video's content and a specific prompt.

So the code made an API call to Gemini, and now it's thinking. How long it takes depends on the video's length, but I picked a short one. Let's give it a minute.

The Core Logic: API Call and Prompting [0:59]

  • The application's core logic involves a function that takes a YouTube link and a model name.
  • It then constructs a request with the link and a text prompt, sending it to Gemini through Google's GenAI SDK.
  • The AI processes the video, including audio and visual aspects, and writes the article based on the prompt, without needing a pre-supplied transcript.

That one call tells Gemini to watch the video, listen to the audio, and write the whole article based on the prompt.
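As a rough sketch of what that one call carries, the request pairs the video URI with the text prompt in a single payload. The helper name and prompt text below are hypothetical; the field names follow the Gemini generateContent request shape, where a video part and a text part travel together.

```python
import json

# Model name quoted in the video.
GEMINI_MODEL = "gemini-2.5-flash"

def build_blog_request(youtube_link: str, prompt: str) -> dict:
    """Build a generateContent-style request body: one video part, one text part."""
    return {
        "contents": [{
            "parts": [
                {"file_data": {"file_uri": youtube_link}},  # the video to watch
                {"text": prompt},                           # what to do with it
            ]
        }]
    }

body = build_blog_request(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder link
    "Watch this video and write a blog post about it.",
)
print(json.dumps(body, indent=2))
```

The point the section makes survives in the shape of this payload: there is no transcript field anywhere, just the video reference and the instructions.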

Crafting the Prompt for Content Generation [2:12]

  • The effectiveness of the output heavily relies on the prompt provided to the AI.
  • A get_blog_gen_prompt function returns the prompt, which includes a persona for the AI to adopt and detailed instructions in markdown format.
  • The prompt guides the AI on how to structure and generate the content.

And this is what the prompt looks like without any code around it.
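The summary describes the prompt's ingredients (a persona plus markdown layout rules) without reproducing its exact text, so the following is a hypothetical reconstruction of what such a get_blog_gen_prompt function could return, not the app's actual wording.

```python
def get_blog_gen_prompt() -> str:
    """Return a blog-generation prompt: persona first, then markdown layout rules."""
    return "\n".join([
        "You are a technical writer for a developer blog.",  # the persona
        "",
        "Watch the video, including its audio, and write a blog post about it.",
        "",
        "## Layout rules",
        "- Start with a short, catchy title.",
        "- Use markdown headings for each major topic.",
        "- Keep paragraphs under four sentences.",
        "- End with a one-paragraph summary.",
    ])

print(get_blog_gen_prompt())
```

Keeping the prompt in its own function, as the app does, makes it easy to swap or iterate on without touching the API-call code.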

Generating a Header Image with a Second API Call [2:43]

  • A separate API call is used to generate the header image for the blog post.
  • This function takes the blog post title as input and creates a prompt for an image model.
  • The image model then generates a single PNG image, which is encoded for web display.

Aha, that's the second API call. It's in the generate_image function.
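Two pieces of that function can be sketched without the image-model call itself: deriving an image prompt from the blog post's title, and encoding the returned PNG for web display. Both helper names here are illustrative, and the prompt wording is an assumption; base64 data URIs are a standard way to embed a generated PNG in a page.

```python
import base64

def make_image_prompt(blog_title: str) -> str:
    """Turn the blog post's title into a prompt for the image model (wording is a guess)."""
    return f"A clean, modern blog header illustration for: {blog_title}"

def png_to_data_uri(png_bytes: bytes) -> str:
    """Base64-encode raw PNG bytes so a browser can render them in an <img> tag."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"
```

The image model's output bytes go straight through png_to_data_uri and into the page, so no file ever needs to be written to disk.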

Cost and Flexibility of the AI Service [3:21]

  • The cost is calculated based on token usage, with 1 second of video equating to approximately 300 tokens.
  • A one-minute video costs about half a US cent at the current price of Gemini 2.5 Flash (30 cents per million tokens), after a daily free quota is used.
  • Costs can be reduced by two-thirds by switching to low-resolution processing.
  • The API can process video data directly, not just YouTube URLs, accepting MP4 uploads or pointing to videos in cloud storage.

At the current price of Gemini 2.5 Flash, which is 30 cents per million tokens, that one minute would cost about half a US cent.
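The cost figures above reduce to simple arithmetic: 300 tokens per second of video and 30 cents per million tokens are the numbers quoted in the video, and the low-resolution option is modeled here as a flat two-thirds discount.

```python
# Figures quoted in the video.
TOKENS_PER_SECOND = 300
PRICE_PER_MILLION_TOKENS_USD = 0.30

def video_cost_usd(seconds: float, low_res: bool = False) -> float:
    """Estimate the Gemini 2.5 Flash cost of processing a video of the given length."""
    tokens = seconds * TOKENS_PER_SECOND
    cost = tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD
    if low_res:
        cost *= 1 / 3  # low resolution cuts the cost by about two-thirds
    return cost

print(f"1-minute video: ${video_cost_usd(60):.4f}")  # → $0.0054, about half a US cent
```

A one-minute video is 60 × 300 = 18,000 tokens, and 18,000 tokens at $0.30 per million is $0.0054, matching the "about half a US cent" figure.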

The Iterative Process of Prompt Engineering [4:10]

  • Prompt engineering is an iterative process, not always perfect on the first try.
  • The AI can be used to generate an initial prompt, which then requires refinement in tone and structure.
  • This involves adding background context, defining the AI's role, and establishing clear layout rules.

I actually had the AI generate the first prompt as a starting point. But the tone and structure needed work. From there, it was all about iterating.

Versatility Beyond Blog Posts: Customizing Output with Prompts [4:34]

  • The prompt can be modified to generate different types of content, such as bullet-point summaries, notes, or quizzes.
  • Prompts can be stored externally in text files or databases, allowing for app updates without code deployment.
  • This flexibility makes it easy to change the app's output format based on user needs.

Well, just change the prompt. That's the beauty of Gemini.
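The externalized-prompt idea above can be sketched in a few lines: the app reads its prompt from a text file at request time, so editing the file changes the output format with no redeploy. The filenames here are hypothetical examples.

```python
from pathlib import Path

def load_prompt(path: str) -> str:
    """Read the active prompt from an external text file at request time."""
    return Path(path).read_text(encoding="utf-8").strip()

# Swapping which file the app points at changes its behavior without a deploy:
# load_prompt("prompts/blog_post.txt")  -> the blog-post app
# load_prompt("prompts/quiz.txt")       -> the same app, now a quiz generator
```

The same pattern works with a database row instead of a file; the key point is that the prompt lives outside the deployed code.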

The Multimodal Pattern: Beyond Text and Images [4:56]

  • This approach represents a pattern where a multimodal AI takes in text, audio, and video, and can output in these formats as well.
  • This opens up possibilities for creating audio scripts from blog posts or editing meeting videos into highlight reels.
  • The core concept is the power of a single API call to a multimodal model combined with effective prompt engineering.

Multimodal models like Gemini can do a lot with just one API call, and there's a lot of power in the prompt.
