Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind

AI Engineer

| Podcasts | May 18, 2026 | 1.6 Thousand views | 1:17:14

TL;DR

Guillaume Vernade from Google DeepMind demonstrates how to build multimodal content pipelines using the new GenMedia suite (Nano Banana 2, Veo 3.1, and Lyria) via the Gemini Developer API, showcasing a live workshop that transforms text into illustrated books with AI-generated images, video, and music.

🎨 GenMedia Model Ecosystem 4 insights

Nano Banana 2 adds image grounding and 4K support

The latest image generation model supports aspect ratios from 520px to 4K and introduces image grounding, allowing the model to search and reference real-world images for architectural or biological accuracy.

Veo 3.1 Light enables cheap video iteration at $0.05/second

Designed for rapid prototyping, this lightweight video model costs approximately 40 cents per video, allowing developers to test prompts before upscaling to higher quality tiers.

Lyria Real-Time uses predictive architecture for live music

Unlike diffusion-based models, this system generates music continuously with only 2 seconds of latency, enabling real-time style mixing and DJ-like transitions via live prompting.

DeepMind maintains high-velocity release cycles

GenMedia models ship updates monthly on average, while the broader DeepMind organization releases new features approximately every five days, including the recent Gemma 4 launch.

⚙️ Developer Platform Strategy 3 insights

Developer API bridges consumer and enterprise tiers

Positioned between AI Studio (consumer) and Vertex AI (enterprise), the Gemini Developer API offers simplified access while maintaining SDK compatibility for seamless migration to Vertex when production-grade controls are needed.

New service tiers optimize for cost vs. latency

Developers can choose Flex tier (50% discount, delayed processing) for batch jobs or Priority tier (2x cost, guaranteed fast track) for real-time applications, with automatic retry logic handling peak-load failures.

File upload API abstracts cloud storage complexity

The Developer API automatically handles bucket creation and ACL management that Vertex AI requires manual configuration for, allowing direct file uploads accessible to models without infrastructure setup.

📚 Multimodal Application Architecture 3 insights

Chat mode maintains stylistic consistency across long-form content

Leveraging Gemini's large context windows, the chat mode retains historical generation data (e.g., character descriptions from earlier chapters) to ensure visual consistency when illustrating books or serialized content.

Structured output enforces predictable content pipelines

JSON schema constraints enable reliable parsing of model outputs, allowing automated systems to extract specific fields like character names, scene descriptions, and prompt metadata for downstream media generation.

Cost-effective prototyping for multimedia books

The demonstrated pipeline processing 'The Wind in the Willows' costs approximately $1 per run, utilizing Nano Banana 2's free tier for image testing and reserving paid Veo generation for final video assets.

Bottom Line

Developers should leverage the Gemini Developer API's chat mode with structured JSON outputs to build cost-effective multimodal pipelines, selecting Flex tier for batch processing and Priority tier only when low latency is critical, while using the file upload API to avoid cloud storage configuration overhead.

Watch on YouTube

More from AI Engineer

Frontier results, on device - RL Nabors, Arize

AI Engineer

Frontier results, on device - RL Nabors, Arize

Rachel Lee Neighbors introduces a framework for replacing expensive cloud-based frontier models with Small Language Models (SLMs) running on-device, demonstrating how a systematic 'prototype big, deploy small' approach using evaluation tools like Phoenix can cut inference costs to zero while maintaining 90% accuracy and enabling offline functionality.

6 days ago · 10 points

The Future Is Domain-Specific Agents - Justin Schroeder, StandardAgents

AI Engineer

The Future Is Domain-Specific Agents - Justin Schroeder, StandardAgents

Justin Schroeder argues that the future of AI lies in domain-specific agents—small, specialized agents that compose together rather than general-purpose agents bloated with tools and skills, delivering 80%+ token efficiency and 137x cost savings compared to monolithic approaches.

6 days ago · 9 points

The Agentic AI Engineer - Benedikt Sanftl, Mutagent

AI Engineer

The Agentic AI Engineer - Benedikt Sanftl, Mutagent

Benedikt Sanftl and Burak from Mutagent present the 'Agentic AI Engineer' paradigm, where specialized AI agents autonomously manage the entire lifecycle of building, evaluating, and optimizing other agents through automated offline and online loops, solving the scalability bottlenecks of manual development.

6 days ago · 10 points

Bypassing the Multimodal Tax: Hybrid RAG, SQL RRF & UI Telemetry - Abed Matini, Ogilvy

AI Engineer

Bypassing the Multimodal Tax: Hybrid RAG, SQL RRF & UI Telemetry - Abed Matini, Ogilvy

Abed Matini presents a framework-free Hybrid RAG architecture that eliminates pre-query token costs by preprocessing documents locally using DocLink and multiple chunking strategies, while implementing SQL-based Reciprocal Rank Fusion and LangFuse telemetry for production observability.

6 days ago · 10 points

Browse more: 🎙️ Podcasts All Videos All Categories