Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind
TL;DR
Guillaume Vernade from Google DeepMind demonstrates how to build multimodal content pipelines with the new GenMedia suite (Nano Banana 2, Veo 3.1, and Lyria) through the Gemini Developer API. In a live workshop, he turns the text of a book into an illustrated edition with AI-generated images, video, and music.
🎨 GenMedia Model Ecosystem 4 insights
Nano Banana 2 adds image grounding and 4K support
The latest image generation model supports output resolutions from 520px to 4K and introduces image grounding, which lets the model search for and reference real-world images when architectural or biological accuracy matters.
Veo 3.1 Light enables cheap video iteration at $0.05/second
Designed for rapid prototyping, this lightweight video model costs approximately 40 cents per video (roughly eight seconds of footage at the quoted per-second rate), so developers can test prompts cheaply before moving up to higher-quality tiers.
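The per-video figure follows from the per-second rate. The sketch below is a minimal illustration, assuming an 8-second clip (inferred from $0.40 / $0.05, not stated in the summary); the function names `clip_cost` and `iteration_budget` are hypothetical helpers, not part of any SDK.

```python
from decimal import Decimal

# Approximate Veo 3.1 Light rate quoted in the talk summary (USD per second).
PRICE_PER_SECOND_USD = Decimal("0.05")


def clip_cost(duration_seconds: int,
              price_per_second: Decimal = PRICE_PER_SECOND_USD) -> Decimal:
    """Estimate the cost in USD of one generated clip."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return price_per_second * duration_seconds


def iteration_budget(budget_usd: Decimal, duration_seconds: int) -> int:
    """Return how many whole clips of the given length fit in a budget."""
    per_clip = clip_cost(duration_seconds)
    if per_clip == 0:
        raise ValueError("clip cost must be positive")
    return int(budget_usd // per_clip)


if __name__ == "__main__":
    # Assumed clip length: 8 seconds (an inference, see the note above).
    print(clip_cost(8))                       # 0.40
    print(iteration_budget(Decimal("10"), 8))  # 25
```

Using `Decimal` rather than `float` keeps the currency arithmetic exact, which matters once per-clip costs are summed over many prompt iterations.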
Lyria Real-Time uses predictive architecture for live music
Unlike diffusion-based models, this system generates music continuously with only 2 seconds of latency, enabling real-time style mixing and DJ-like transitions via live prompting.
DeepMind maintains high-velocity release cycles
GenMedia models ship updates monthly on average, while the broader DeepMind organization releases new features approximately every five days, including the recent Gemma 4 launch.
⚙️ Developer Platform Strategy 3 insights
Developer API bridges consumer and enterprise tiers
Positioned between AI Studio (consumer) and Vertex AI (enterprise), the Gemini Developer API offers simplified access while maintaining SDK compatibility for seamless migration to Vertex when production-grade controls are needed.
New service tiers optimize for cost vs. latency
Developers can choose Flex tier (50% discount, delayed processing) for batch jobs or Priority tier (2x cost, guaranteed fast track) for real-time applications, with automatic retry logic handling peak-load failures.
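The trade-off can be sketched in a few lines. This is an illustration only: the multipliers come from the figures above, while `tier_cost`, `call_with_retry`, and `TransientError` are hypothetical names, and the client-side backoff loop is an assumption about how peak-load failures might be handled, not a description of the API's built-in behavior.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Price multipliers relative to the standard rate, as quoted in the summary.
TIER_MULTIPLIER = {"flex": 0.5, "standard": 1.0, "priority": 2.0}


def tier_cost(base_cost_usd: float, tier: str) -> float:
    """Scale a standard-tier cost by the multiplier for the chosen tier."""
    if tier not in TIER_MULTIPLIER:
        raise ValueError(f"unknown tier: {tier!r}")
    return base_cost_usd * TIER_MULTIPLIER[tier]


class TransientError(Exception):
    """Hypothetical stand-in for an overloaded or rate-limited response."""


def call_with_retry(fn: Callable[[], T],
                    max_attempts: int = 4,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on TransientError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from many clients hitting the same peak.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test and lets a batch job on the Flex tier substitute a longer or scheduler-aware wait.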
File upload API abstracts cloud storage complexity
The Developer API automatically handles the bucket creation and ACL management that must be configured manually on Vertex AI, so uploaded files are directly accessible to models without any infrastructure setup.
📚 Multimodal Application Architecture 3 insights
Chat mode maintains stylistic consistency across long-form content
Leveraging Gemini's large context windows, the chat mode retains historical generation data (e.g., character descriptions from earlier chapters) to ensure visual consistency when illustrating books or serialized content.
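The idea of carrying earlier turns forward can be shown without any SDK. The sketch below is a simplified model of a chat session, assuming a hypothetical `generate` callable that stands in for the real model call; `IllustrationChat` and `Turn` are illustrative names, not part of the Gemini API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text), where role is "user" or "model"


@dataclass
class IllustrationChat:
    """Keeps the full conversation and resends it with every request."""

    # Hypothetical model call: receives all turns so far, returns a reply.
    generate: Callable[[List[Turn]], str]
    history: List[Turn] = field(default_factory=list)

    def send(self, message: str) -> str:
        """Append the user message, call the model with the whole history,
        and record the reply so later turns can refer back to it."""
        self.history.append(("user", message))
        reply = self.generate(list(self.history))
        self.history.append(("model", reply))
        return reply
```

Because every call sees the accumulated history, a character description given while illustrating chapter 1 is still in context when chapter 5 is illustrated, which is what keeps the look consistent.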
Structured output enforces predictable content pipelines
JSON schema constraints enable reliable parsing of model outputs, allowing automated systems to extract specific fields like character names, scene descriptions, and prompt metadata for downstream media generation.
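On the consuming side, a schema-constrained response can be validated before it drives image or video generation. This is a minimal sketch using only the standard library; the field names `character_names`, `scene_description`, and `image_prompt` are assumptions chosen to mirror the fields mentioned above, not the schema used in the workshop.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScenePlan:
    """One scene extracted from the model's JSON output (hypothetical shape)."""
    character_names: List[str]
    scene_description: str
    image_prompt: str


def parse_scene_plan(raw: str) -> ScenePlan:
    """Parse and validate one JSON object; raise ValueError on any mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    names = data.get("character_names")
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("character_names must be a list of strings")

    for key in ("scene_description", "image_prompt"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{key} must be a non-empty string")

    return ScenePlan(
        character_names=list(names),
        scene_description=data["scene_description"],
        image_prompt=data["image_prompt"],
    )
```

Failing fast here means a malformed response is caught before any paid media generation is triggered with an empty or missing prompt.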
Cost-effective prototyping for multimedia books
The demonstrated pipeline, which processes 'The Wind in the Willows', costs approximately $1 per run. It uses Nano Banana 2's free tier for image testing and reserves paid Veo generation for final video assets.
Bottom Line
Developers should use the Gemini Developer API's chat mode with structured JSON outputs to build cost-effective multimodal pipelines. Choose the Flex tier for batch processing and the Priority tier only when low latency is critical, and use the file upload API to avoid cloud storage configuration overhead.