How NVIDIA Blackwell and NVIDIA Dynamo Scale AI Agents for Production

NVIDIA AI Podcast

| Podcasts | June 29, 2026 | 248 views | 31:45

TL;DR

NVIDIA Blackwell runs up to 40x more concurrent AI agents per GPU than Hopper through its rack-scale NVL72 architecture and the Dynamo framework, a result that moves the yardstick for AI infrastructure from token throughput to agent concurrency.

🎯 The Benchmark Revolution

Traditional token metrics insufficient

Token-based measurements like cost per million tokens fail to capture end-to-end agentic trajectories involving tool calls, reasoning loops, and multi-step workflows.
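
A minimal sketch of why a token count alone undercounts an agentic trajectory. The `Step` type, the step durations, and the 20 tokens/second generation speed are illustrative assumptions, not figures from the benchmark, and prefill time is ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str              # "llm" for a generation step, "tool" for a tool call
    output_tokens: int = 0
    seconds: float = 0.0   # wall-clock time of a tool call (hypothetical values)


def summarize(trajectory, tokens_per_second):
    """Return (total output tokens, end-to-end seconds) for one trajectory."""
    total_tokens = 0
    total_seconds = 0.0
    for step in trajectory:
        if step.kind == "llm":
            total_tokens += step.output_tokens
            total_seconds += step.output_tokens / tokens_per_second
        else:
            # Tool calls add latency but no billable output tokens.
            total_seconds += step.seconds
    return total_tokens, total_seconds


# Hypothetical coding trajectory: generate, run a tool, generate, run a tool, finish.
trajectory = [
    Step("llm", output_tokens=400),
    Step("tool", seconds=5.0),
    Step("llm", output_tokens=600),
    Step("tool", seconds=10.0),
    Step("llm", output_tokens=200),
]
tokens, seconds = summarize(trajectory, tokens_per_second=20.0)
print(tokens, seconds)
```

A cost-per-million-tokens figure sees only the 1,200 output tokens here; the 15 seconds spent in tool calls appear only when the whole trajectory is timed.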

New Agent Perf standard

Artificial Analysis's AA Agent Perf benchmark measures real coding trajectories using DeepSeek V4, reporting the maximum number of concurrent agents per accelerator rather than raw token generation.

Concurrency over throughput

The critical question for agentic infrastructure shifts from tokens per second to how many useful agents can run simultaneously while maintaining responsive interactivity.
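
One way to read this metric, sketched under stated assumptions: sweep the concurrency level, record the speed each agent actually receives, and report the highest level that still meets the interactivity target. The function name and the sweep values below are hypothetical, not benchmark data.

```python
def max_concurrent_agents(per_agent_speed, target_tokens_per_second):
    """Largest concurrency whose measured per-agent speed meets the target.

    per_agent_speed maps a concurrency level to the tokens/second that each
    agent receives at that level. Returns 0 if no level meets the target.
    """
    passing = [
        level
        for level, speed in per_agent_speed.items()
        if speed >= target_tokens_per_second
    ]
    return max(passing, default=0)


# Hypothetical sweep: per-agent speed drops as more agents share the hardware.
measured = {8: 41.0, 16: 33.5, 32: 24.0, 64: 17.2, 128: 9.8}
best = max_concurrent_agents(measured, target_tokens_per_second=20.0)
print(best)
```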

⚡ Blackwell's Architectural Edge

GB300 NVL72 rack-scale design

Connecting 72 GPUs over 1,800 GB/s NVLink removes the network bottleneck between GPUs, so an MoE model like DeepSeek V4 can be distributed across 64 GPUs with 4 experts per GPU.
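
The placement described here can be sketched as a simple contiguous mapping. The total of 256 experts is an assumption derived from the two figures above (64 GPUs times 4 experts per GPU); the function name and the contiguous ordering are illustrative only.

```python
def place_experts(num_experts, num_gpus):
    """Assign experts to GPUs in contiguous, equally sized groups."""
    if num_experts % num_gpus != 0:
        raise ValueError("experts must divide evenly across GPUs")
    per_gpu = num_experts // num_gpus
    return {
        gpu: list(range(gpu * per_gpu, (gpu + 1) * per_gpu))
        for gpu in range(num_gpus)
    }


# Assumed expert count: 64 GPUs * 4 experts per GPU = 256.
placement = place_experts(num_experts=256, num_gpus=64)
print(placement[0])    # experts hosted on the first GPU
print(placement[63])   # experts hosted on the last GPU
```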

Massive performance gains

Blackwell achieves 50x higher tokens per second per GPU and 35x lower cost per million tokens than Hopper. At an interactivity target of 20 tokens/second, it supports 57 concurrent agents per GPU versus 1.5 on Hopper.
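
As a quick check on how the per-GPU agent counts relate to the headline "up to 40x" figure, the arithmetic can be worked through directly. The 1,000-agent fleet size is a hypothetical example, not a number from the episode.

```python
import math

blackwell_agents_per_gpu = 57
hopper_agents_per_gpu = 1.5

# Concurrency ratio implied by the two per-GPU figures.
ratio = blackwell_agents_per_gpu / hopper_agents_per_gpu

# GPUs needed to host a hypothetical fleet of 1,000 concurrent agents.
fleet = 1000
blackwell_gpus = math.ceil(fleet / blackwell_agents_per_gpu)
hopper_gpus = math.ceil(fleet / hopper_agents_per_gpu)

print(ratio, blackwell_gpus, hopper_gpus)
```

The ratio works out to 38x, consistent with the rounded "up to 40x" claim.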

Optimized for MoE communication

The architecture accelerates the all-to-all expert communication patterns critical for mixture-of-experts models, replacing slower 100 GB/s Ethernet links with high-bandwidth NVLink domains.
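
To make the all-to-all pattern concrete, the sketch below counts how many tokens each GPU must send to every other GPU when tokens are routed to experts hosted elsewhere. It assumes top-1 routing, contiguous expert placement, and a toy configuration of 4 GPUs with 2 experts each; production MoE models typically route each token to several experts.

```python
def all_to_all_counts(token_expert_ids, experts_per_gpu, num_gpus):
    """Build counts[src][dst]: tokens GPU src must send to GPU dst.

    token_expert_ids[src] lists the expert chosen for each token on GPU src.
    """
    counts = [[0] * num_gpus for _ in range(num_gpus)]
    for src, expert_ids in enumerate(token_expert_ids):
        for expert in expert_ids:
            dst = expert // experts_per_gpu   # GPU that hosts this expert
            counts[src][dst] += 1
    return counts


# Toy routing decisions: 4 GPUs, 4 tokens each, experts numbered 0..7.
routing = [[0, 5, 2, 7], [1, 1, 6, 3], [4, 0, 7, 2], [6, 5, 3, 0]]
counts = all_to_all_counts(routing, experts_per_gpu=2, num_gpus=4)

# Tokens that leave their own GPU are the ones that cross the interconnect.
cross_gpu = sum(
    counts[src][dst] for src in range(4) for dst in range(4) if src != dst
)
print(counts, cross_gpu)
```

In this toy case 12 of 16 tokens leave their GPU at every MoE layer, which is why the bandwidth of the link between GPUs dominates.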

🚀 Deploying with NVIDIA Dynamo

Purpose-built distributed framework

Dynamo provides multi-GPU and multi-node orchestration, intelligent routing, and KV cache management designed specifically for agentic inference, going beyond what a standalone engine such as TensorRT-LLM offers.
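
The idea behind cache-aware routing can be illustrated with a toy router that sends a request to the worker whose cached sequences share the longest prefix with it. This is a generic sketch and not Dynamo's API: the function names, worker names, and token IDs are hypothetical, and a production router would presumably also weigh worker load.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(request_tokens, worker_caches):
    """Return (worker, overlap) for the worker with the longest cached prefix.

    worker_caches maps a worker name to the token sequences it has cached.
    """
    best_worker, best_overlap = None, -1
    for worker, cached in worker_caches.items():
        overlap = max(
            (shared_prefix_len(request_tokens, seq) for seq in cached),
            default=0,
        )
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker, best_overlap


# Hypothetical cache contents, expressed as token IDs.
caches = {
    "worker-a": [[1, 2, 3, 4]],
    "worker-b": [[1, 2, 9]],
    "worker-c": [],
}
worker, overlap = pick_worker([1, 2, 3, 7, 8], caches)
print(worker, overlap)
```

Agent loops resend a growing conversation on every step, so landing on a worker that already holds most of the prefix avoids recomputing it.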

Disaggregated serving architecture

Optimal deployments separate prefill and decode workers, typically with 6 prefill-optimized workers feeding one large decode worker, to maximize throughput for agentic workloads.
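
The split between prefill and decode can be sketched as a small simulation: prompts are spread across six prefill workers, each prefill produces a KV-cache handle, and one decode worker batches all handed-off requests. This is a toy model under stated assumptions (round-robin assignment, one token per request per decode step, at least one output token per request), not Dynamo's scheduler; all names are hypothetical.

```python
from itertools import cycle

NUM_PREFILL_WORKERS = 6   # ratio taken from the text: 6 prefill workers, 1 decode worker


def run_disaggregated(requests):
    """Simulate prefill/decode separation.

    requests: list of (request_id, prompt_tokens, output_tokens).
    Returns (prefill assignments, tokens generated per request, decode steps).
    """
    prefill_workers = cycle(range(NUM_PREFILL_WORKERS))
    assignments = {}
    active = []

    # Prefill phase: process each prompt once and hand its KV cache to decode.
    for request_id, prompt_tokens, output_tokens in requests:
        assignments[request_id] = next(prefill_workers)
        kv_handle = {"request_id": request_id, "cached_tokens": prompt_tokens}
        active.append((kv_handle, output_tokens))

    # Decode phase: one worker emits one token per active request per step.
    generated = {kv["request_id"]: 0 for kv, _ in active}
    steps = 0
    while active:
        steps += 1
        still_active = []
        for kv, remaining in active:
            generated[kv["request_id"]] += 1
            kv["cached_tokens"] += 1          # KV cache grows with each new token
            if remaining > 1:
                still_active.append((kv, remaining - 1))
        active = still_active
    return assignments, generated, steps


requests = [(f"r{i}", 1000, 10 + i) for i in range(8)]
assignments, generated, steps = run_disaggregated(requests)
print(assignments, generated, steps)
```

Separating the phases lets each worker type be tuned for its own bottleneck: prefill is compute-heavy and bursty, while decode benefits from one large batch that stays resident.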

Advanced parallelism strategies

Data Expert Parallelism replicates attention layers across GPUs while partitioning experts, outperforming traditional Tensor Expert Parallelism for high-concurrency agent scenarios.
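
The difference between the two strategies can be sketched as a per-GPU layout. The sketch assumes that Tensor Expert Parallelism shards attention weights across GPUs with tensor parallelism while every GPU sees the whole batch, which the text above does not spell out. The function name and the model sizes are hypothetical.

```python
def per_gpu_layout(strategy, num_gpus, num_experts, attention_params, batch_size):
    """Describe what a single GPU holds and processes under each strategy."""
    experts_per_gpu = num_experts // num_gpus   # experts are partitioned in both
    if strategy == "DEP":
        # Attention is replicated; each GPU attends over its own slice of requests.
        return {
            "attention_params": attention_params,
            "experts": experts_per_gpu,
            "requests_in_attention": batch_size // num_gpus,
        }
    if strategy == "TEP":
        # Attention is sharded (assumed); every GPU works on every request.
        return {
            "attention_params": attention_params // num_gpus,
            "experts": experts_per_gpu,
            "requests_in_attention": batch_size,
        }
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical sizes: 8 GPUs, 32 experts, 8M attention parameters, 64 requests.
dep = per_gpu_layout("DEP", 8, 32, 8_000_000, 64)
tep = per_gpu_layout("TEP", 8, 32, 8_000_000, 64)
print(dep)
print(tep)
```

Under the data-parallel layout each GPU runs attention independently on its own requests, at the price of holding a full copy of the attention weights.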

Bottom Line

Organizations evaluating infrastructure should move from token-centric metrics to agent-concurrency benchmarks, and deploy Blackwell NVL72 systems with NVIDIA Dynamo's disaggregated serving to run up to 40x more concurrent agents per GPU than Hopper in production AI agent applications.

