CUDA: New Features and Beyond | NVIDIA GTC

| Podcasts | March 31, 2026 | 12.1 Thousand views | 44:27

TL;DR

This presentation outlines CUDA's evolution toward 'guaranteed asymmetric parallelism,' introducing Green Contexts to enable dynamic GPU resource partitioning for disaggregated AI inference workloads, while previewing future multi-node CUDA graphs that will orchestrate computations across entire data centers.

🔀 The Shift from Symmetric to Asymmetric Parallelism 4 insights

Traditional symmetric execution limits utilization

A conventional CUDA grid launch spreads one kernel across all 160 SMs of a Blackwell GPU, and launches execute one after another, so different tasks cannot run at the same time.

AI inference phases have opposing resource needs

Prefill is compute-bound and dominated by large matrix operations, while decode is bound by memory bandwidth, so provisioning both phases identically is inefficient.
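
The compute-bound versus bandwidth-bound split can be illustrated with a rough roofline-style estimate. This is a minimal sketch, not something from the talk: the `Gpu` figures, the model width of 8192, and the token counts are all hypothetical, and a single square matrix multiply stands in for a whole layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    peak_flops: float     # FLOP/s, hypothetical figure
    mem_bandwidth: float  # bytes/s, hypothetical figure

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) above which work is compute-bound.
        return self.peak_flops / self.mem_bandwidth


def matmul_intensity(tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for Y = X @ W, with X: tokens x d and W: d x d."""
    flops = 2 * tokens * d_model * d_model
    # Read the weights once, read the input activations, write the output.
    bytes_moved = bytes_per_elem * (d_model * d_model + 2 * tokens * d_model)
    return flops / bytes_moved


def classify(tokens: int, d_model: int, gpu: Gpu) -> str:
    if matmul_intensity(tokens, d_model) >= gpu.ridge_point:
        return "compute-bound"
    return "memory-bound"


gpu = Gpu(peak_flops=1.0e15, mem_bandwidth=4.0e12)  # assumed, not a real spec
prefill = classify(tokens=4096, d_model=8192, gpu=gpu)  # many prompt tokens at once
decode = classify(tokens=1, d_model=8192, gpu=gpu)      # one new token per step
print(prefill, decode)
```

Under these assumptions the weights are reused across thousands of tokens in prefill but only once per step in decode, which is why the two phases land on opposite sides of the ridge point.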

Disaggregation delivers 10x performance gains

Running prefill and decode workers on separately configured GPU partitions eliminates resource starvation and right-sizes hardware for each phase.

Dynamic orchestration manages unpredictable workloads

NVIDIA Dynamo orchestrates these disaggregated systems, dynamically balancing resources between context-heavy queries and token-generation-heavy reasoning tasks.

🟢 Green Contexts: Dynamic GPU Partitioning 4 insights

Green contexts bridge streams and MPS

This new mechanism sits between CUDA streams (too opportunistic) and Multi-Process Service (too static) to enable guaranteed asymmetric parallelism within a single process.

Sandboxed resource allocation without code changes

Developers create resource descriptors that partition the GPU's SMs (for example, dividing Blackwell's 160 SMs between two workloads). Kernels then run unmodified, unaware that they are confined to a partition, which enables true multiplexing.
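
The bookkeeping behind an SM partition can be sketched in a few lines. This is a plain Python model rather than the CUDA API: `partition_sms`, the workload names, and the rounding granularity of 8 are assumptions made for illustration only.

```python
def partition_sms(
    total_sms: int, requests: dict[str, int], granularity: int = 8
) -> tuple[dict[str, int], int]:
    """Round each request up to the granularity and check that the plan fits.

    Returns the per-workload SM counts and the number of SMs left unassigned.
    The granularity is an assumed parameter, not a documented hardware value.
    """
    plan: dict[str, int] = {}
    for name, wanted in requests.items():
        if wanted <= 0:
            raise ValueError(f"{name}: request must be positive")
        # Ceiling division, then scale back up to a multiple of the granularity.
        plan[name] = -(-wanted // granularity) * granularity
    used = sum(plan.values())
    if used > total_sms:
        raise ValueError(f"plan needs {used} SMs but only {total_sms} exist")
    return plan, total_sms - used


plan, unassigned = partition_sms(160, {"prefill": 100, "decode": 50})
print(plan, unassigned)
```

The model rejects oversubscribed plans up front, which mirrors the idea of a guaranteed partition as opposed to the best-effort sharing that streams provide.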

Graphs span multiple green contexts

CUDA graphs can now capture workflows targeting different green contexts, allowing single-launch orchestration of heterogeneous tasks across partitioned GPU resources.
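
One way to picture a graph that spans partitions is as a dependency graph whose nodes each carry a partition tag. The sketch below uses Python's standard `graphlib` and is only a scheduling model; the node names and the two partition labels are hypothetical, and no CUDA graph API is involved.

```python
from graphlib import TopologicalSorter

# node -> (partition it targets, nodes it depends on); all names are hypothetical
nodes = {
    "prefill_step_1": ("prefill", []),
    "prefill_step_2": ("prefill", ["prefill_step_1"]),
    "decode_step": ("decode", []),
    "merge_results": ("decode", ["prefill_step_2", "decode_step"]),
}


def execution_waves(nodes):
    """Group nodes into waves; nodes in one wave have no mutual dependencies."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in nodes.items()})
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append([(name, nodes[name][0]) for name in ready])
        sorter.done(*ready)
    return waves


waves = execution_waves(nodes)
for wave in waves:
    print(wave)
```

In this model the first wave contains one node per partition, which is the situation where partitioned resources allow the two tasks to make progress side by side within a single launch.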

Enables dynamic reconfiguration patterns

Green contexts support nested hierarchies, low-latency reservations, and dynamic repartitioning to respond to changing workload demands without restarting applications.
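
Dynamic repartitioning implies some policy that decides the new split. The talk summary does not describe one, so the following is a purely hypothetical policy: it divides SMs in proportion to pending work, rounds down to an assumed granularity, and keeps a minimum reservation for each side.

```python
def rebalance(
    total_sms: int,
    prefill_backlog: int,
    decode_backlog: int,
    granularity: int = 8,
    min_sms: int = 8,
) -> dict[str, int]:
    """Split SMs between prefill and decode in proportion to queued work."""
    if total_sms % granularity or min_sms % granularity:
        raise ValueError("total_sms and min_sms must be multiples of granularity")
    backlog = prefill_backlog + decode_backlog
    if backlog == 0:
        prefill = total_sms // 2  # nothing queued: fall back to an even split
    else:
        prefill = total_sms * prefill_backlog // backlog
    prefill = (prefill // granularity) * granularity
    # Keep at least min_sms reserved for each phase so neither side starves.
    prefill = max(min_sms, min(prefill, total_sms - min_sms))
    return {"prefill": prefill, "decode": total_sms - prefill}


context_heavy = rebalance(160, prefill_backlog=30, decode_backlog=10)
generation_heavy = rebalance(160, prefill_backlog=0, decode_backlog=50)
print(context_heavy, generation_heavy)
```

The minimum reservation is a design choice in this sketch: it trades a little peak throughput for a guarantee that a newly arriving request of either kind has somewhere to run before the next rebalance.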

🌐 Future: Data Center-Scale CUDA 3 insights

Graphs will span racks and data centers

NVIDIA aims to extend CUDA graphs beyond single nodes to orchestrate work across NVL72 racks and eventually data centers of 100,000+ GPUs, treating each as a unified compute fabric.

System-level naming and topology required

Multi-node execution requires CUDA to provide consistent resource naming and topology awareness across complex dragonfly networks, so that every node agrees on where each resource is located.

Centralized control without centralized bottlenecks

Future CUDA will enable single-controller orchestration of disaggregated workloads while maintaining fine-grained guarantees about where and when computations execute across the infrastructure.

Bottom Line

Adopt Green Contexts now to dynamically partition GPUs for asymmetric inference workloads, positioning applications for future data center-scale CUDA orchestration.
