CUDA: New Features and Beyond | NVIDIA GTC
TL;DR
This presentation outlines CUDA's evolution toward 'guaranteed asymmetric parallelism,' introducing Green Contexts to enable dynamic GPU resource partitioning for disaggregated AI inference workloads, while previewing future multi-node CUDA graphs that will orchestrate computations across entire data centers.
🔀 The Shift from Symmetric to Asymmetric Parallelism
Traditional symmetric execution limits utilization
A conventional CUDA grid launch spreads one kernel's identical work across all 160 SMs of a Blackwell GPU, and successive launches run one after another, which leaves no room for a different task to run alongside it.
AI inference phases have opposing resource needs
The prefill phase is compute-bound, dominated by large matrix operations, while the decode phase is bound by memory bandwidth, so provisioning both phases identically wastes resources.
Disaggregation delivers 10x performance gains
Running prefill and decode workers on separately configured GPU partitions right-sizes the hardware for each phase, so neither phase starves the other of resources.
Dynamic orchestration manages unpredictable workloads
NVIDIA Dynamo orchestrates these disaggregated systems, dynamically balancing resources between context-heavy queries and token-generation-heavy reasoning tasks.
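The contrast between compute-bound prefill and bandwidth-bound decode described above can be illustrated with a rough roofline estimate. The sketch below is not from the talk: the hardware figures in `HYPOTHETICAL_GPU`, the model width, and the token counts are all placeholder assumptions chosen for round numbers, and the cost model covers only a single square weight matrix applied to a batch of tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    peak_flops: float      # FLOP/s (assumed, not a real product spec)
    peak_bandwidth: float  # bytes/s (assumed, not a real product spec)

    @property
    def balance(self) -> float:
        """FLOPs the machine can perform per byte it can move."""
        return self.peak_flops / self.peak_bandwidth


def matmul_intensity(tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOP/byte) of multiplying `tokens` activation
    rows by one d_model x d_model weight matrix."""
    flops = 2 * tokens * d_model * d_model                 # multiply + add
    weight_bytes = d_model * d_model * bytes_per_elem      # weights read once
    activation_bytes = 2 * tokens * d_model * bytes_per_elem  # input + output
    return flops / (weight_bytes + activation_bytes)


def classify(tokens: int, d_model: int, machine: Machine) -> str:
    """Compare the workload's intensity with the machine balance point."""
    if matmul_intensity(tokens, d_model) > machine.balance:
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 1 PFLOP/s and 4 TB/s, i.e. 250 FLOP/byte.
HYPOTHETICAL_GPU = Machine(peak_flops=1.0e15, peak_bandwidth=4.0e12)

if __name__ == "__main__":
    # Prefill processes a whole prompt at once; decode emits one token per step.
    for phase, tokens in (("prefill", 4096), ("decode", 1)):
        intensity = matmul_intensity(tokens, 8192)
        verdict = classify(tokens, 8192, HYPOTHETICAL_GPU)
        print(f"{phase}: {intensity:.1f} FLOP/byte -> {verdict}")
```

Under these assumptions prefill lands at 2048 FLOP/byte and decode at roughly 1 FLOP/byte, on opposite sides of the 250 FLOP/byte balance point, which is why one uniform GPU configuration suits neither phase well.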
🟢 Green Contexts: Dynamic GPU Partitioning
Green contexts bridge streams and MPS
This new mechanism sits between CUDA streams, which allow concurrency only opportunistically, and the Multi-Process Service (MPS), whose partitioning is too static, to provide guaranteed asymmetric parallelism within a single process.
Sandboxed resource allocation without code changes
Developers create resource descriptors that partition the SMs (e.g., dividing Blackwell's 160 units into groups), and kernels run unmodified, unaware of the SM subset they are confined to, which enables true multiplexing.
Graphs span multiple green contexts
CUDA graphs can now capture workflows targeting different green contexts, allowing single-launch orchestration of heterogeneous tasks across partitioned GPU resources.
Enables dynamic reconfiguration patterns
Green contexts support nested hierarchies, low-latency reservations, and dynamic repartitioning to respond to changing workload demands without restarting applications.
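The partitioning idea in this section can be sketched as a toy model. The code below does not call CUDA: real green contexts are created through the CUDA driver API (for example `cuGreenCtxCreate`), whereas `partition_sms`, `SmPartition`, and the granularity of 8 SMs are hypothetical names and values introduced here purely for illustration. It only shows the bookkeeping of carving disjoint SM groups out of a fixed pool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SmPartition:
    name: str
    first_sm: int
    count: int

    @property
    def sms(self) -> range:
        """SM indices owned by this partition."""
        return range(self.first_sm, self.first_sm + self.count)


def partition_sms(
    total_sms: int, requests: dict[str, int], granularity: int = 8
) -> tuple[list[SmPartition], int]:
    """Carve disjoint, contiguous SM ranges out of `total_sms`.

    Each request is rounded up to a multiple of `granularity` (an assumed
    alignment rule). Returns the partitions plus the count of unassigned SMs.
    """
    partitions: list[SmPartition] = []
    next_sm = 0
    for name, wanted in requests.items():
        if wanted <= 0:
            raise ValueError(f"{name}: request must be positive")
        count = -(-wanted // granularity) * granularity  # ceiling to granularity
        if next_sm + count > total_sms:
            raise ValueError(f"{name}: not enough SMs left for {count}")
        partitions.append(SmPartition(name, next_sm, count))
        next_sm += count
    return partitions, total_sms - next_sm


if __name__ == "__main__":
    # 160 SMs is the Blackwell figure quoted in the text above.
    parts, leftover = partition_sms(160, {"prefill": 100, "decode": 50})
    for p in parts:
        print(f"{p.name}: SMs {p.first_sm}..{p.first_sm + p.count - 1}")
    print(f"unassigned: {leftover}")
```

In this model, dynamic repartitioning amounts to calling `partition_sms` again with a different request map, for instance shifting SMs from `prefill` to `decode` when token generation dominates. The work submitted to each partition stays the same; only the SM range it maps to changes.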
🌐 Future: Data Center-Scale CUDA
Graphs will span racks and data centers
NVIDIA aims to extend CUDA graphs beyond single nodes to orchestrate work across NVL72 racks (72 GPUs in one NVLink domain) and eventually across data centers of 100,000+ GPUs, treating each as a unified compute fabric.
System-level naming and topology required
Multi-node execution requires CUDA to provide consistent naming conventions and topology awareness across complex dragonfly networks so all nodes agree on resource locations.
Centralized control without centralized bottlenecks
Future CUDA will enable single-controller orchestration of disaggregated workloads while maintaining fine-grained guarantees about where and when computations execute across the infrastructure.
Bottom Line
Adopt Green Contexts now to dynamically partition GPUs for asymmetric inference workloads, positioning applications for future data center-scale CUDA orchestration.