CUDA: New Features and Beyond | NVIDIA GTC
TL;DR
This presentation outlines CUDA's evolution toward 'guaranteed asymmetric parallelism,' introducing Green Contexts to enable dynamic GPU resource partitioning for disaggregated AI inference workloads, while previewing future multi-node CUDA graphs that will orchestrate computations across entire data centers.
🔀 The Shift from Symmetric to Asymmetric Parallelism 4 insights
Traditional symmetric execution limits utilization
A conventional CUDA grid launch occupies all 160 SMs of a Blackwell GPU with a single workload, so distinct tasks must run one after another rather than sharing the device simultaneously.
AI inference phases have opposing resource needs
Prefill is compute-bound, dominated by large matrix multiplications over the whole prompt, while decode is memory-bandwidth-bound, re-reading model weights for every generated token; provisioning both phases identically starves one resource while idling the other.
Disaggregation delivers 10x performance gains
Running prefill and decode workers on separately configured GPU partitions eliminates resource starvation and right-sizes hardware for each phase.
Dynamic orchestration manages unpredictable workloads
NVIDIA Dynamo orchestrates these disaggregated systems, dynamically balancing resources between context-heavy queries and token-generation-heavy reasoning tasks.
🟢 Green Contexts: Dynamic GPU Partitioning 4 insights
Green contexts bridge streams and MPS
This new mechanism sits between CUDA streams (too opportunistic) and Multi-Process Service (too static) to enable guaranteed asymmetric parallelism within a single process.
Sandboxed resource allocation without code changes
Developers build resource descriptors that partition SMs (e.g., splitting Blackwell's 160 SMs into guaranteed groups), and existing kernels run unmodified inside their sandboxed partitions, enabling true multiplexing without code changes.
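A minimal sketch of this flow using the CUDA driver API's green-context entry points (available since CUDA 12.4). Error handling is omitted, the 112/48 SM split is illustrative, and the prefill/decode naming reflects the disaggregation use case discussed above rather than anything mandated by the API:

```cpp
#include <cuda.h>  // CUDA driver API; green contexts require 12.4+

// Sketch: carve a GPU's SMs into two guaranteed partitions and create
// a stream bound to each. Error checks omitted for brevity.
void make_partitions(CUdevice dev) {
    CUdevResource all_sms;
    cuDeviceGetDevResource(dev, &all_sms, CU_DEV_RESOURCE_TYPE_SM);

    // Split off one group of at least 112 SMs; the remainder stays in `rest`.
    CUdevResource big, rest;
    unsigned int nbGroups = 1;
    cuDevSmResourceSplitByCount(&big, &nbGroups, &all_sms, &rest, 0, 112);

    CUdevResourceDesc bigDesc, restDesc;
    cuDevResourceGenerateDesc(&bigDesc, &big, 1);
    cuDevResourceGenerateDesc(&restDesc, &rest, 1);

    CUgreenCtx prefillCtx, decodeCtx;
    cuGreenCtxCreate(&prefillCtx, bigDesc, dev, CU_GREEN_CTX_DEFAULT_STREAM);
    cuGreenCtxCreate(&decodeCtx, restDesc, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Kernels launched on these streams see only their partition's SMs,
    // with no changes to the kernels themselves.
    CUstream prefillStream, decodeStream;
    cuGreenCtxStreamCreate(&prefillStream, prefillCtx, CU_STREAM_NON_BLOCKING, 0);
    cuGreenCtxStreamCreate(&decodeStream, decodeCtx, CU_STREAM_NON_BLOCKING, 0);
}
```

The key property is that the kernels are oblivious: the same binaries that used to own the whole device now execute inside whichever partition their stream belongs to.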
Graphs span multiple green contexts
CUDA graphs can now capture workflows targeting different green contexts, allowing single-launch orchestration of heterogeneous tasks across partitioned GPU resources.
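One way this single-launch orchestration can be sketched is with standard multi-stream graph capture, where the two streams come from different green contexts (e.g., the hypothetical `prefillStream`/`decodeStream` partitioning described above). The fork/join event pattern is ordinary CUDA; that capture may span streams owned by distinct green contexts is the capability the talk describes:

```cpp
#include <cuda_runtime.h>

// Sketch: capture one graph whose work is split across two streams, each
// assumed to be bound to a different green context. A single
// cudaGraphLaunch then dispatches both halves onto their partitions.
cudaGraph_t capture_heterogeneous(cudaStream_t prefillStream,
                                  cudaStream_t decodeStream) {
    cudaEvent_t fork, join;
    cudaEventCreate(&fork);
    cudaEventCreate(&join);

    cudaGraph_t graph;
    cudaStreamBeginCapture(prefillStream, cudaStreamCaptureModeGlobal);

    // Fork capture into the second partition's stream.
    cudaEventRecord(fork, prefillStream);
    cudaStreamWaitEvent(decodeStream, fork, 0);

    // prefill_kernel<<<gridA, blockA, 0, prefillStream>>>(...); // compute-heavy half
    // decode_kernel<<<gridB, blockB, 0, decodeStream>>>(...);   // bandwidth-heavy half

    // Join back onto the origin stream before ending capture.
    cudaEventRecord(join, decodeStream);
    cudaStreamWaitEvent(prefillStream, join, 0);

    cudaStreamEndCapture(prefillStream, &graph);
    return graph;  // instantiate once, then replay with cudaGraphLaunch
}
```

Instantiating the graph once and replaying it each iteration keeps launch overhead off the critical path while the heterogeneous halves run concurrently on their reserved SM groups.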
Enables dynamic reconfiguration patterns
Green contexts support nested hierarchies, low-latency reservations, and dynamic repartitioning to respond to changing workload demands without restarting applications.
🌐 Future: Data Center-Scale CUDA 3 insights
Graphs will span racks and data centers
NVIDIA aims to extend CUDA graphs beyond single nodes to orchestrate work across NVLink-72 racks and eventually 100,000+ GPU data centers as unified compute fabrics.
System-level naming and topology required
Multi-node execution requires CUDA to provide consistent naming conventions and topology awareness across complex dragonfly networks so all nodes agree on resource locations.
Centralized control without centralized bottlenecks
Future CUDA will enable single-controller orchestration of disaggregated workloads while maintaining fine-grained guarantees about where and when computations execute across the infrastructure.
Bottom Line
Adopt Green Contexts now to dynamically partition GPUs for asymmetric inference workloads, positioning applications for future data center-scale CUDA orchestration.