Insights from NVIDIA Research | NVIDIA GTC
TL;DR
NVIDIA Research reveals architectural breakthroughs targeting 16,000 tokens/sec inference speeds through radical data movement reduction, while recounting how the 500-person team previously pioneered the company's AI, networking, and ray tracing transformations.
🏗️ Research Legacy & Impact
Dual-sided organization structure
NVIDIA Research operates 500 people across 'supply side' (GPU technology from circuits to programming) and 'demand side' (AI, robotics, quantum applications driving GPU adoption).
AI hardware genesis
Collaboration with Andrew Ng and Bryan Catanzaro ported deep learning from 16,000 CPU cores to 12 GPUs, creating cuDNN and establishing NVIDIA's AI leadership.
Networking pivot against executive resistance
DOE-funded research created NVLink and NVSwitch after CEO Jensen Huang initially rejected networking investment, with the technology migrating from research into the Pascal and Volta GPU generations.
RTX cores moonshot origin
The 'tree traversal unit' research project achieved 100x ray tracing speedup through specialized hardware, rebranded as RTX cores for real-time graphics.
⚡ The Inference Bottleneck
Latency versus throughput spectrum
Inference spans a spectrum: batch processing prioritizes tokens per dollar, while real-time agentic AI prioritizes interactivity, demanding 100 to 10,000+ tokens per second per user.
Communication dominates latency
For real-time inference, 89% of time is spent on communication versus only 11% on compute and memory, with 500 communication stages per token across 80 layers.
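The scale of the problem follows directly from the figures above. A rough budget, assuming today's ~350 tokens/sec per-user decode speed quoted later in this summary (all other numbers are the talk's own):

```python
# Back-of-envelope budget for the communication bottleneck.
# Inputs: ~350 tokens/sec per user (quoted later in this summary),
# 89% of token time spent on communication, 500 comm stages per token.

TOKENS_PER_SEC = 350     # assumed current per-user decode speed
COMM_FRACTION = 0.89     # share of token time spent communicating
COMM_STAGES = 500        # communication stages per token (80 layers)

token_time_s = 1.0 / TOKENS_PER_SEC
comm_time_s = token_time_s * COMM_FRACTION
per_stage_us = comm_time_s / COMM_STAGES * 1e6

print(f"token time:        {token_time_s * 1e3:.2f} ms")
print(f"comm time / token: {comm_time_s * 1e3:.2f} ms")
print(f"avg per stage:     {per_stage_us:.1f} us")
```

Roughly 5 µs per communication stage on average, which is why the targets later in this summary are 50 ns on-chip and ~100 ns per switch hop.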
Memory bandwidth constraints
Decode phase inference requires reading every model weight for each single token, creating a memory bandwidth bottleneck that limits throughput.
🔬 Hardware Architecture Innovations
SRAM-compute fusion
Placing arithmetic units directly at SRAM edges eliminates data movement by performing dot products immediately upon weight retrieval across tiled processing elements.
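The dataflow can be illustrated with a toy numerical model: partition the weight matrix across tiles, compute each partial dot product "where the weights live", and move only small partial sums between tiles. The dimensions and tile count below are invented for illustration:

```python
import numpy as np

# Toy model of compute-at-SRAM tiling: each tile owns a slice of the weight
# matrix (its local SRAM bank) and computes the partial product for that
# slice in place; only small partial sums leave the tile.
# Dimensions and tile count are illustrative assumptions.

rng = np.random.default_rng(0)
d_in, d_out, n_tiles = 512, 256, 8

W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)

# Each tile holds a contiguous block of input columns locally.
weight_tiles = np.split(W, n_tiles, axis=1)
x_slices = np.split(x, n_tiles)

# Per-tile partial dot products -- the weights never move off-tile.
partials = [tw @ xs for tw, xs in zip(weight_tiles, x_slices)]
y = np.sum(partials, axis=0)

assert np.allclose(y, W @ x)   # identical result to the monolithic matmul
```

The point of the sketch: correctness is unchanged, but the large operand (weights) stays stationary while only the d_out-sized partials travel, which is where the data-movement energy savings come from.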
Static scheduling for speed
Eliminating queuing, arbitration, and routing decisions enables 50 nanosecond on-chip communication latency by advancing activations over wires in pre-determined paths.
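The idea of static scheduling can be sketched as a compile-time timetable: every transfer gets a fixed (cycle, wire) slot, conflicts are resolved before runtime, and runtime latency becomes deterministic. The wires, cycles, and payloads below are invented:

```python
# Toy sketch of static scheduling: routing and arbitration are resolved at
# compile time into a fixed cycle-by-cycle timetable, so the runtime network
# needs no queues, arbiters, or routing logic.
# Wires, cycle counts, and payload names are illustrative inventions.

# "Compile time": assign every transfer a (cycle, wire) slot so that no
# wire is double-booked -- arbitration done once, ahead of time.
timetable = [
    # (cycle, wire,        payload)
    (0,      "pe0->pe1",   "act_a"),
    (0,      "pe2->pe3",   "act_b"),   # different wire, same cycle: fine
    (1,      "pe1->pe2",   "act_a"),
    (1,      "pe3->pe0",   "act_b"),
    (2,      "pe2->pe3",   "act_a"),
]

# Verify the schedule is conflict-free: one payload per wire per cycle.
slots = [(cycle, wire) for cycle, wire, _ in timetable]
assert len(slots) == len(set(slots)), "wire double-booked"

# Runtime latency is just the last scheduled cycle -- fully deterministic.
CYCLE_NS = 1.0   # assume a ~1 GHz transport clock (illustrative)
latency_ns = (max(cycle for cycle, _, _ in timetable) + 1) * CYCLE_NS
print(f"deterministic transfer latency: {latency_ns:.0f} ns")
```

Because nothing is decided at runtime, there is no queuing jitter: the hardware only needs wires and registers, which is what makes latencies on the order of tens of nanoseconds plausible.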
Low-latency off-chip links
Reducing bandwidth from 400 to 200 Gbps removes complex DSP and forward error correction, achieving ~100 nanosecond switch traversal versus current multi-microsecond delays.
3D DRAM stacking
Placing DRAM directly atop GPU dies with localized storage above each processing element eliminates data movement energy, targeting 10x reduction in joules per token.
🎯 Performance Targets
10,000+ token velocity goal
Current systems achieve ~350 tokens/sec while the research prototype targets 16,000 tokens/sec per user to enable real-time reasoning and tree-of-thought AI agents.
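Pure arithmetic on the numbers quoted here and the 80-layer figure from earlier shows how tight the target is:

```python
# What 16,000 tokens/sec per user implies, using only figures quoted in
# this summary: ~350 tokens/sec today and an 80-layer model.

current_tps = 350
target_tps = 16_000
layers = 80

speedup = target_tps / current_tps
token_budget_us = 1e6 / target_tps
per_layer_ns = token_budget_us * 1_000 / layers

print(f"required speedup:  {speedup:.0f}x")
print(f"per-token budget:  {token_budget_us:.1f} us")
print(f"per-layer budget:  {per_layer_ns:.0f} ns")
```

A ~46x speedup leaves only 62.5 µs per token, under 800 ns per layer including all communication, which is why microsecond-scale network hops are disqualifying at this target.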
Spatial KV cache distribution
Pipelining architecture keeps portions of the KV cache localized to specific chips, minimizing energy-intensive off-chip data movement for batch workloads.
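The locality argument can be sketched as a layer-to-chip assignment: in a pipelined layout each chip owns a contiguous block of layers, so those layers' KV entries are written and read only on that chip. The chip and layer counts are illustrative (80 layers matches the figure quoted earlier):

```python
# Sketch of spatially distributed KV cache: each chip in the pipeline owns
# a contiguous block of layers, so those layers' K/V entries live entirely
# in that chip's local memory. Chip count and token count are illustrative.

n_layers, n_chips = 80, 8
layers_per_chip = n_layers // n_chips

# Per-chip local KV storage: chip -> {layer -> list of (position, kv)}.
kv_cache = {chip: {} for chip in range(n_chips)}

def chip_for(layer):
    # Pipelined placement: layers 0-9 on chip 0, 10-19 on chip 1, ...
    return layer // layers_per_chip

def append_token(position, kv_entry):
    # During decode, every layer appends K/V for the new token; each write
    # lands in the owning chip's local memory and never crosses chips.
    for layer in range(n_layers):
        chip = chip_for(layer)
        kv_cache[chip].setdefault(layer, []).append((position, kv_entry))

for position in range(16):            # decode 16 tokens
    append_token(position, kv_entry=0)

# Each chip holds exactly its own 10 layers' cache -- nothing remote.
assert all(len(kv_cache[c]) == layers_per_chip for c in kv_cache)
print("KV entries per chip per layer:", len(kv_cache[0][0]))
```

Because attention for a given layer only ever reads that layer's cache, this placement keeps the large, growing KV state off the inter-chip links entirely.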
Bottom Line
The future of AI inference requires sacrificing raw bandwidth for ultra-low latency through static scheduling and 3D memory integration, potentially delivering 10x efficiency gains and 16,000 tokens/second to enable real-time agentic systems.
More from NVIDIA AI Podcast
Build, Optimize, Run: The Developer's Guide to Local Gen AI on NVIDIA RTX AI PCs
NVIDIA is driving a paradigm shift from cloud-based LLMs to local small language models (SLMs) on RTX GPUs, enabling personalized agentic AI with full data privacy. Through advanced quantization and tools like Ollama, developers can now run sophisticated coding agents and creative assistants entirely on local hardware with 11x performance gains over competitors.
The State of Open Source AI | NVIDIA GTC
Leading researchers and executives discuss how open source AI has evolved from a values-based movement into a viable commercial ecosystem, with companies like NVIDIA, Databricks, and Hugging Face demonstrating that open-weight models and transparent research can drive both industry innovation and sustainable business models through cloud services and foundation model programs.
AI Research Breakthroughs from NVIDIA Research (Hosted by Karoly of Two Minute Papers) | NVIDIA GTC
NVIDIA Research unveils breakthroughs shifting AI from imitation to exploration through Reinforcement Learning as Pre-training (RLP), open-sources the Alpamayo reasoning platform for autonomous vehicles, and demonstrates real-time generative world models and neural physics simulators enabling zero-shot sim-to-real robotics transfer.
CUDA: New Features and Beyond | NVIDIA GTC
This presentation outlines CUDA's evolution toward 'guaranteed asymmetric parallelism,' introducing Green Contexts to enable dynamic GPU resource partitioning for disaggregated AI inference workloads, while previewing future multi-node CUDA graphs that will orchestrate computations across entire data centers.