Insights from NVIDIA Research | NVIDIA GTC

Podcasts | April 06, 2026 | 16.7 thousand views | 38:18

TL;DR

NVIDIA Research reveals architectural breakthroughs targeting 16,000 tokens/sec inference speeds through radical data movement reduction, while recounting how the 500-person team previously pioneered the company's AI, networking, and ray tracing transformations.

🏗️ Research Legacy & Impact 4 insights

Dual-sided organization structure

NVIDIA Research's 500 people are split between a 'supply side' (GPU technology, from circuits to programming) and a 'demand side' (AI, robotics, and quantum applications that drive GPU adoption).

AI hardware genesis

A collaboration with Andrew Ng and Bryan Catanzaro ported a deep learning workload from 16,000 CPUs to 12 GPUs, which led to cuDNN and established NVIDIA's AI leadership.

Networking pivot against executive resistance

DOE-funded research created NVLink and NVSwitch after Jensen initially rejected networking investments, with technology migrating from research to Pascal and Volta GPUs.

RTX cores moonshot origin

The 'tree traversal unit' research project achieved 100x ray tracing speedup through specialized hardware, rebranded as RTX cores for real-time graphics.

The Inference Bottleneck 3 insights

Latency versus throughput spectrum

Batch processing sits at the throughput end of the spectrum and prioritizes tokens per dollar, while real-time agentic AI sits at the latency end and prioritizes interactivity, demanding 100 to 10,000+ tokens per second per user.

Communication dominates latency

For real-time inference, 89% of time is spent on communication versus only 11% on compute and memory, with 500 communication stages per token across 80 layers.
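One way to read the 89% / 11% split is through an Amdahl-style estimate: if only the communication share gets faster, the non-communication share caps the overall gain. The sketch below uses only the 89% figure from the summary; the function name and the speedup factors tried are illustrative assumptions, not numbers from the talk.

```python
def speedup_from_faster_communication(comm_fraction: float, comm_speedup: float) -> float:
    """Amdahl-style overall speedup when only the communication share gets faster."""
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if comm_speedup <= 0:
        raise ValueError("comm_speedup must be positive")
    other_fraction = 1.0 - comm_fraction  # compute + memory share, left unchanged
    return 1.0 / (other_fraction + comm_fraction / comm_speedup)


COMM_FRACTION = 0.89  # share of per-token time spent on communication (from the summary)

# Hypothetical communication speedups, chosen only for illustration.
for factor in (2, 10, 100):
    overall = speedup_from_faster_communication(COMM_FRACTION, factor)
    print(f"communication {factor}x faster -> overall {overall:.2f}x")

# Limit as communication cost goes to zero: 1 / (1 - 0.89), roughly 9x.
upper_bound = 1.0 / (1.0 - COMM_FRACTION)
print(f"upper bound from communication alone: {upper_bound:.2f}x")
```

Under this simple model, communication improvements alone top out near 9x, so larger gains would also need the compute and memory share to shrink.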

Memory bandwidth constraints

Decode phase inference requires reading every model weight for each single token, creating a memory bandwidth bottleneck that limits throughput.
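The bound this implies can be sketched with simple arithmetic: if every weight must be read once per generated token, a single user's decode rate cannot exceed memory bandwidth divided by the size of the weights. The model size, precision, and bandwidth below are hypothetical values chosen for illustration, not figures from the talk, and the estimate assumes a dense model at batch size 1.

```python
def decode_tokens_per_sec_upper_bound(
    num_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_sec: float,
) -> float:
    """Upper bound on single-user decode rate when all weights are read once per token."""
    weight_bytes = num_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / weight_bytes


# Hypothetical example: 70e9 parameters, 1 byte each, 8e12 bytes/s of memory bandwidth.
bound = decode_tokens_per_sec_upper_bound(70e9, 1.0, 8e12)
print(f"bandwidth-bound decode rate: {bound:.1f} tokens/sec")

# Halving the bytes per parameter doubles the bound, which is one reason
# lower-precision weights help decode speed.
print(f"at 0.5 bytes/param: {decode_tokens_per_sec_upper_bound(70e9, 0.5, 8e12):.1f} tokens/sec")
```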

🔬 Hardware Architecture Innovations 4 insights

SRAM-compute fusion

Placing arithmetic units directly at SRAM edges eliminates data movement by performing dot products immediately upon weight retrieval across tiled processing elements.
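As a software analogy only, the idea can be sketched as a tiled matrix-vector product in which each processing element keeps its block of weights local and emits only partial sums. The class and function names are hypothetical, and this models the data flow rather than the actual hardware design.

```python
from dataclasses import dataclass


@dataclass
class ProcessingElement:
    """One tile: holds a block of weights and computes dot products next to them."""
    row_start: int
    col_start: int
    weights: list  # block of rows; stays local to this tile and is never shipped out

    def partial_products(self, activations):
        """Dot each local weight row with the matching slice of the activations."""
        width = len(self.weights[0])
        local = activations[self.col_start:self.col_start + width]
        return [sum(w * x for w, x in zip(row, local)) for row in self.weights]


def tile_matrix(matrix, tile_rows, tile_cols):
    """Split a weight matrix into blocks, one per processing element."""
    pes = []
    for r in range(0, len(matrix), tile_rows):
        for c in range(0, len(matrix[0]), tile_cols):
            block = [row[c:c + tile_cols] for row in matrix[r:r + tile_rows]]
            pes.append(ProcessingElement(r, c, block))
    return pes


def tiled_matvec(pes, activations, num_rows):
    """Only activations go in and partial sums come out; weights never move."""
    out = [0.0] * num_rows
    for pe in pes:
        for i, partial in enumerate(pe.partial_products(activations)):
            out[pe.row_start + i] += partial
    return out


W = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 1, 1, 1]
pes = tile_matrix(W, tile_rows=1, tile_cols=2)
result = tiled_matvec(pes, x, num_rows=len(W))
print(result)
```

With the 2x4 example matrix and all-ones activations, the four tiles together reproduce the ordinary matrix-vector product, `[10.0, 26.0]`.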

Static scheduling for speed

Eliminating queuing, arbitration, and routing decisions enables 50 nanosecond on-chip communication latency by advancing activations over wires in pre-determined paths.
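A toy model of the idea: if every transfer's path and start cycle are fixed ahead of time, link conflicts can be resolved once at "compile time", so nothing has to queue or arbitrate while the system runs. The greedy scheduler below is an illustrative stand-in, not the method described in the talk, and the link names and one-cycle-per-hop timing are assumptions.

```python
def build_static_schedule(transfers):
    """Assign each transfer a fixed start cycle so no link is used twice in one cycle.

    transfers: list of (name, path), where path is an ordered list of link ids and
    each hop is assumed to take exactly one cycle.
    """
    reserved = set()  # (cycle, link) slots already claimed
    schedule = {}
    for name, path in transfers:
        start = 0
        # Slide the start cycle forward until every hop lands on a free slot.
        while any((start + hop, link) in reserved for hop, link in enumerate(path)):
            start += 1
        for hop, link in enumerate(path):
            reserved.add((start + hop, link))
        schedule[name] = start
    return schedule


def arrival_cycle(schedule, transfers, name):
    """Arrival time is known in advance: start cycle plus number of hops."""
    path = dict(transfers)[name]
    return schedule[name] + len(path)


transfers = [
    ("a", ["L0", "L1"]),
    ("b", ["L1", "L2"]),
    ("c", ["L0", "L2"]),
]
schedule = build_static_schedule(transfers)
for name, _ in transfers:
    print(name, "starts at cycle", schedule[name],
          "and arrives at cycle", arrival_cycle(schedule, transfers, name))
```

Because every slot is reserved in advance, each transfer's latency is deterministic, which is the property the static approach trades flexibility for.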

Low-latency off-chip links

Reducing bandwidth from 400 to 200 Gbps removes complex DSP and forward error correction, achieving ~100 nanosecond switch traversal versus current multi-microsecond delays.

3D DRAM stacking

Placing DRAM directly atop GPU dies with localized storage above each processing element eliminates data movement energy, targeting 10x reduction in joules per token.

🎯 Performance Targets 2 insights

10,000+ token velocity goal

Current systems achieve ~350 tokens/sec while the research prototype targets 16,000 tokens/sec per user to enable real-time reasoning and tree-of-thought AI agents.
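Converting these rates into per-token time budgets shows how tight the target is. The sketch below also divides the budget by the 500 communication stages per token quoted earlier in this summary. That division is a back-of-envelope assumption: it treats the stages as strictly serial and consuming the whole budget, which the source does not state.

```python
def per_token_budget_us(tokens_per_sec: float) -> float:
    """Time available to produce one token, in microseconds."""
    return 1e6 / tokens_per_sec


CURRENT_TOKENS_PER_SEC = 350     # from the summary
TARGET_TOKENS_PER_SEC = 16_000   # from the summary
COMM_STAGES_PER_TOKEN = 500      # from the summary; assumed serial for this estimate

current_us = per_token_budget_us(CURRENT_TOKENS_PER_SEC)
target_us = per_token_budget_us(TARGET_TOKENS_PER_SEC)
per_stage_ns = target_us * 1000 / COMM_STAGES_PER_TOKEN

print(f"current budget: {current_us:.0f} us per token")
print(f"target budget:  {target_us:.1f} us per token")
print(f"target per-stage budget if all stages were serial: {per_stage_ns:.0f} ns")
```

Under that serial assumption, the target leaves about 125 ns per communication stage, the same order of magnitude as the 50 to 100 ns on-chip and switch latencies described above.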

Spatial KV cache distribution

Pipelining architecture keeps portions of the KV cache localized to specific chips, minimizing energy-intensive off-chip data movement for batch workloads.
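One plausible reading, sketched below, is a layer-wise pipeline: each chip owns a contiguous range of layers and stores the KV cache for those layers only, so cached keys and values never leave the chip that uses them. The layer-wise split, the chip count, and the attention dimensions are assumptions made for illustration; only the 80-layer figure appears in the summary.

```python
def assign_layers(num_layers: int, num_chips: int):
    """Give each chip a contiguous range of layers, spreading any remainder evenly."""
    base, extra = divmod(num_layers, num_chips)
    ranges, start = [], 0
    for chip in range(num_chips):
        count = base + (1 if chip < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


def kv_bytes_per_chip(layer_ranges, context_tokens, kv_heads, head_dim, bytes_per_value):
    """KV cache held locally by each chip: keys and values for its own layers only."""
    per_layer = 2 * kv_heads * head_dim * bytes_per_value * context_tokens
    return [len(r) * per_layer for r in layer_ranges]


# 80 layers is from the summary; 8 chips and the attention shape are hypothetical.
layer_ranges = assign_layers(num_layers=80, num_chips=8)
local_bytes = kv_bytes_per_chip(
    layer_ranges, context_tokens=1000, kv_heads=8, head_dim=128, bytes_per_value=2
)
for chip, (layers, size) in enumerate(zip(layer_ranges, local_bytes)):
    print(f"chip {chip}: layers {layers.start}-{layers.stop - 1}, "
          f"local KV cache {size / 1e6:.2f} MB")
```

In this arrangement only the activations passed between adjacent pipeline stages cross chip boundaries.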

Bottom Line

The future of AI inference requires sacrificing raw bandwidth for ultra-low latency through static scheduling and 3D memory integration, potentially delivering 10x efficiency gains and 16,000 tokens/second to enable real-time agentic systems.
