Accelerate AI through Open Source Inference | NVIDIA GTC

| Podcasts | April 11, 2026 | 2.29 Thousand views | 48:21

TL;DR

Industry leaders from NVIDIA, Hugging Face, Mistral AI, Black Forest Labs, and Lightricks discuss how open-source inference optimization—spanning quantization, latent compression, and Mixture of Experts architectures—is enabling both massive trillion-parameter models and efficient edge deployment while driving the shift toward sovereign AI and local data control.

Inference Optimization Techniques 3 insights

Latent space compression outperforms quantization limits

Deeply compressed latent spaces in diffusion models reduce token overhead more effectively than post-hoc quantization, as seen in Lightricks' LTX models and NVIDIA's DCGM/DCV research.

Hardware-software co-design drives efficiency gains

Innovations like NVIDIA's FP4 precision and custom CUDA kernels from Deci AI complement algorithmic advances such as speculative decoding to maximize tokens per watt.

Text generation may shift beyond autoregressive patterns

Patrick von Platen suggests diffusion mechanisms could eventually replace autoregressive token prediction for language models, similar to current image generation paradigms.

📊 Model Scaling and Architecture 3 insights

Ecosystem diverging toward extreme scale and edge efficiency

While trillion-parameter models like Qwen2.5 emerge for complex reasoning, Hugging Face hosts over 2.5 million open models including small dense variants optimized for local laptops and phones.

Mixture of Experts dominates high-throughput serving

MoE architectures like Mistral 7B-Instruct-v0.2-MoE activate only fractions of parameters per request, enabling efficient parallel processing despite memory constraints.

Distillation creates model families from single large bases

Tim Dockhorn notes that Black Forest Labs easily distills large models into multiple sizes, allowing SDXL to maintain 2 million monthly downloads two years post-release.

🛡️ Sovereign AI and Open Infrastructure 2 insights

Open-source inference enables data sovereignty

Patrick von Platen emphasizes that inference servers like vLLM and SGLang allow companies to deploy models locally without transferring sensitive data to external APIs.

Quantization democratizes access to large models

Lightricks' 19-billion-parameter LTX 2 model runs on consumer GPUs with 6GB VRAM using 4-bit quantization, enabling community-driven optimization that feeds back into data center efficiency.

🎨 Multimodal and Consumer Applications 2 insights

Iterative reduction enables real-time media generation

Black Forest Labs distills diffusion processes from 20-30 iterations down to 4-8 steps to support interactive video workflows where users need rapid creative iteration.

AI services evolve into orchestrated multi-model systems

Jeff Boudier explains that modern applications combine specialized models—such as Flux for image generation, classifiers for routing, and guardrails for safety—rather than relying on single monolithic LLMs.

Bottom Line

Organizations should invest in open-source inference infrastructure featuring quantization, latent compression, and MoE support to deploy AI with full data sovereignty while optimizing for both massive cloud workloads and efficient consumer-edge execution.

More from NVIDIA AI Podcast

View all
Build Video Analytics AI Agents with Skills
59:53
NVIDIA AI Podcast NVIDIA AI Podcast

Build Video Analytics AI Agents with Skills

NVIDIA introduces the Video Search and Summarization (VSS) blueprint for building vision AI agents that process billions of camera streams using vision language models and a new 'skills' framework, enabling deep video search and summarization 60x faster than manual review.

15 days ago · 9 points
Ask the Experts: Nemotron 3 Nano Omni | Nemotron Labs
48:56
NVIDIA AI Podcast NVIDIA AI Podcast

Ask the Experts: Nemotron 3 Nano Omni | Nemotron Labs

NVIDIA researchers detail the development of Nemotron 3 Nano Omni, explaining how they evolved a text-only model into a multimodal system capable of processing vision, audio, and video through progressive training stages while maintaining the hybrid Mamba-Transformer architecture.

16 days ago · 10 points
Apr 14 - Jetson AI Lab Research Group Call - Tensor RT Edge LLM on Jetson & Culture
51:38
NVIDIA AI Podcast NVIDIA AI Podcast

Apr 14 - Jetson AI Lab Research Group Call - Tensor RT Edge LLM on Jetson & Culture

NVIDIA researchers Lynn Chai and Luc introduce TensorRT Edge LLM, a purpose-built inference engine for deploying large language models on Jetson edge devices, showcasing NVFP4 quantization and speculative decoding techniques that achieve up to 7x faster prefill speeds and 500 tokens per second generation while previewing a simplified vLLM-style Python API coming soon.

23 days ago · 10 points
March 10 - Jetson AI Lab Research Group Call - Lightning talks
55:28
NVIDIA AI Podcast NVIDIA AI Podcast

March 10 - Jetson AI Lab Research Group Call - Lightning talks

This Jetson AI Lab Research Group call features lightning talks on open-source hardware for remote Jetson access, a real-time emotional AI engine for robots running entirely on Jetson Nano, and updates to the Jetson AI Lab model repository with new performance benchmarks and deployment guides.

24 days ago · 8 points