Ask the Experts: Nemotron 3 Nano Omni | Nemotron Labs

| Podcasts | May 12, 2026 | 1.23K views | 48:56

TL;DR

NVIDIA researchers detail the development of Nemotron 3 Nano Omni, explaining how they evolved a text-only model into a multimodal system capable of processing vision, audio, and video through progressive training stages while maintaining the hybrid Mamba-Transformer architecture.

🏗️ Architecture & Design 3 insights

Hybrid Mamba-Transformer foundation preserved

The model retains the 17-layer hybrid architecture of Nemotron Nano v3 to maintain text capabilities while adding vision and audio support.

Interleaved multimodal token processing

Vision, text, and audio tokens are interleaved before entering the LLM, enabling the model to attend to all modalities simultaneously.
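A minimal sketch of what interleaving might look like, assuming each encoder has already produced per-modality embeddings; the segment names and toy shapes here are illustrative, not NVIDIA's implementation:

```python
# Hypothetical sketch: flatten (modality, embeddings) segments, in document
# order, into one sequence so self-attention spans all modalities at once.
import numpy as np

def interleave(segments):
    """Concatenate per-modality embedding segments in their original order."""
    return np.concatenate([emb for _, emb in segments], axis=0)

d = 8  # toy embedding dimension
segments = [
    ("text",   np.ones((3, d))),   # prompt prefix tokens
    ("vision", np.ones((5, d))),   # image patch tokens from the vision encoder
    ("text",   np.ones((2, d))),   # text between modalities
    ("audio",  np.ones((4, d))),   # tokens from the audio encoder
]
seq = interleave(segments)
print(seq.shape)  # (14, 8): one flat sequence the LLM attends to jointly
```

Because the result is a single flat sequence, no cross-attention bridge between modalities is needed; ordinary self-attention covers them all.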

Specialized encoders for new modalities

The architecture incorporates a C-RADIO encoder for vision and a dedicated Perceiver-style encoder for audio inputs.

🎯 Training Methodology 3 insights

Progressive staged training approach

Training progressed from document processing (V1) to grounding/QA (V2) to audio-visual (Omni) stages to prevent catastrophic forgetting of previous capabilities.
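The staged curriculum can be sketched as a sequence of stages that each resume from the previous checkpoint while replaying earlier data; the stage names follow the summary, but the trainer API below is hypothetical:

```python
# Hedged sketch of a staged curriculum: V1 documents -> V2 grounding/QA
# -> Omni audio-visual. Each stage mixes in earlier-stage data to help
# mitigate catastrophic forgetting. `train_stage` is a placeholder.
STAGES = [
    ("V1",   ["document_ocr", "text"]),          # text skills preserved
    ("V2",   ["grounding", "vqa", "text"]),      # replay text data
    ("Omni", ["audio_visual", "vqa", "text"]),   # replay earlier tasks
]

def run_curriculum(train_stage, checkpoint=None):
    for name, data_mix in STAGES:
        # Each stage initializes from the previous stage's checkpoint.
        checkpoint = train_stage(name, data_mix, init_from=checkpoint)
    return checkpoint

# Toy trainer that just records the order stages run in:
log = []
final = run_curriculum(lambda n, m, init_from: log.append((n, tuple(m))) or n)
print([name for name, _ in log], final)
```

The key design choice this illustrates is that later stages never start from scratch: new modalities are layered onto a checkpoint that already carries the earlier capabilities.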

Synthetic data generation pipelines

Researchers generated fully synthetic question-answer pairs for underrepresented tasks and created targeted datasets to improve performance on specific benchmarks.

Reinforcement learning optimization

After supervised fine-tuning, post-training stages applied DPO for preference optimization and GRPO to further improve reasoning and alignment.
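For context on the DPO step, here is a minimal sketch of the preference objective, assuming per-sequence log-probabilities are already computed; this illustrates the standard DPO loss, not Nemotron's actual training code:

```python
# Illustrative DPO loss: push the policy to prefer the chosen response
# more strongly than a frozen reference model does. Inputs are summed
# log-probabilities of whole responses; values below are made up.
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * (policy log-ratio minus reference log-ratio)))."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree, the margin is 0 and the loss is log(2):
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))  # 0.6931
```

GRPO, by contrast, is a reinforcement-learning method that scores groups of sampled responses against each other, which is why the summary lists it as a separate post-training stage.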

🎬 Core Capabilities 3 insights

Audio-visual captioning and summarization

The model excels at generating holistic, scene-by-scene summaries of videos that integrate both audio and visual information into coherent reports.

Cross-modal temporal reasoning

It can correlate events across audio and visual streams, such as identifying what happens visually when specific audio cues occur in a video.

Fine-grained entity interaction analysis

The model performs strongly at tracking detailed human-human and human-object interactions throughout complex video sequences.

📊 Benchmark Performance 2 insights

Long context and OCR leadership

The model ranks at the top of MMLongBench for extended-sequence reasoning and remains strong on OCRBench v2.

Dramatic GUI understanding improvements

ScreenSpot-Pro scores improved from single digits to approximately 60, with competitive first-time performance on OSWorld.

Bottom Line

Nemotron 3 Nano Omni achieves advanced multimodal understanding by preserving its hybrid text architecture while progressively adding vision and audio through carefully staged training and synthetic data generation, making it ideal for detailed video analysis and cross-modal reasoning tasks.
