Apr 14 - Jetson AI Lab Research Group Call - TensorRT Edge LLM on Jetson & Culture
TL;DR
NVIDIA researchers Lynn Chai and Luc introduce TensorRT Edge LLM, a purpose-built inference engine for deploying large language models on Jetson edge devices. They showcase NVFP4 quantization and speculative decoding techniques that reach up to 7x faster prefill and nearly 500 tokens per second of generation, and preview a simplified vLLM-style Python API coming soon.
🚀 TensorRT Edge LLM Platform Strategy
Embedded-first architecture vs datacenter
Unlike TensorRT-LLM, which targets multi-node data centers, TRT Edge LLM is designed specifically for resource-constrained NVIDIA embedded platforms including Jetson Orin, Thor, DGX Spark, and GeForce GPUs, with minimal dependencies and predictable latency.
Open-source C++ runtime with Python bindings
The engine provides a production-grade C++ runtime optimized for real-time applications with automotive safety options, featuring KV cache reuse, paged KV cache, and LoRA support for dynamic switching between use cases.
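A paged KV cache is easiest to picture as a block table that maps each sequence to fixed-size cache pages, so memory is allocated on demand and shared prefixes can be reused across requests. A toy Python sketch of the idea (illustrative only; the class and constants here are invented, not the TRT Edge LLM runtime):

```python
# Toy paged KV cache: each sequence owns a list of fixed-size physical
# pages, so memory grows on demand and prefix pages can be shared.
# Illustrative only -- not the actual TRT Edge LLM C++ runtime.
BLOCK_SIZE = 16  # tokens per page (value chosen for illustration)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical pages
        self.block_tables = {}                      # seq_id -> [block ids]

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical page holding position `pos` of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                   # page full: allocate a new one
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE]

    def fork(self, parent: int, child: int) -> None:
        """KV cache reuse: a new request shares its parent's prefix pages
        (real runtimes reference-count shared pages; omitted here)."""
        self.block_tables[child] = list(self.block_tables[parent])
```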
Comprehensive model support
Currently supports Llama, Qwen, NVIDIA Nemotron, and Cosmos, with Gemma 4 and Nemotron 30B on the roadmap, plus planned native integration with NVIDIA's TensorRT Model Optimizer (TRT MO).
⚡ Performance Optimizations & Benchmarks
NVFP4 quantization delivers 2-7x speedup
NVFP4 quantization leverages specialized tensor cores to achieve up to 7x faster prefill compared to INT4 AWQ, while maintaining competitive generation speeds of approximately 50 tokens/sec for 8B models and 300 tokens/sec for 0.6B models.
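Given the planned TRT MO integration, producing an NVFP4 checkpoint today would follow the standard Model Optimizer post-training quantization flow. A minimal sketch, assuming a recent nvidia-modelopt release that ships an NVFP4 config (the config name and example model are assumptions):

```python
# Sketch of NVFP4 post-training quantization with TensorRT Model Optimizer.
# Assumes mtq.NVFP4_DEFAULT_CFG exists in your installed nvidia-modelopt;
# check your version's docs for the exact config name.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # example model, not from the talk
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # A few representative prompts so activation ranges can be calibrated.
    for prompt in ["Explain paged KV caches.", "Plan a pick-and-place task."]:
        m(**tokenizer(prompt, return_tensors="pt"))

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```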
Speculative decoding with Eagle-3 algorithm
Edge-optimized speculative decoding provides 3-4x performance gains on small batch sizes, reaching nearly 500 tokens per second with spec size 8; native Multi-Token Prediction (MTP) support for Qwen 3.5 is planned.
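The speedup comes from a cheap draft model proposing several tokens that the expensive target model then verifies in a single pass. A self-contained toy of the greedy accept/verify loop (Eagle-3 itself drafts from the target's own hidden states, but the loop has the same shape):

```python
# Toy greedy speculative decoding: the draft guesses SPEC_SIZE tokens
# ahead; the target checks them and keeps the longest matching prefix,
# so each expensive target step can emit up to SPEC_SIZE tokens.
SPEC_SIZE = 8  # the "spec size 8" setting from the benchmark

def speculative_decode(target_next, draft_next, prompt, max_new_tokens):
    """target_next / draft_next: fn(token_list) -> next token (greedy)."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new_tokens:
        # 1) Draft proposes SPEC_SIZE tokens autoregressively (cheap).
        draft = []
        for _ in range(SPEC_SIZE):
            draft.append(draft_next(tokens + draft))
        # 2) Target verifies every position (one batched pass in practice).
        for i, tok in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if tok != expected:
                tokens.extend(draft[:i] + [expected])  # keep prefix + fix-up
                break
        else:
            tokens.extend(draft)  # all SPEC_SIZE guesses accepted
    return tokens[: len(prompt) + max_new_tokens]
```

Output quality is unchanged because every emitted token is one the target model would have produced; the draft only changes how many tokens come out per target pass.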
MLPerf validation
The team showcased these results in MLPerf inference benchmarks, demonstrating the engine's capability for high-throughput edge AI workloads.
🛠️ Developer Experience & Roadmap
Current ONNX-based workflow
Today's workflow requires exporting models to ONNX on x86 hosts, then building TensorRT engines on the target device with specific plugins for Jetson Orin, Thor, or DGX Spark (FP8/FP4 supported on Thor and Spark only).
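The host-side half of that workflow is a conventional Hugging Face-to-ONNX export; a minimal sketch using Optimum (TRT Edge LLM's own export tooling and the on-device engine build with platform plugins may differ and are not shown):

```python
# Host-side (x86) step: export a causal LM to ONNX with HF Optimum.
# The on-device TensorRT engine build with the Orin/Thor/Spark plugins
# happens separately on the Jetson and is not shown here.
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",  # example model, not from the talk
    export=True,                   # convert the PyTorch weights to ONNX
)
model.save_pretrained("qwen2.5-0.5b-onnx")  # copy this folder to the device
```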
vLLM-style Python API coming
A high-level Python API under development will enable one-line deployment with automatic model downloading, ONNX export, engine building, and artifact caching directly on the edge device.
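The API has not shipped, so its exact surface is unknown; a vLLM-style one-liner would plausibly read like the following (package, class, and argument names are all hypothetical):

```python
# Hypothetical sketch of the announced vLLM-style API; the real package,
# class, and argument names are not yet public and are invented here.
from trt_edge_llm import LLM  # hypothetical import

# One constructor call would download the model, export it to ONNX,
# build the TensorRT engine for this device, and cache the artifacts.
llm = LLM("meta-llama/Llama-3.1-8B-Instruct", quantization="nvfp4")
print(llm.generate("Describe the objects on the table.")[0])
```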
OpenAI-compatible server
An upcoming HTTP server will support streaming, chat completions, ASR/TTS workflows, and multimodal interactions, enabling personal AI agents and robotics applications with multi-turn reasoning and KV cache reuse.
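Because the server will speak the OpenAI protocol, existing clients should only need the base URL changed; for example, with the stock openai Python package (the port and served model name below are assumptions):

```python
# Streaming chat against an OpenAI-compatible server on the device.
# Port and model name are assumed; only base_url differs from cloud use.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What do you see in the scene?"}],
    stream=True,  # token streaming, as the roadmap promises
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```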
🚗 Safety & Automotive Applications
Safety certification roadmap
NVIDIA is working with select automotive customers to deliver safety-certifiable features by 2026, with the long-term goal of certifying all TRT Edge LLM features for safety-critical applications.
Robotics and agentic workflows
The engine supports C++ APIs for robotics task autonomy and will enable visual recognition and scene understanding through multimodal LLM support for autonomous systems.
Bottom Line
Developers can soon deploy production LLMs on Jetson devices using just 2-3 lines of Python code while achieving datacenter-level performance through NVFP4 quantization and speculative decoding, eliminating the complex ONNX export workflows currently required.
More from NVIDIA AI Podcast
March 10 - Jetson AI Lab Research Group Call - Lightning talks
This Jetson AI Lab Research Group call features lightning talks on open-source hardware for remote Jetson access, a real-time emotional AI engine for robots running entirely on Jetson Nano, and updates to the Jetson AI Lab model repository with new performance benchmarks and deployment guides.
Feb 10 - Jetson AI Lab Research Group Call - Drones on Jetson & Isaac Lab on DGX Spark
Cameron Rose presents 'Operation Squirrel,' an autonomous drone project using Jetson Orin Nano for real-time target tracking and dynamic payload delivery. The system uses a modular C++ software stack with TensorRT-optimized YOLO and OSNet running at 21 FPS, communicating via UART with a flight controller to maintain following distance through velocity commands.
Jan 13 - Jetson AI Lab Research Group Call - Accelerating Robotics with Isaac ROS on Jetson
NVIDIA's Isaac ROS team explains how their NITROS framework eliminates costly GPU memory copies in ROS 2 to enable a new era of "Physical AI" where end-to-end learned policies replace traditional robotic control, requiring tight integration of accelerated computing from simulation to deployment on Jetson.
Generating Performant 6G GPU-Accelerated Code From High-Level Programming Languages
NVIDIA's Aerial Framework enables 6G researchers to write radio access network algorithms in Python/JAX and compile them directly to GPU-accelerated TensorRT engines, eliminating the traditional rewrite-to-C++ bottleneck while meeting sub-500-microsecond real-time latency requirements for over-the-air testing.