Inference Office Hours with SGLang: Performance Optimizations for LLM Serving

| Podcasts | February 06, 2026 | 1.43K views | 41:10

TL;DR

SGLang achieves breakthrough LLM serving performance on NVIDIA GB200 through PD disaggregation, mixed-precision kernels (FP8/NVFP4), and zero-overhead speculative decoding, delivering up to 1.9x speedups over H100 while maintaining accuracy and eliminating GPU idle bubbles.

🚀 GB200 Hardware & Precision Optimization

NVLink Backbone Architecture

GB200 NVL72 uses pure NVLink connectivity across all 72 GPUs instead of InfiniBand, drastically reducing communication latency for distributed expert parallelism.

Quantization Performance Gains

Transitioning from BF16 to FP8 yields a 1.8x speedup, while moving from FP8 to NVFP4 delivers an additional 1.9x acceleration with negligible accuracy loss on the GPQA and MATH-500 benchmarks.
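
The two reported per-step speedups compound multiplicatively, which gives a quick back-of-envelope figure for the full BF16-to-NVFP4 transition (numbers taken directly from the summary above):

```python
# Back-of-envelope: the reported per-step speedups compound multiplicatively.
bf16_to_fp8 = 1.8    # BF16 -> FP8 speedup reported above
fp8_to_nvfp4 = 1.9   # FP8 -> NVFP4 speedup reported above
combined = bf16_to_fp8 * fp8_to_nvfp4
print(f"Implied BF16 -> NVFP4 speedup: {combined:.2f}x")  # 3.42x
```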

Record Throughput Metrics

On GB200, SGLang achieves 26,000 input and 13,000 output tokens per second per GPU using 48 decode ranks, significantly surpassing previous generation baselines.
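
Scaling the per-GPU output figure by the stated rank count gives the implied aggregate decode throughput (simple arithmetic on the numbers above):

```python
# Implied aggregate throughput across the 48 decode ranks reported above.
output_tok_per_s_per_gpu = 13_000
decode_ranks = 48
aggregate = output_tok_per_s_per_gpu * decode_ranks
print(f"Implied aggregate decode throughput: {aggregate:,} tokens/s")
```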

⚙️ System Architecture & Scheduling

PD Disaggregation Strategy

Separating prefill and decode onto distinct node types prevents prefill work from preempting latency-sensitive decode, enables balanced data-parallel attention with VSS, and uses RDMA for efficient KV-cache transfer.
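
A minimal sketch of the request flow this describes: a prefill worker processes the whole prompt once and produces a KV cache, which is handed off (over RDMA in the real system) to a decode worker that only appends to it. The class and field names here are illustrative, not SGLang APIs.

```python
# Toy model of PD disaggregation (illustrative; not SGLang internals).
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list                                   # input token ids
    kv_cache: list = field(default_factory=list)   # stand-in for attention state
    output: list = field(default_factory=list)

class PrefillWorker:
    def run(self, req):
        # Prefill: one pass over the full prompt, building one KV entry per token.
        req.kv_cache = [("kv", tok) for tok in req.prompt]
        return req

class DecodeWorker:
    def run(self, req, max_new_tokens):
        # Decode: autoregressive steps that only append to the transferred cache.
        for step in range(max_new_tokens):
            new_tok = f"tok{step}"
            req.kv_cache.append(("kv", new_tok))
            req.output.append(new_tok)
        return req

req = DecodeWorker().run(PrefillWorker().run(Request(prompt=[1, 2, 3])), 4)
print(len(req.kv_cache), req.output)  # 7 ['tok0', 'tok1', 'tok2', 'tok3']
```

Because the two phases touch the cache in such different patterns (one bulk write vs. many small appends), running them on separate node pools lets each pool batch and schedule for its own access pattern.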

Two-Batch Overlap Technique

Splitting batches into microbatches enables simultaneous computation and communication across dual CUDA streams, hiding latency by overlapping attention operations with data transfers.
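
The overlap pattern can be sketched as a timeline: at each layer, one stream computes microbatch A while the other moves microbatch B's data, then the roles swap, so communication is always hidden behind compute. This is a schematic of the scheduling idea only, not SGLang's implementation.

```python
# Schematic timeline of two-batch overlap (illustrative, not SGLang code).
def schedule(layers):
    timeline = []
    for layer in range(layers):
        # Each tuple is one time slot: the two entries run concurrently on
        # the two CUDA streams, so comm for one microbatch hides behind
        # compute for the other.
        timeline.append((f"L{layer}", "compute(A)", "comm(B)"))
        timeline.append((f"L{layer}", "compute(B)", "comm(A)"))
    return timeline

for slot in schedule(2):
    print(slot)
```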

Optimized Kernel Selection

Prefill stages leverage FlashInfer CUTLASS kernels for FP8/NVFP4 GEMMs, while decode stages use DeepGEMM and TensorRT-LLM attention kernels specifically optimized for the Blackwell architecture.

Latency Elimination Strategies

Zero-Overhead Speculative Decoding

Spec V2 eliminates GPU idle bubbles by running the overlap scheduler concurrently with speculative decoding via EagleWorkerV2, delivering 20% end-to-end speed improvements across all workloads.
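
The core draft-then-verify loop of speculative decoding can be illustrated with toy models (this is a generic sketch of the technique, not EagleWorkerV2; both `draft` and `target_next` are hypothetical stand-ins):

```python
# Toy speculative decoding step: a cheap draft proposes k tokens, the target
# verifies them, and the accepted prefix plus one corrected token is committed.
def draft(prefix, k):
    # Hypothetical draft model: guesses the sequence keeps counting upward.
    return [prefix[-1] + 1 + i for i in range(k)]

def target_next(prefix):
    # Hypothetical target model: counts upward but skips multiples of 5.
    nxt = prefix[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_step(prefix, k=4):
    proposal = draft(prefix, k)
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        if target_next(ctx) == tok:                # draft token verified
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))      # fall back to target's token
            break
    return prefix + accepted

print(speculative_step([1, 2, 3]))  # [1, 2, 3, 4, 6]
```

The win comes from the target verifying all drafted tokens in one pass; Spec V2's contribution, per the summary above, is keeping the GPU busy with the overlap scheduler while this loop runs.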

Piecewise CUDA Graphs

Prefill operations now exclude attention kernels from CUDA graphs to avoid variable sequence padding, capturing only non-attention components while overlapping attention launch overhead with CPU work.
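
The capture policy splits the layer sequence at each attention op: static-shape segments go into CUDA graphs, while attention, whose sequence lengths vary per batch, stays eagerly launched. A minimal sketch of that partitioning logic (op names and the segment structure are illustrative):

```python
# Sketch of piecewise capture: group static-shape ops into graph-captured
# segments and leave variable-length attention eager (illustrative only).
ops = ["embed", "norm", "attention", "mlp", "norm", "attention", "mlp", "head"]

segments, current = [], []
for op in ops:
    if op == "attention":
        if current:
            segments.append(("cuda_graph", current))  # flush static segment
        segments.append(("eager", [op]))              # attention stays eager
        current = []
    else:
        current.append(op)
if current:
    segments.append(("cuda_graph", current))

for kind, group in segments:
    print(kind, group)
```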

Unified Feature Compatibility

SGLang is refactoring memory pools and parallelism modules to ensure PD disaggregation, speculative decoding, and CUDA graphs can operate simultaneously using default configuration arguments.

Bottom Line

Deploy SGLang on GB200 clusters with PD disaggregation enabled, FP8 or NVFP4 precision modes active, and zero-overhead speculative decoding turned on to maximize inference throughput while minimizing latency.
