Inference Office Hours with SGLang: Performance Optimizations for LLM Serving
TL;DR
SGLang achieves breakthrough LLM serving performance on NVIDIA GB200 through PD disaggregation, mixed-precision kernels (FP8/NVFP4), and zero-overhead speculative decoding, delivering up to 1.9x speedups over H100 while maintaining accuracy and eliminating GPU idle bubbles.
🚀 GB200 Hardware & Precision Optimization
NVLink Backbone Architecture
GB200 NVL72 uses pure NVLink connectivity across 72 GPUs instead of InfiniBand, drastically reducing communication latency for distributed expert parallelism.
Quantization Performance Gains
Transitioning from BF16 to FP8 yields a 1.8x speedup, and moving from FP8 to NVFP4 delivers a further 1.9x acceleration with negligible accuracy loss on the GPQA and MATH-500 benchmarks.
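The two reported speedups compound, which implies a rough end-to-end gain for going all the way from BF16 to NVFP4. A quick back-of-the-envelope calculation (illustrative only; real gains depend on workload and kernel coverage):

```python
# Combined-speedup arithmetic implied by the figures above.
bf16_to_fp8 = 1.8    # reported BF16 -> FP8 speedup
fp8_to_nvfp4 = 1.9   # reported FP8 -> NVFP4 speedup

bf16_to_nvfp4 = bf16_to_fp8 * fp8_to_nvfp4
print(f"BF16 -> NVFP4: ~{bf16_to_nvfp4:.2f}x")  # ~3.42x
```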
Record Throughput Metrics
On GB200, SGLang achieves 26,000 input and 13,000 output tokens per second per GPU using 48 decode ranks, significantly surpassing previous generation baselines.
⚙️ System Architecture & Scheduling
PD Disaggregation Strategy
Separating prefill and decode into distinct node types prevents task preemption, enables balanced data parallel attention with VSS, and utilizes RDMA for efficient KV cache transfer.
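The core flow of PD disaggregation can be sketched as follows. This is a minimal in-process stand-in, not SGLang's implementation: all class and field names are illustrative, and the KV cache is handed over directly where a real deployment would transfer it over RDMA between nodes.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    prompt: str
    kv_cache: list = field(default_factory=list)
    output: list = field(default_factory=list)

class PrefillWorker:
    def run(self, req: Request) -> Request:
        # Process the whole prompt once; produce the KV cache.
        req.kv_cache = [f"kv({tok})" for tok in req.prompt.split()]
        return req

class DecodeWorker:
    def run(self, req: Request, steps: int) -> Request:
        # Autoregressive decode extends the transferred KV cache,
        # never competing with long prefills for the same GPU.
        for i in range(steps):
            req.output.append(f"tok{i}")
            req.kv_cache.append(f"kv(tok{i})")
        return req

prefill_pool = [PrefillWorker()]   # one node type for prefill
decode_pool = [DecodeWorker()]     # a separate node type for decode

req = prefill_pool[0].run(Request(rid=0, prompt="hello world"))
req = decode_pool[0].run(req, steps=3)   # KV cache "transferred" here
print(req.output)  # ['tok0', 'tok1', 'tok2']
```

Because the two pools are distinct, a long prefill can never preempt in-flight decode steps, which is the scheduling property the disaggregation buys.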
Two-Batch Overlap Technique
Splitting batches into microbatches enables simultaneous computation and communication across dual CUDA streams, hiding latency by overlapping attention operations with data transfers.
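The overlap idea can be illustrated with a toy timing experiment: split a batch into two microbatches and run one microbatch's compute while the other's communication is in flight, mimicking two CUDA streams with two threads. The function names and sleep durations below are stand-ins, not real kernels.

```python
import concurrent.futures
import time

def compute(mb):
    time.sleep(0.05)           # stands in for attention/GEMM work
    return f"compute({mb})"

def communicate(mb):
    time.sleep(0.05)           # stands in for an all-to-all transfer
    return f"comm({mb})"

batch = ["mb0", "mb1"]
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # Overlap: mb0's communication runs alongside mb1's compute.
    f_comm = pool.submit(communicate, batch[0])
    f_comp = pool.submit(compute, batch[1])
    results = [f_comm.result(), f_comp.result()]
elapsed = time.perf_counter() - start
print(results, f"~{elapsed:.2f}s instead of ~0.10s serial")
```

Run serially the two 0.05 s steps would take ~0.10 s; overlapped they finish in roughly half that, which is exactly the latency-hiding effect the dual-stream design targets.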
Optimized Kernel Selection
Prefill stages leverage FlashInfer CUTLASS kernels for FP8/NVFP4 GEMMs, while decode stages use DeepGEMM and TensorRT-LLM attention kernels specifically optimized for the Blackwell architecture.
⚡ Latency Elimination Strategies
Zero-Overhead Speculative Decoding
Spec V2 eliminates GPU idle bubbles by running the overlap scheduler concurrently with speculative decoding via EagleWorkerV2, delivering 20% end-to-end speed improvements across all workloads.
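The "no idle bubbles" property amounts to the CPU always scheduling ahead of the device. A toy producer/consumer sketch of that overlap (names and timings are illustrative; this is not the EagleWorkerV2 code):

```python
import queue
import threading
import time

gpu_queue = queue.Queue()
done = []

def gpu_worker():
    # Stand-in for the device loop: drafts and verifies each batch.
    while True:
        item = gpu_queue.get()
        if item is None:
            break
        time.sleep(0.02)       # draft + verify on the "GPU"
        done.append(item)

t = threading.Thread(target=gpu_worker)
t.start()
for i in range(3):
    gpu_queue.put(f"batch{i}")
    # CPU-side scheduling for the next batch happens here, while the
    # GPU is still busy, so the device never waits on the scheduler.
t.join(timeout=0)              # don't block: CPU keeps running ahead
gpu_queue.put(None)            # signal shutdown once all work is queued
t.join()
print(done)  # ['batch0', 'batch1', 'batch2']
```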
Piecewise CUDA Graphs
Prefill operations now exclude attention kernels from CUDA graphs to avoid variable sequence padding, capturing only non-attention components while overlapping attention launch overhead with CPU work.
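The piecewise idea is that shape-stable ops are captured once and replayed, while the variable-length attention op runs eagerly outside the graph. A pure-Python stand-in (real piecewise capture uses CUDA graphs; every name below is illustrative):

```python
class PiecewiseGraph:
    def __init__(self, static_ops):
        self.static_ops = static_ops   # shape-stable, graph-safe ops
        self.captures = 0

    def capture(self):
        self.captures += 1             # a real capture records kernels

    def replay(self, x):
        # Replaying avoids per-step kernel launch overhead.
        for op in self.static_ops:
            x = op(x)
        return x

def attention(x, seq_len):
    # Variable-length op kept outside the graph: no padding needed.
    return x + seq_len

graph = PiecewiseGraph(static_ops=[lambda x: x * 2, lambda x: x + 1])
graph.capture()                        # one-time capture

# Two requests with different sequence lengths reuse one capture.
outs = [attention(graph.replay(x), seq_len=s)
        for x, s in [(1, 3), (2, 7)]]
print(outs, graph.captures)  # [6, 12] 1
```

Because attention stays eager, no graph has to be re-captured or padded for each new sequence length, and its launch overhead can be hidden behind CPU work.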
Unified Feature Compatibility
SGLang is refactoring memory pools and parallelism modules to ensure PD disaggregation, speculative decoding, and CUDA graphs can operate simultaneously using default configuration arguments.
Bottom Line
Deploy SGLang on GB200 clusters with PD disaggregation enabled, FP8 or NVFP4 precision modes active, and zero-overhead speculative decoding turned on to maximize inference throughput while minimizing latency.
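A launch along these lines might look like the sketch below. The flag names are assumptions based on common SGLang server arguments and vary across versions; verify against `python -m sglang.launch_server --help` before use.

```shell
# Hypothetical launch sketch -- flag names are assumptions, not verified.
python -m sglang.launch_server \
  --model-path <your-model> \
  --quantization fp8 \
  --speculative-algorithm EAGLE \
  --disaggregation-mode prefill   # run matching decode nodes separately
```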
More from NVIDIA AI Podcast
Apr 14 - Jetson AI Lab Research Group Call - TensorRT Edge LLM on Jetson & Culture
NVIDIA researchers Lynn Chai and Luc introduce TensorRT Edge LLM, a purpose-built inference engine for deploying large language models on Jetson edge devices. They showcase NVFP4 quantization and speculative decoding techniques that achieve up to 7x faster prefill and 500 tokens per second of generation, and preview a simplified vLLM-style Python API coming soon.
Mar 10 - Jetson AI Lab Research Group Call - Lightning talks
This Jetson AI Lab Research Group call features lightning talks on open-source hardware for remote Jetson access, a real-time emotional AI engine for robots running entirely on Jetson Nano, and updates to the Jetson AI Lab model repository with new performance benchmarks and deployment guides.
Feb 10 - Jetson AI Lab Research Group Call - Drones on Jetson & Isaac Lab on DGX Spark
Cameron Rose presents 'Operation Squirrel,' an autonomous drone project using Jetson Orin Nano for real-time target tracking and dynamic payload delivery. The system uses a modular C++ software stack with TensorRT-optimized YOLO and OSNet running at 21 FPS, communicating via UART with a flight controller to maintain following distance through velocity commands.
Jan 13 - Jetson AI Lab Research Group Call - Accelerating Robotics with Isaac ROS on Jetson
NVIDIA's Isaac ROS team explains how their NITROS framework eliminates costly GPU memory copies in ROS 2 to enable a new era of "Physical AI" where end-to-end learned policies replace traditional robotic control, requiring tight integration of accelerated computing from simulation to deployment on Jetson.