Part of series: Inference Optimization Timeline (59 of 60)

vLLM and SGLang are open-source, Python-native, and easy to modify. TensorRT-LLM is C++-heavy, built on the closed-source TensorRT compiler, and can take a week just to understand the build system. Yet it consistently delivers 1.3-2x higher throughput on NVIDIA GPUs. The reason: TensorRT fundamentally restructures your computation graph through layer fusion, kernel auto-tuning, and precision calibration in ways PyTorch cannot match. When you compile a model with TensorRT, it fuses LayerNorm + Linear into a single kernel launch, tunes every GEMM for your specific GPU architecture, and calibrates FP8 quantization per layer. This optimization tax is paid once at compile time; every subsequent inference request benefits. The tradeoff: near-zero flexibility, NVIDIA-only hardware, and debugging that is close to impossible. This post covers TensorRT’s optimization pipeline, TensorRT-LLM’s serving features, when it wins over open-source alternatives, and the operational cost of vendor lock-in.

How TensorRT Optimizes Computation Graphs

At its core, TensorRT is a compiler. It takes a computation graph (typically imported from ONNX or defined through the TensorRT API) and produces an optimized engine — a binary blob of fused CUDA kernels tuned for a specific GPU, specific input shapes, and specific numerical precision. The optimization pipeline has several distinct phases.

Phase 1: Graph-Level Transformations

Before any kernel selection happens, TensorRT performs algebraic simplifications and structural transformations on the graph:

Constant folding eliminates operations whose outputs can be computed at build time. If a layer normalization has static scale and bias parameters, and those flow into a subsequent linear layer with known weights, TensorRT can precompute the combined transformation.
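
The idea can be sketched in NumPy for the simple case of a static elementwise affine (a norm layer's gamma/beta) flowing into a linear layer. This is a toy illustration, not TensorRT's actual folding pass — and note that only the affine part of a normalization can fold; the data-dependent normalization itself cannot:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.standard_normal((2, d_in))

# Parameters known at build time
g = rng.standard_normal(d_in)            # elementwise scale (e.g., gamma)
b = rng.standard_normal(d_in)            # elementwise bias (beta)
W = rng.standard_normal((d_in, d_out))   # linear weights
c = rng.standard_normal(d_out)           # linear bias

# Unfolded: two ops, one intermediate tensor at runtime
y_unfolded = (x * g + b) @ W + c

# Folded at "build time": a single equivalent linear layer
W_folded = g[:, None] * W                # row i of W scaled by g[i]
c_folded = b @ W + c                     # the bias flows through the GEMM
y_folded = x @ W_folded + c_folded

assert np.allclose(y_unfolded, y_folded)
```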

Dead layer elimination removes nodes whose outputs are never consumed by any downstream operation. This sounds trivial, but it matters when importing models from training frameworks that leave behind dropout layers, auxiliary loss heads, or debugging outputs.

Transpose and reshape propagation pushes data layout transformations through the graph to minimize the number of actual memory reformatting operations. If a transpose feeds into a GEMM that expects the transposed layout anyway, the transpose node can be eliminated entirely.

Phase 2: Layer Fusion

Layer fusion is where TensorRT’s real performance gains begin. The optimizer identifies patterns of adjacent operations that can be executed by a single, hand-optimized CUDA kernel instead of multiple separate kernel launches. Each separate kernel launch incurs overhead: the CPU must issue the launch, the GPU must schedule it, and intermediate results must be written to and read from global memory. Fusion eliminates all of this.

The key fusion patterns for transformer models include:

GEMM + Bias + Activation fusion. A linear layer followed by a bias addition followed by a GeLU or SiLU activation becomes a single kernel. The bias addition and activation are computed inline as the GEMM writes its output, avoiding two additional global memory round-trips.

Multi-Head Attention fusion. The Q, K, V projections (three separate GEMMs), the scaled dot-product attention, and the output projection can be fused into a single Flash Attention-style kernel. TensorRT-LLM ships with highly optimized fused MHA/MQA/GQA kernels for various head dimensions and sequence lengths.

LayerNorm + Linear fusion. The normalization and subsequent projection are fused so that the normalized values never materialize in global memory.

Residual connection fusion. The element-wise addition of a residual connection is folded into the preceding operation’s output write.

# Unfused execution (simplified pseudocode showing memory round-trips):
#
# x = LayerNorm(input)        # Read input, write x to GMEM
# q = Linear_Q(x)             # Read x, write q to GMEM
# k = Linear_K(x)             # Read x, write k to GMEM
# v = Linear_V(x)             # Read x, write v to GMEM
# attn = ScaledDotProduct(q, k, v)  # Read q, k, v, write attn to GMEM
# out = Linear_O(attn)        # Read attn, write out to GMEM
# result = out + input         # Read out, read input, write result to GMEM
#
# Fused execution:
#
# result = FusedAttentionBlock(input)  # Read input once, write result once
# Internal computation stays in registers/shared memory
Fusion Eliminates Memory Bottlenecks

Modern GPUs have compute throughput that far exceeds their memory bandwidth. An A100 delivers 312 TFLOPS of FP16 compute but only 2 TB/s of HBM bandwidth. For a typical transformer layer, the unfused execution is memory-bandwidth-bound — most of the time is spent moving data between global memory and the compute units. Fusion keeps intermediate values in registers and shared memory, shifting the bottleneck from memory to compute where GPUs excel.
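
The arithmetic behind this claim is worth making explicit. A back-of-envelope roofline calculation with the A100 numbers above shows how far an unfused elementwise op sits from the compute-bound regime:

```python
# A100 peak numbers quoted above
peak_flops = 312e12        # FP16 Tensor Core FLOP/s
peak_bw = 2e12             # HBM bytes/s

# Ridge point: arithmetic intensity needed to become compute-bound
ridge = peak_flops / peak_bw          # 156 FLOP per byte

# An unfused FP16 residual add: 1 FLOP per element,
# 2 bytes read twice + 2 bytes written = 6 bytes of traffic
elementwise_intensity = 1 / 6         # ~0.17 FLOP per byte

# ~1000x below the ridge point: pure memory-bandwidth territory
assert ridge == 156.0
assert elementwise_intensity < ridge / 100
```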

Phase 3: Kernel Auto-Tuning

After fusion determines what to compute, kernel auto-tuning determines how. For each fused operation, TensorRT has a library of implementation variants — different tile sizes, different numbers of warps, different use of shared memory, different instruction sequences. During the build phase, TensorRT benchmarks each variant on the actual target GPU with the actual input dimensions and selects the fastest one.

This is why TensorRT engines are not portable across GPU architectures. An engine built on an A100 will not run on an H100, and an engine built for sequence length 2048 may not be optimal for sequence length 512. The auto-tuning is also why the build phase can take minutes to hours for large models: TensorRT is literally running thousands of micro-benchmarks.

For GEMM operations specifically, TensorRT evaluates:

  • Tile dimensions: How the output matrix is partitioned across thread blocks (e.g., 128x128, 64x256, 256x64)
  • Warp-level GEMM shape: The tile size each warp processes (e.g., 16x16x16 for Tensor Core wmma instructions)
  • Pipeline depth: How many stages of double-buffered shared memory loads to overlap with computation
  • Epilogue fusion: Whether the bias, activation, and residual addition are fused into the GEMM epilogue or handled separately
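
The build-time search can be sketched as a generic micro-benchmarking loop. Here the "kernel variants" are CPU NumPy stand-ins parameterized by tile size — purely illustrative; TensorRT's actual tactic selection is internal to the builder:

```python
import time
import numpy as np

def benchmark(fn, *args, iters=20):
    """Best-of-N latency for one candidate implementation."""
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def make_tiled_matmul(tile):
    """A blocked GEMM with a given tile size (stand-in for a kernel variant)."""
    def matmul(a, b):
        n = a.shape[0]
        out = np.zeros((n, n), dtype=a.dtype)
        for i in range(0, n, tile):
            for j in range(0, n, tile):
                for k in range(0, n, tile):
                    out[i:i+tile, j:j+tile] += (
                        a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
                    )
        return out
    return matmul

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

candidates = {tile: make_tiled_matmul(tile) for tile in (32, 64, 128)}
timings = {tile: benchmark(fn, a, b) for tile, fn in candidates.items()}
best_tile = min(timings, key=timings.get)   # the "tactic" the builder records
```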
📊

Kernel Auto-Tuning Impact: GEMM Variants on H100 (FP16, M=4096, N=4096, K=4096)

Variant          Tile Size  Pipeline Stages  Latency (us)  TFLOPS
Naive            64x64      1                342           401
Good             128x128    3                198           693
Better           128x256    4                156           880
Auto-tuned best  256x128    5                131           1048

Note: Measured with TensorRT 10.x builder profiling. Actual best variant depends on surrounding fusion context.

Phase 4: Precision Calibration

TensorRT supports FP32, FP16, BF16, FP8 (on Hopper and later), and INT8 execution. Lower precision reduces memory bandwidth requirements (the dominant bottleneck) and increases Tensor Core throughput. But lower precision introduces quantization error that can degrade model quality.

FP16/BF16 is essentially free for transformer models. The accuracy loss is negligible, and you get a 2x reduction in memory bandwidth and a 2x increase in Tensor Core throughput. This is the baseline for any serious deployment.

INT8 requires calibration. TensorRT runs the model on a representative calibration dataset in FP32, collects the activation distribution at each layer, and determines optimal per-tensor or per-channel scale factors that minimize quantization error. The calibration strategies include:

  • MinMax calibration: Uses the observed min/max of activations to set the quantization range. Simple but sensitive to outliers.
  • Entropy calibration (KL divergence): Finds the quantization range that minimizes the information loss between the FP32 and INT8 distributions. More robust than MinMax.
  • Percentile calibration: Clips the top/bottom N% of the activation distribution before setting the range, providing outlier robustness.
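
MinMax and percentile calibration are easy to sketch in NumPy (entropy calibration's KL-divergence search is omitted for brevity). On outlier-heavy activations, percentile trades a little clipping for much finer resolution on the bulk of values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated activations: a well-behaved bulk plus a few large outliers
bulk = rng.normal(0, 1, 100_000)
outliers = rng.normal(0, 50, 100)
acts = np.concatenate([bulk, outliers])

def minmax_scale(x, qmax=127):
    """MinMax: the range covers every observed value (outlier-sensitive)."""
    return np.abs(x).max() / qmax

def percentile_scale(x, pct=99.99, qmax=127):
    """Percentile: clip the extreme tail before setting the range."""
    return np.percentile(np.abs(x), pct) / qmax

def quant_error(x, scale, qmax=127):
    """Mean squared INT8 round-trip error at a given scale."""
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return np.mean((q * scale - x) ** 2)

s_minmax = minmax_scale(acts)
s_pct = percentile_scale(acts)

# Clipping a handful of outliers buys much finer resolution for the bulk
assert s_pct < s_minmax
assert quant_error(bulk, s_pct) < quant_error(bulk, s_minmax)
```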

FP8 (E4M3 and E5M2 formats) on Hopper GPUs provides a sweet spot between FP16 and INT8: nearly FP16 accuracy with INT8-class throughput. We cover the FP8 workflow in detail later in this post.

📊

Precision vs Throughput vs Quality (Llama-2 70B, single H100, batch=1, output 128 tokens)

Precision           Tokens/s  Memory (GB)  MMLU Score  Relative Quality
FP16                38        140          68.9        Baseline
FP8 (E4M3)          72        75           68.7        -0.3%
INT8 (SmoothQuant)  65        75           68.2        -1.0%
INT4 (AWQ)          95        42           67.1        -2.6%
INT4 (GPTQ)         93        42           66.8        -3.0%

Note: Quality measured on MMLU 5-shot. Memory includes KV cache for 2048-token context.

TensorRT-LLM Architecture

TensorRT-LLM is not simply TensorRT applied to LLMs. It is a separate library built on top of TensorRT that adds the runtime machinery needed for autoregressive text generation. The architecture has three major components.

The Model Definition Layer

TensorRT-LLM provides a Python API for defining transformer architectures using TensorRT primitives. This API looks superficially like PyTorch but produces a TensorRT network graph instead of executing eagerly:

import tensorrt_llm
from tensorrt_llm.layers import (
    Attention, ColumnLinear, GatedMLP,
    RmsNorm, Embedding
)

class LlamaDecoderLayer(tensorrt_llm.Module):
    def __init__(self, config):
        super().__init__()
        self.input_layernorm = RmsNorm(
            normalized_shape=config.hidden_size,
            eps=config.rms_norm_eps
        )
        self.attention = Attention(
            hidden_size=config.hidden_size,
            num_attention_heads=config.num_attention_heads,
            num_kv_heads=config.num_key_value_heads,
            max_position_embeddings=config.max_position_embeddings,
            attention_type='gqa'  # Grouped Query Attention
        )
        self.post_attention_layernorm = RmsNorm(
            normalized_shape=config.hidden_size,
            eps=config.rms_norm_eps
        )
        self.mlp = GatedMLP(
            hidden_size=config.hidden_size,
            ffn_hidden_size=config.intermediate_size,
            hidden_act='silu'
        )

    def forward(self, hidden_states, attention_mask, position_ids, kv_cache):
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.attention(
            hidden_states, attention_mask, position_ids, kv_cache
        )
        hidden_states = residual + hidden_states

        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        return hidden_states

When you call build(), this Python description is traced into a TensorRT graph, optimized through the fusion/autotuning pipeline described above, and serialized into an engine file.

The Runtime: In-Flight Batching and Paged KV Cache

The TensorRT-LLM runtime (implemented in C++) handles the dynamic aspects of LLM serving that a static TensorRT engine cannot:

In-flight batching (also called continuous batching or iteration-level scheduling) allows new requests to join a running batch without waiting for all current requests to finish. In autoregressive generation, different requests finish at different times. Without in-flight batching, the GPU sits idle for the fastest requests while waiting for the slowest. With it, new requests are immediately inserted into the freed slots.
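
The slot-reuse logic can be sketched in a few lines (hypothetical dict-based request records; the real scheduler lives in the C++ runtime):

```python
import collections

def schedule_step(running, waiting, max_batch):
    """One iteration of iteration-level scheduling (toy version):
    finished requests leave the batch, waiting requests fill freed slots."""
    running = [r for r in running if not r["done"]]
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())
    return running

running = [{"id": 0, "done": True}, {"id": 1, "done": False}]
waiting = collections.deque([{"id": 2, "done": False}, {"id": 3, "done": False}])

# Request 0 finished this iteration; requests 2 and 3 join immediately
running = schedule_step(running, waiting, max_batch=3)
assert [r["id"] for r in running] == [1, 2, 3]
```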

Paged KV cache borrows the virtual memory concept from operating systems. Instead of pre-allocating a contiguous KV cache for the maximum sequence length, memory is allocated in small fixed-size pages (typically 64-128 tokens each). Pages are allocated on demand as the sequence grows and can be non-contiguous in physical GPU memory. This dramatically reduces memory waste from over-allocation and enables serving more concurrent requests.
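
A toy page-table allocator illustrates the mechanics — pages come out of a fixed physical pool on demand and return the moment a request finishes (illustrative sketch only, not TensorRT-LLM's block manager):

```python
class PagedKVAllocator:
    """Toy paged KV cache allocator: physical pages + per-request page tables."""

    def __init__(self, num_pages, page_size=64):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))   # physical page pool
        self.page_tables = {}                      # request id -> physical page ids
        self.lengths = {}                          # request id -> tokens written

    def append_tokens(self, req_id, n):
        """Grow a sequence; allocate pages only when it spills into a new one."""
        table = self.page_tables.setdefault(req_id, [])
        self.lengths[req_id] = self.lengths.get(req_id, 0) + n
        while len(table) * self.page_size < self.lengths[req_id]:
            if not self.free_pages:
                raise MemoryError("KV pool exhausted; preempt or reject")
            table.append(self.free_pages.pop())

    def free(self, req_id):
        """Return a finished request's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

alloc = PagedKVAllocator(num_pages=8, page_size=64)
alloc.append_tokens("req-A", 100)   # prompt: ceil(100/64) = 2 pages
alloc.append_tokens("req-A", 30)    # decode: 130 tokens -> a 3rd page
assert len(alloc.page_tables["req-A"]) == 3
alloc.free("req-A")
assert len(alloc.free_pages) == 8
```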

Chunked prefill breaks long input prompts into chunks and interleaves their processing with decode steps from other requests, preventing long prompts from creating latency spikes for concurrent short requests.

Multi-GPU Support: Tensor Parallelism and Pipeline Parallelism

For models too large to fit on a single GPU, TensorRT-LLM supports:

Tensor parallelism splits individual layers across GPUs. Each attention head group and each MLP column is assigned to a different GPU. This requires all-reduce communication after every attention and MLP layer but keeps latency low because every GPU works on every token.

Pipeline parallelism assigns different layers to different GPUs. GPU 0 processes layers 0-19, GPU 1 processes layers 20-39, and so on. This requires less communication (only at pipeline stage boundaries) but introduces pipeline bubbles and is primarily useful for very large models where tensor parallelism alone is insufficient.
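
The two tensor-parallel building blocks — column-parallel and row-parallel linear layers — can be verified numerically in NumPy. The sketch below simulates the GPUs with array slices; the `sum(...)` in the second half is exactly the all-reduce the text describes:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, ffn, tp = 16, 32, 4
x = rng.standard_normal((2, hidden))
W1 = rng.standard_normal((hidden, ffn))
W2 = rng.standard_normal((ffn, hidden))

# Column parallelism: each "GPU" owns a slice of W1's output columns
col_shards = np.split(W1, tp, axis=1)
partials = [x @ w for w in col_shards]       # runs concurrently in reality
y1 = np.concatenate(partials, axis=1)        # all-gather of column slices
assert np.allclose(y1, x @ W1)

# Row parallelism for the next projection: each rank holds rows of W2
# and the matching columns of the input; partial outputs must be summed
row_shards = np.split(W2, tp, axis=0)
col_inputs = np.split(y1, tp, axis=1)
y2 = sum(xi @ wi for xi, wi in zip(col_inputs, row_shards))  # the all-reduce
assert np.allclose(y2, y1 @ W2)
```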

📊

TensorRT-LLM Multi-GPU Scaling (Llama-2 70B, H100 NVLink, batch=64, output 128 tokens)

GPUs     Parallelism  Tokens/s  Latency P50 (ms)  Efficiency
1 (FP8)  None         2,100     3,900             100%
2 (FP8)  TP=2         3,800     2,150             90%
4 (FP8)  TP=4         6,900     1,180             82%
8 (FP8)  TP=8         11,200    730               67%
8 (FP8)  TP=4, PP=2   10,500    780               63%

Note: Efficiency = (N-GPU throughput) / (N * 1-GPU throughput). TP scaling limited by NVLink all-reduce overhead.

The FP8 Inference Workflow

FP8 on Hopper (H100/H200) GPUs is the current sweet spot for LLM inference. The E4M3 format (4 exponent bits, 3 mantissa bits) provides enough dynamic range and precision for most transformer weights and activations while doubling Tensor Core throughput compared to FP16.

Step 1: Quantize the Model

There are two approaches:

Post-Training Quantization (PTQ) quantizes a pre-trained FP16 model by calibrating scale factors on a representative dataset. NVIDIA’s ammo (now modelopt) toolkit handles this:

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint
from transformers import AutoModelForCausalLM

# Load your HuggingFace model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")

# Define FP8 quantization config
quant_config = mtq.FP8_DEFAULT_CFG
# This applies per-tensor FP8 quantization to:
#   - Linear layer weights (E4M3)
#   - Linear layer activations (E4M3)
#   - Attention BMM inputs (E4M3)

# Calibrate on representative data
def calibrate_loop(model):
    for batch in calibration_dataloader:
        model(batch["input_ids"].cuda())

mtq.quantize(model, quant_config, forward_loop=calibrate_loop)

# Export quantized checkpoint for TensorRT-LLM
export_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",
    dtype="bfloat16",     # Non-quantized layers stay in BF16
    export_dir="./llama-70b-fp8-ckpt",
    inference_tensor_parallel=4
)

Quantization-Aware Training (QAT) inserts fake-quantization nodes during fine-tuning so the model learns to be robust to FP8 rounding. This produces slightly better quality than PTQ but requires a training run.
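
To make "fake quantization" concrete, here is a rough NumPy sketch of rounding values to E4M3's representable grid. It ignores subnormals and NaN handling, and `fake_quant_e4m3` is an illustrative helper, not a modelopt API:

```python
import numpy as np

def fake_quant_e4m3(x, scale):
    """Round x/scale to the nearest E4M3-representable normal value,
    then rescale back to the original range. Sketch: no subnormals/NaNs."""
    y = np.clip(x / scale, -448.0, 448.0)   # E4M3 max normal is +/-448
    mant, exp = np.frexp(y)                 # y = mant * 2**exp, |mant| in [0.5, 1)
    mant = np.round(mant * 16) / 16         # keep 3 explicit mantissa bits
    return np.ldexp(mant, exp) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
scale = np.abs(w).max() / 448.0             # per-tensor scale from calibration
w_q = fake_quant_e4m3(w, scale)

# E4M3 keeps roughly 2 significant decimal digits
rel_err = np.abs(w_q - w).max() / np.abs(w).max()
assert rel_err < 0.05
```

During QAT, a differentiable version of this rounding sits after each quantized layer so the weights learn to live on (or near) the FP8 grid.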

Step 2: Build the TensorRT-LLM Engine

# Build the engine from the quantized checkpoint
trtllm-build \
    --checkpoint_dir ./llama-70b-fp8-ckpt \
    --output_dir ./llama-70b-fp8-engine \
    --gemm_plugin fp8 \
    --gpt_attention_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096 \
    --tp_size 4 \
    --workers 4 \
    --use_paged_context_fmha enable \
    --multiple_profiles enable

The --gemm_plugin fp8 flag tells TensorRT to use FP8 Tensor Core kernels for all GEMM operations. The --gpt_attention_plugin float16 keeps the attention softmax in FP16 to avoid the limited dynamic range of FP8 for probability distributions.

Step 3: Serve with the Triton Inference Server

# In production, TensorRT-LLM engines are typically deployed
# behind Triton Inference Server with the TRT-LLM backend:
#
# model_repository/
#   llama-70b/
#     config.pbtxt
#     1/
#       # Engine files placed here
#
# Triton handles HTTP/gRPC endpoints, request queuing,
# and the in-flight batching scheduler.
💡 FP8 Calibration Tips

For FP8 PTQ, the calibration dataset matters more than you might expect. Use 512-1024 samples that are representative of your production traffic. Calibration with only English text and then serving multilingual requests can cause quality degradation in non-English languages. Also, calibrate with the sequence lengths you will actually serve — the activation distributions shift with sequence length.

TensorRT-LLM vs vLLM vs SGLang

The choice between TensorRT-LLM and the open-source alternatives is not straightforward. Each has genuine strengths.

vLLM

vLLM pioneered PagedAttention and made continuous batching accessible to the open-source community. Its strengths:

  • Broad model support: Community-driven model implementations cover nearly every architecture on HuggingFace, including new models within days of release
  • Simple deployment: pip install vllm && python -m vllm.entrypoints.openai.api_server gets you running in minutes
  • Flexibility: Custom sampling strategies, speculative decoding, LoRA serving, prefix caching are all first-class features
  • Active community: Bugs get fixed quickly, new features appear rapidly

Its limitations relative to TensorRT-LLM: vLLM’s CUDA kernels are good but not as aggressively optimized. vLLM relies on torch.compile and manually written Triton kernels rather than TensorRT’s exhaustive auto-tuning. For a given model on a given GPU, TensorRT-LLM typically produces 15-40% higher throughput.

SGLang

SGLang focuses on structured generation and complex LLM programs (chains of calls, branching logic, constrained decoding). Its strengths:

  • RadixAttention: Efficient prefix sharing across requests, which is particularly valuable for chat applications and few-shot prompting where many requests share the same system prompt
  • Constrained decoding: Native support for JSON schema and regex-constrained generation with minimal overhead via a compressed finite state machine
  • Frontend language: A Python DSL for expressing multi-turn LLM programs that enables automatic optimization of the execution plan

SGLang’s raw throughput on simple generation tasks is competitive with vLLM but generally behind TensorRT-LLM.

When TensorRT-LLM Wins

📊

Framework Comparison: Llama-2 70B, 4xH100 NVLink, Various Workloads

Workload                         TRT-LLM (tok/s)  vLLM (tok/s)  SGLang (tok/s)  TRT-LLM Advantage
Batch=1, latency-optimized       68               52            54              +31%
Batch=64, throughput-optimized   6,900            5,100         5,300           +35%
Batch=256, throughput-saturated  12,400           10,800        11,100          +15%
Long context (32k tokens)        1,200            1,050         1,100           +14%
Shared prefix (90% overlap)      8,100            6,800         9,200           -12% vs SGLang

Note: FP8 quantization for TRT-LLM, AWQ INT4 for vLLM/SGLang (comparable quality). March 2025 versions.

TensorRT-LLM is the right choice when:

  1. Latency is the primary constraint. If you need the absolute lowest time-to-first-token and time-per-output-token for a single model, TensorRT-LLM’s kernel auto-tuning and aggressive fusion produce measurably faster results. Real-time applications — voice assistants, interactive coding tools, trading systems — benefit directly.

  2. You are deploying a single, stable model. TensorRT engines are built for a specific model with specific maximum shapes. If your model changes rarely and your input distribution is predictable, the upfront build cost is amortized over millions of requests.

  3. You are on NVIDIA hardware and want maximum utilization. TensorRT is tuned specifically for each NVIDIA GPU generation. On H100s with FP8, the throughput advantage over vLLM can exceed 35%.

  4. You need production-grade reliability at scale. TensorRT-LLM plus Triton Inference Server is the stack NVIDIA supports for enterprise deployments. If you are operating hundreds of GPUs, the monitoring, health checks, and operational tooling around Triton matter.

When TensorRT-LLM Loses

  1. Rapid model iteration. Building a TensorRT-LLM engine for a 70B model takes 30-60 minutes. If you are experimenting with different models, different quantizations, or different LoRA adapters daily, the build overhead is painful. vLLM loads a HuggingFace checkpoint in minutes.

  2. Multi-model and multi-LoRA serving. If you need to serve dozens of fine-tuned variants of the same base model, vLLM’s LoRA support lets you swap adapters at request time with a single base model in memory. TensorRT-LLM requires a separate engine for each LoRA (though NVIDIA is working on runtime LoRA support).

  3. Cutting-edge model architectures. When a new architecture drops (Mamba, RWKV, a novel attention pattern), vLLM and SGLang support it within days thanks to community contributions. TensorRT-LLM support may lag by weeks or months because adding a new architecture requires writing optimized C++ kernels.

  4. Structured generation workflows. If your application requires heavy constrained decoding, multi-step LLM programs, or complex prefix sharing patterns, SGLang’s RadixAttention and constrained decoding engine will outperform TensorRT-LLM.

  5. Non-NVIDIA hardware. TensorRT is NVIDIA-only. If you need to target AMD, Intel, or cloud TPUs, vLLM (with ROCm support) or other frameworks are your only option.

⚠️ The Build Time Tax

TensorRT-LLM’s build process is not just slow — it is also fragile. The engine is tied to a specific TensorRT version, a specific CUDA toolkit version, and specific maximum input/output lengths. Upgrading TensorRT or changing your max sequence length requires a full rebuild. Plan your CI/CD pipeline accordingly.

Deep Dive: How Layer Fusion Works for Transformers

To understand the performance difference between TensorRT-LLM and framework-based solutions, it helps to trace a single transformer layer’s execution in detail.

The Unfused Baseline

In a standard PyTorch execution of a Llama decoder layer:

  1. RMSNorm: Read hidden_states from HBM (8192 * batch * 2 bytes for FP16), compute variance, normalize, write back. Two global memory passes.
  2. Q projection: Read normalized hidden_states + weights, compute GEMM, write Q to HBM.
  3. K projection: Read normalized hidden_states + weights, compute GEMM, write K to HBM.
  4. V projection: Read normalized hidden_states + weights, compute GEMM, write V to HBM.
  5. RoPE: Read Q and K from HBM, apply rotary positional embeddings, write back.
  6. KV cache update: Read K, V, write to cache.
  7. Attention: Read Q, K (from cache), compute QK^T, softmax, read V, compute output, write to HBM.
  8. Output projection: Read attention output + weights, GEMM, write to HBM.
  9. Residual add: Read output projection result + original input, add, write to HBM.
  10. Second RMSNorm: Read, normalize, write.
  11. Gate projection: Read + GEMM + write.
  12. Up projection: Read + GEMM + write.
  13. SiLU + element-wise multiply: Read gate output + up output, compute, write.
  14. Down projection: Read + GEMM + write.
  15. Second residual add: Read + add + write.

That is 15 separate kernel launches and roughly 30 global memory reads/writes of the hidden state tensor.

The Fused TensorRT-LLM Execution

After TensorRT’s optimization:

  1. Fused RMSNorm + QKV GEMM + RoPE: A single kernel reads hidden_states, normalizes in registers, computes the three projections with a single wide GEMM (fused Q/K/V), applies RoPE to Q and K, and writes Q, K, V directly to the appropriate memory locations (K and V go directly to the paged cache).
  2. Fused Flash Attention + Output Projection: A single kernel reads Q from registers/shared memory, streams K/V from the paged cache, computes attention with online softmax (never materializing the full attention matrix), applies the output projection, and writes the result.
  3. Fused Residual + RMSNorm + GatedMLP + Residual: A single kernel reads the attention output, adds the first residual, normalizes, computes gate and up projections with fused GEMM, applies SiLU, computes the element-wise multiply, computes the down projection, adds the second residual, and writes the final output.

That is 3 kernel launches and roughly 4-6 global memory round-trips instead of 30.
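
A back-of-envelope tally of the hidden-state traffic, using the pass counts above (one decode step, batch=1):

```python
# Hidden-state traffic per layer for one decode step (batch=1)
hidden = 8192                           # Llama-70B hidden size
bytes_per_pass = hidden * 2             # FP16: 16 KiB per read or write

unfused_traffic = 30 * bytes_per_pass   # ~30 GMEM passes in the unfused trace
fused_traffic = 5 * bytes_per_pass      # ~4-6 passes after fusion

assert unfused_traffic // fused_traffic == 6
```

The per-layer savings look modest in absolute bytes, but they repeat for all 80 layers on every decode step, and the 15-to-3 reduction in kernel launches matters at least as much at small batch sizes.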

Memory Bandwidth Utilization: Unfused vs Fused Transformer Layer (H100, Llama-70B, batch=1)

Configuration           Regime         Effective Bandwidth
Unfused (PyTorch)       Memory-bound   580 GB/s
Partially fused (vLLM)  Mixed          1,100 GB/s (+89.7%)
Fully fused (TRT-LLM)   Compute-bound  1,850 GB/s (+219.0%)

Precision Calibration in Practice

The INT8 SmoothQuant Workflow

SmoothQuant addresses the challenge of quantizing transformer activations, which often have outlier channels with much larger magnitudes than the rest. Instead of naively quantizing and clipping these outliers, SmoothQuant migrates the quantization difficulty from activations to weights by applying a mathematically equivalent per-channel scaling:

Given Y = XW, SmoothQuant transforms this to Y = (X · diag(s)^-1) · (diag(s) · W), where s is a per-channel smoothing factor. The smoothed activations X · diag(s)^-1 have a more uniform distribution that quantizes cleanly to INT8, and the smoothed weights diag(s) · W absorb the scaling offline.
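
SmoothQuant's per-channel factors are conventionally set as s_j = max|X_j|^α / max|W_j|^(1-α) with a migration strength α (0.5 in the paper). A NumPy sketch, with a synthetic outlier channel, verifies both the exactness of the transformation and the taming of the outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 32
X = rng.standard_normal((128, d_in))
X[:, 3] *= 50.0                        # an outlier activation channel
W = rng.standard_normal((d_in, d_out))

# Per-channel smoothing factor with migration strength alpha
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                       # outlier channel is tamed
W_smooth = s[:, None] * W              # scaling absorbed into weights offline

# The transformation is mathematically equivalent...
assert np.allclose(X_smooth @ W_smooth, X @ W)
# ...and the activation outlier shrinks dramatically
assert np.abs(X_smooth[:, 3]).max() < np.abs(X[:, 3]).max()
```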

The FP8 Calibration Nuances

FP8 E4M3 has a dynamic range of approximately ±448, with a minimum subnormal of about 0.00195. This is narrower than FP16 (±65504) but wider than INT8 (-128 to 127 before scaling). The key calibration decisions:

Per-tensor vs per-channel scaling: Per-tensor scaling applies a single scale factor to the entire weight matrix or activation tensor. Per-channel scaling applies different factors per output channel. Per-channel is more accurate but requires the GEMM kernel to handle non-uniform scaling, which adds overhead. TensorRT-LLM supports both; per-tensor is the default and sufficient for most models.

Static vs dynamic scaling for activations: Static scaling uses a fixed scale factor determined during calibration. Dynamic scaling computes the scale factor at runtime from the actual activation values. Dynamic is more robust to distribution shifts but requires an extra reduction kernel per GEMM. For LLM inference where the activation distributions are reasonably stable, static scaling is preferred.

Which layers to quantize: Not all layers benefit equally from FP8. The first and last layers of the model, and the attention softmax, are often kept in higher precision. TensorRT-LLM’s default FP8 config quantizes all linear layers and attention BMMs while keeping softmax, layer norms, and the final LM head in FP16/BF16.

📊

FP8 Calibration Strategy Impact (Llama-2 13B, H100)

Strategy                        Calibration Time  Tokens/s  MMLU Delta  Notes
FP16 baseline                   N/A               185       0.0%        Reference
FP8, per-tensor static          15 min            345       -0.2%       Recommended default
FP8, per-channel static         15 min            330       -0.1%       Slight throughput cost
FP8, per-tensor dynamic         N/A               320       -0.1%       No calibration needed
FP8, mixed (keep LM head FP16)  15 min            340       -0.1%       Best quality/speed

Note: Calibration time is for the offline PTQ step on 512 samples.

Advanced Topics

Custom Plugin Development

When TensorRT’s built-in fusion patterns do not cover a novel operation — perhaps a new attention variant, a custom normalization scheme, or a routing mechanism for mixture-of-experts — you need to write a TensorRT plugin. This is a C++ class that implements the IPluginV2DynamicExt interface:

class MoERoutingPlugin : public nvinfer1::IPluginV2DynamicExt {
public:
    // Called during build to specify output shapes
    DimsExprs getOutputDimensions(
        int outputIndex,
        const DimsExprs* inputs,
        int nbInputs,
        IExprBuilder& exprBuilder
    ) noexcept override;

    // Called during build to configure the plugin
    void configurePlugin(
        const DynamicPluginTensorDesc* in,
        int nbInputs,
        const DynamicPluginTensorDesc* out,
        int nbOutputs
    ) noexcept override;

    // The actual GPU kernel dispatch
    int enqueue(
        const PluginTensorDesc* inputDesc,
        const PluginTensorDesc* outputDesc,
        const void* const* inputs,
        void* const* outputs,
        void* workspace,
        cudaStream_t stream
    ) noexcept override;

    // Workspace memory requirements
    size_t getWorkspaceSize(
        const PluginTensorDesc* inputs,
        int nbInputs,
        const PluginTensorDesc* outputs,
        int nbOutputs
    ) const noexcept override;

    // (clone, supportsFormatCombination, and the serialization
    // methods required by the interface are omitted for brevity)
};

The enqueue method is where you launch your custom CUDA kernel. The challenge is that your plugin participates in TensorRT’s optimization pipeline: TensorRT may run your plugin in FP16 or FP8 if you report support for those formats, but it treats the plugin as an opaque node and will not fuse layers across its boundary. A slow plugin therefore costs more than its own runtime — it also blocks the fusion opportunities around it.

Speculative Decoding in TensorRT-LLM

Speculative decoding uses a small “draft” model to generate multiple candidate tokens, then verifies them in parallel with the large “target” model. If the draft model guesses correctly (which it does 60-80% of the time for typical text), you get multiple tokens for the cost of one target model forward pass.

TensorRT-LLM supports speculative decoding with both separate draft models and self-speculative decoding (using the target model’s early layers as the draft). The key optimization is that verification is a single target-model forward pass over all draft tokens at once — effectively a mini-prefill of speculation-length tokens — which amortizes the memory bandwidth cost of loading the model weights across several output tokens.

# Configuring speculative decoding in TensorRT-LLM
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="./llama-70b-fp8-engine",
    rank=0,
    # Speculative decoding config
    max_draft_len=5,           # Speculate up to 5 tokens
    is_medusa=False,           # Using standard draft model
)
📊

Speculative Decoding Impact (Llama-2 70B target, Llama-2 7B draft, H100)

| Scenario | Standard (tok/s) | Speculative (tok/s) | Acceptance Rate | Speedup |
|---|---|---|---|---|
| Code generation | 38 | 68 | 78% | 1.79x |
| English prose | 38 | 62 | 72% | 1.63x |
| Translation (en-zh) | 38 | 52 | 58% | 1.37x |
| Creative writing | 38 | 48 | 52% | 1.26x |
Note: Batch=1, latency-optimized. Acceptance rate varies by domain and temperature.
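The accept/reject mechanics behind those acceptance rates can be sketched for the greedy case (a hypothetical helper, not the TensorRT-LLM API): draft tokens are compared position-by-position against the target model’s outputs from the single verification pass, and generation keeps every token up to the first disagreement, plus one correction or bonus token:

```python
def speculative_step(draft_tokens, target_argmax):
    """Greedy verification of one speculation round.

    draft_tokens: k tokens proposed by the draft model
    target_argmax: the target model's argmax at each of the k draft
                   positions plus one bonus position (length k + 1)
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if target_argmax[i] == tok:
            accepted.append(tok)          # draft guessed correctly
        else:
            accepted.append(target_argmax[i])  # correction token
            return accepted               # discard remaining drafts
    accepted.append(target_argmax[len(draft_tokens)])  # bonus token
    return accepted
```

Even on a full rejection you still emit one (correction) token per target pass, which is why speedup degrades gracefully as the acceptance rate drops.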

Memory Planning and KV Cache Budgeting

A critical operational concern with TensorRT-LLM is memory planning. The engine itself consumes a fixed amount of GPU memory (model weights), but the KV cache grows with the number of concurrent requests and their sequence lengths. The relationship is:

\text{KV cache memory} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{precision\_bytes} \times \text{total\_active\_tokens}

For Llama-2 70B in FP16 with GQA (8 KV heads, 128 head dim, 80 layers):

\text{Per-token KV} = 2 \times 80 \times 8 \times 128 \times 2 = 327{,}680 \text{ bytes} \approx 320 \text{ KB}

On a single H100 with 80 GB, after loading INT4 weight-only quantized weights (~35 GB; FP8 weights for 70B parameters run ~70 GB and would not fit alongside a useful KV cache) and reserving memory for activations (~5 GB), you have ~40 GB for KV cache. That supports:

\text{Max tokens} = \frac{40 \times 10^9}{327{,}680} \approx 122{,}000 \text{ tokens}

With a max sequence length of 4096, that is roughly 30 concurrent requests. With paged attention, unused pages are freed immediately, so the actual concurrency can be higher for shorter sequences.
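The budgeting arithmetic above generalizes to a small helper (numbers below match the Llama-2 70B worked example with a FP16 KV cache):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, precision_bytes):
    """Per-token KV cache footprint; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * d_head * precision_bytes

def kv_budget(free_bytes, per_token_bytes, max_seq_len):
    """Total cacheable tokens and worst-case concurrent full-length requests."""
    max_tokens = free_bytes // per_token_bytes
    return max_tokens, max_tokens // max_seq_len

# Llama-2 70B with GQA: 80 layers, 8 KV heads, head dim 128, FP16 (2 B)
per_tok = kv_bytes_per_token(n_layers=80, n_kv_heads=8, d_head=128,
                             precision_bytes=2)  # 327,680 B ≈ 320 KB
tokens, requests = kv_budget(40 * 10**9, per_tok, max_seq_len=4096)
# ~122k cacheable tokens, ~30 full-length concurrent requests
```

Swapping `precision_bytes=2` for `1` (INT8 KV cache) doubles both numbers, which is the quantization lever the callout below this section describes.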

ℹ️ The KV Cache is the Real Bottleneck

For large models at high concurrency, KV cache memory — not compute — limits your throughput. This is why quantizing the KV cache to INT8 or even INT4 (with minimal quality loss) is becoming standard practice. TensorRT-LLM supports KV cache quantization separately from weight quantization, allowing FP8 weights with INT8 KV cache.

Practical Deployment Recommendations

Choosing Your Configuration

The decision tree for TensorRT-LLM deployment:

  1. What GPU do you have? H100/H200: use FP8. A100: use INT8 SmoothQuant or INT4 AWQ. Older GPUs: use FP16 with INT8 weight-only quantization.

  2. What is your latency budget? If sub-100ms TTFT matters, use tensor parallelism across the minimum number of GPUs needed. If throughput is all that matters, maximize batch size before adding GPUs.

  3. What is your sequence length distribution? If most requests are short (fewer than 512 tokens), use smaller page sizes in the paged KV cache. If you serve long-context workloads (greater than 8K tokens), ensure you have enough memory headroom and consider KV cache quantization.

  4. How often does your model change? If rarely (monthly), TensorRT-LLM is ideal. If daily, the build overhead may be prohibitive — consider vLLM for development and TensorRT-LLM for production, with a build pipeline that produces engines overnight.
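The page-size tradeoff in step 3 comes down to internal fragmentation: each sequence strands at most one partially filled page, so small pages waste less KV memory on short requests. A quick way to quantify it (illustrative helper, not a TensorRT-LLM API):

```python
def last_page_waste(seq_len, page_size):
    """KV token slots stranded in a sequence's last, partially filled page."""
    remainder = seq_len % page_size
    return 0 if remainder == 0 else page_size - remainder
```

For a 300-token sequence, 16-token pages strand 4 token slots while 128-token pages strand 84 — a difference that compounds across thousands of concurrent short requests, but matters far less for 8K+ contexts.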

📊

Recommended Configurations by Use Case

| Use Case | Framework | Precision | Parallelism | Rationale |
|---|---|---|---|---|
| Chatbot (latency-sensitive) | TRT-LLM | FP8 | TP across min GPUs | Lowest TTFT |
| Batch processing | TRT-LLM | FP8 or INT4 | Max batch size | Highest throughput |
| Multi-model serving | vLLM | AWQ INT4 | Per-model TP | LoRA swap flexibility |
| Structured output (JSON) | SGLang | FP16/INT4 | TP | Constrained decoding |
| Research/prototyping | vLLM | FP16 | Single GPU | Fast iteration |
| Edge deployment | TRT-LLM | INT4 | Single GPU | Minimum memory |

Monitoring and Troubleshooting

Key metrics to track in a TensorRT-LLM deployment:

  • Time to first token (TTFT): Dominated by prefill time. If this spikes, check for long input sequences or insufficient tensor parallelism.
  • Inter-token latency (ITL): Should be stable across the generation. If it varies, check for KV cache memory pressure causing page swapping.
  • KV cache utilization: Monitor the fraction of allocated KV cache pages in use. If consistently above 90%, you are at risk of request rejection.
  • Batch size utilization: The actual number of requests being processed per iteration. Low utilization means your request rate is too low to fill the batch, or your scheduling is suboptimal.
  • GPU SM occupancy: Should be above 80% during decode steps. Low occupancy suggests the batch size is too small to saturate the GPU.
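TTFT and ITL in the list above fall straight out of per-token arrival timestamps; a minimal derivation (a hypothetical helper, not a TensorRT-LLM metrics API):

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT and per-token ITL from arrival timestamps.

    request_start: time the request was accepted
    token_times:   monotonically increasing arrival time of each token
    """
    ttft = token_times[0] - request_start        # prefill-dominated
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itl
```

In a healthy deployment the ITL list is nearly flat; spikes in its variance are the first visible symptom of KV cache memory pressure or scheduler churn.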

Conclusion

TensorRT-LLM sits at one end of the inference optimization spectrum: maximum performance, maximum complexity, minimum flexibility. Its graph optimization pipeline — layer fusion, kernel auto-tuning, precision calibration — genuinely produces faster inference than any open-source alternative on NVIDIA hardware. The FP8 workflow on Hopper GPUs delivers near-FP16 quality at nearly double the throughput, making it the default choice for latency-sensitive production deployments.

But “fastest” is not always “best.” The engine build process is slow and brittle, model coverage lags the open-source community, and operational complexity is higher. For many teams, vLLM’s simplicity and rapid model support make it the better overall choice, even at 20-30% lower throughput.

The practical recommendation: use vLLM for development and experimentation, build TensorRT-LLM engines for production models that are stable, and consider SGLang when structured generation is central to your application. Monitor the ecosystem closely — the performance gap is narrowing as torch.compile, Triton kernels, and other open-source optimization efforts mature.