vLLM and SGLang are open-source, Python-native, and easy to modify. TensorRT-LLM is closed-source, C++-based, and requires a week to understand the build system. Yet TensorRT-LLM consistently delivers 1.3-2x higher throughput on NVIDIA GPUs. The reason: TensorRT fundamentally restructures your computation graph through layer fusion, kernel auto-tuning, and precision calibration that PyTorch cannot match. When you compile a model with TensorRT, it fuses LayerNorm + Linear into a single kernel launch, tunes every GEMM for your specific GPU architecture, and calibrates FP8 quantization per layer. This optimization tax is paid once at compile time, then every inference request benefits. The tradeoff: zero flexibility, NVIDIA-only, and debugging is nearly impossible. This post covers TensorRT’s optimization pipeline, TensorRT-LLM’s serving features, when it wins over open-source alternatives, and the operational cost of vendor lock-in.
How TensorRT Optimizes Computation Graphs
At its core, TensorRT is a compiler. It takes a computation graph (typically imported from ONNX or defined through the TensorRT API) and produces an optimized engine — a binary blob of fused CUDA kernels tuned for a specific GPU, specific input shapes, and specific numerical precision. The optimization pipeline has several distinct phases.
Phase 1: Graph-Level Transformations
Before any kernel selection happens, TensorRT performs algebraic simplifications and structural transformations on the graph:
Constant folding eliminates operations whose outputs can be computed at build time. If a layer normalization has static scale and bias parameters, and those flow into a subsequent linear layer with known weights, TensorRT can precompute the combined transformation.
Dead layer elimination removes nodes whose outputs are never consumed by any downstream operation. This sounds trivial, but it matters when importing models from training frameworks that leave behind dropout layers, auxiliary loss heads, or debugging outputs.
Transpose and reshape propagation pushes data layout transformations through the graph to minimize the number of actual memory reformatting operations. If a transpose feeds into a GEMM that expects the transposed layout anyway, the transpose node can be eliminated entirely.
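As a concrete, framework-agnostic illustration of constant folding, the sketch below folds a static per-channel scale into the downstream weight matrix at build time. The shapes and names are illustrative, not TensorRT internals:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)

# A static per-channel scale (e.g. a normalization gain with fixed
# parameters) flowing into a linear layer with known weights:
scale = rng.standard_normal(8).astype(np.float32)
W = rng.standard_normal((8, 16)).astype(np.float32)

# Unfolded: two ops at runtime, one intermediate tensor
y_unfolded = (x * scale) @ W

# Constant-folded: precompute diag(scale) @ W once at build time,
# leaving a single GEMM at runtime
W_folded = scale[:, None] * W
y_folded = x @ W_folded

assert np.allclose(y_unfolded, y_folded, atol=1e-4)
```

The fold is exact because element-wise scaling commutes into the matrix product; the runtime graph loses one kernel and one intermediate write.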
Phase 2: Layer Fusion
Layer fusion is where TensorRT’s real performance gains begin. The optimizer identifies patterns of adjacent operations that can be executed by a single, hand-optimized CUDA kernel instead of multiple separate kernel launches. Each separate kernel launch incurs overhead: the CPU must issue the launch, the GPU must schedule it, and intermediate results must be written to and read from global memory. Fusion eliminates all of this.
The key fusion patterns for transformer models include:
GEMM + Bias + Activation fusion. A linear layer followed by a bias addition followed by a GeLU or SiLU activation becomes a single kernel. The bias addition and activation are computed inline as the GEMM writes its output, avoiding two additional global memory round-trips.
Multi-Head Attention fusion. The Q, K, V projections (three separate GEMMs), the scaled dot-product attention, and the output projection can be fused into a single Flash Attention-style kernel. TensorRT-LLM ships with highly optimized fused MHA/MQA/GQA kernels for various head dimensions and sequence lengths.
LayerNorm + Linear fusion. The normalization and subsequent projection are fused so that the normalized values never materialize in global memory.
Residual connection fusion. The element-wise addition of a residual connection is folded into the preceding operation’s output write.
# Unfused execution (simplified pseudocode showing memory round-trips):
#
# x = LayerNorm(input) # Read input, write x to GMEM
# q = Linear_Q(x) # Read x, write q to GMEM
# k = Linear_K(x) # Read x, write k to GMEM
# v = Linear_V(x) # Read x, write v to GMEM
# attn = ScaledDotProduct(q, k, v) # Read q, k, v, write attn to GMEM
# out = Linear_O(attn) # Read attn, write out to GMEM
# result = out + input # Read out, read input, write result to GMEM
#
# Fused execution:
#
# result = FusedAttentionBlock(input) # Read input once, write result once
# Internal computation stays in registers/shared memory
Modern GPUs have compute throughput that far exceeds their memory bandwidth. An A100 delivers 312 TFLOPS of FP16 compute but only 2 TB/s of HBM bandwidth. For a typical transformer layer, the unfused execution is memory-bandwidth-bound — most of the time is spent moving data between global memory and the compute units. Fusion keeps intermediate values in registers and shared memory, shifting the bottleneck from memory to compute where GPUs excel.
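A quick back-of-envelope check of that claim, using the A100 numbers above (illustrative arithmetic, not profiler output):

```python
# Roofline check with the A100 figures from the text: 312 TFLOPS FP16,
# ~2 TB/s HBM. The "ridge point" is the arithmetic intensity (FLOPs per
# byte) below which a kernel is memory-bandwidth-bound.
peak_flops = 312e12
peak_bw = 2e12
ridge = peak_flops / peak_bw  # 156 FLOPs/byte

# An unfused FP16 residual add: 1 FLOP per element, 6 bytes of traffic
# (read two operands, write one result, 2 bytes each).
residual_intensity = 1 / 6

# A large GEMM (M=N=K=4096): 2*M*N*K FLOPs vs three FP16 matrices of traffic.
M = N = K = 4096
gemm_intensity = (2 * M * N * K) / (2 * (M * K + K * N + M * N))

assert residual_intensity < ridge < gemm_intensity
print(f"ridge={ridge:.0f}, residual={residual_intensity:.2f}, gemm={gemm_intensity:.0f}")
```

The unfused element-wise ops sit two to three orders of magnitude below the ridge point, which is exactly why fusing them into the adjacent GEMM's epilogue is nearly free.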
Phase 3: Kernel Auto-Tuning
After fusion determines what to compute, kernel auto-tuning determines how. For each fused operation, TensorRT has a library of implementation variants — different tile sizes, different numbers of warps, different use of shared memory, different instruction sequences. During the build phase, TensorRT benchmarks each variant on the actual target GPU with the actual input dimensions and selects the fastest one.
This is why TensorRT engines are not portable across GPU architectures. An engine built on an A100 will not run on an H100, and an engine built for sequence length 2048 may not be optimal for sequence length 512. The auto-tuning is also why the build phase can take minutes to hours for large models: TensorRT is literally running thousands of micro-benchmarks.
For GEMM operations specifically, TensorRT evaluates:
- Tile dimensions: How the output matrix is partitioned across thread blocks (e.g., 128x128, 64x256, 256x64)
- Warp-level GEMM shape: The tile size each warp processes (e.g., 16x16x16 for Tensor Core wmma instructions)
- Pipeline depth: How many stages of double-buffered shared memory loads to overlap with computation
- Epilogue fusion: Whether the bias, activation, and residual addition are fused into the GEMM epilogue or handled separately
Kernel Auto-Tuning Impact: GEMM Variants on H100 (FP16, M=4096, N=4096, K=4096)
| Variant | Tile Size | Pipeline Stages | Latency (us) | TFLOPS |
|---|---|---|---|---|
| Naive | 64x64 | 1 | 342 | 401 |
| Good | 128x128 | 3 | 198 | 693 |
| Better | 128x256 | 4 | 156 | 880 |
| Auto-tuned best | 256x128 | 5 | 131 | 1048 |
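The selection logic behind the table can be sketched as a profile-and-pick loop. This is a toy stand-in: the variants here are arbitrary numpy callables, whereas TensorRT profiles real CUDA kernel implementations on the target GPU:

```python
import time
import numpy as np

def benchmark(fn, warmup=2, iters=5):
    # Median-of-iters timing, analogous in spirit to TensorRT's
    # per-variant profiling during the build phase.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

# Hypothetical "kernel variants" standing in for tile/stage choices.
a = np.random.rand(256, 256).astype(np.float32)
variants = {
    "tile_64x64": lambda: a @ a,
    "tile_128x128": lambda: (a @ a).copy(),
}
best = min(variants, key=lambda name: benchmark(variants[name]))
print("selected:", best)
```

The key point is that selection is empirical, not analytical: the winner is whatever actually ran fastest on this hardware with these shapes, which is also why the result does not transfer across GPUs.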
Phase 4: Precision Calibration
TensorRT supports FP32, FP16, BF16, FP8 (on Hopper and later), and INT8 execution. Lower precision reduces memory bandwidth requirements (the dominant bottleneck) and increases Tensor Core throughput. But lower precision introduces quantization error that can degrade model quality.
FP16/BF16 is essentially free for transformer models. The accuracy loss is negligible, and relative to FP32 you get a 2x reduction in memory traffic and a 2x increase in Tensor Core throughput. This is the baseline for any serious deployment.
INT8 requires calibration. TensorRT runs the model on a representative calibration dataset in FP32, collects the activation distribution at each layer, and determines optimal per-tensor or per-channel scale factors that minimize quantization error. The calibration strategies include:
- MinMax calibration: Uses the observed min/max of activations to set the quantization range. Simple but sensitive to outliers.
- Entropy calibration (KL divergence): Finds the quantization range that minimizes the information loss between the FP32 and INT8 distributions. More robust than MinMax.
- Percentile calibration: Clips the top/bottom N% of the activation distribution before setting the range, providing outlier robustness.
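A minimal numpy sketch of why MinMax is outlier-sensitive and percentile calibration is not (synthetic activations with hand-planted outliers, scales computed for the symmetric INT8 range):

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.standard_normal(10_000).astype(np.float32)
acts[:5] = [50.0, -80.0, 65.0, 55.0, -70.0]  # a few outlier activations

def minmax_scale(x, qmax=127):
    # MinMax: the range is driven entirely by the extremes
    return np.abs(x).max() / qmax

def percentile_scale(x, pct=99.9, qmax=127):
    # Percentile: clip the tail before setting the range
    return np.percentile(np.abs(x), pct) / qmax

s_mm = minmax_scale(acts)
s_pct = percentile_scale(acts)

# The outliers inflate the MinMax scale, wasting INT8 resolution on
# the bulk of the distribution; percentile calibration ignores them.
assert s_mm > s_pct > 0
print(f"minmax scale={s_mm:.4f}, percentile scale={s_pct:.4f}")
```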
FP8 (E4M3 and E5M2 formats) on Hopper GPUs provides a sweet spot between FP16 and INT8: nearly FP16 accuracy with INT8-class throughput. We cover the FP8 workflow in detail later in this post.
Precision vs Throughput vs Quality (Llama-2 70B, single H100, batch=1, output 128 tokens)
| Precision | Tokens/s | Memory (GB) | MMLU Score | Relative Quality |
|---|---|---|---|---|
| FP16 | 38 | 140 | 68.9 | Baseline |
| FP8 (E4M3) | 72 | 75 | 68.7 | -0.3% |
| INT8 (SmoothQuant) | 65 | 75 | 68.2 | -1.0% |
| INT4 (AWQ) | 95 | 42 | 67.1 | -2.6% |
| INT4 (GPTQ) | 93 | 42 | 66.8 | -3.0% |
TensorRT-LLM Architecture
TensorRT-LLM is not simply TensorRT applied to LLMs. It is a separate library built on top of TensorRT that adds the runtime machinery needed for autoregressive text generation. The architecture has three major components.
The Model Definition Layer
TensorRT-LLM provides a Python API for defining transformer architectures using TensorRT primitives. This API looks superficially like PyTorch but produces a TensorRT network graph instead of executing eagerly:
import tensorrt_llm
from tensorrt_llm.layers import (
    Attention, ColumnLinear, GatedMLP,
    RmsNorm, Embedding
)

class LlamaDecoderLayer(tensorrt_llm.Module):
    def __init__(self, config):
        super().__init__()
        self.input_layernorm = RmsNorm(
            normalized_shape=config.hidden_size,
            eps=config.rms_norm_eps
        )
        self.attention = Attention(
            hidden_size=config.hidden_size,
            num_attention_heads=config.num_attention_heads,
            num_kv_heads=config.num_key_value_heads,
            max_position_embeddings=config.max_position_embeddings,
            attention_type='gqa'  # Grouped Query Attention
        )
        self.post_attention_layernorm = RmsNorm(
            normalized_shape=config.hidden_size,
            eps=config.rms_norm_eps
        )
        self.mlp = GatedMLP(
            hidden_size=config.hidden_size,
            ffn_hidden_size=config.intermediate_size,
            hidden_act='silu'
        )

    def forward(self, hidden_states, attention_mask, position_ids, kv_cache):
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.attention(
            hidden_states, attention_mask, position_ids, kv_cache
        )
        hidden_states = residual + hidden_states

        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states
        return hidden_states
When you call build(), this Python description is traced into a TensorRT graph, optimized through the fusion/autotuning pipeline described above, and serialized into an engine file.
The Runtime: In-Flight Batching and Paged KV Cache
The TensorRT-LLM runtime (implemented in C++) handles the dynamic aspects of LLM serving that a static TensorRT engine cannot:
In-flight batching (also called continuous batching or iteration-level scheduling) allows new requests to join a running batch without waiting for all current requests to finish. In autoregressive generation, different requests finish at different times. Without in-flight batching, the GPU sits idle for the fastest requests while waiting for the slowest. With it, new requests are immediately inserted into the freed slots.
Paged KV cache borrows the virtual memory concept from operating systems. Instead of pre-allocating a contiguous KV cache for the maximum sequence length, memory is allocated in small fixed-size pages (typically 64-128 tokens each). Pages are allocated on demand as the sequence grows and can be non-contiguous in physical GPU memory. This dramatically reduces memory waste from over-allocation and enables serving more concurrent requests.
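The bookkeeping can be sketched in a few lines of Python. This is a toy page table only; real TensorRT-LLM pages hold K/V tensors, and the page size and pool size here are illustrative, not defaults:

```python
PAGE_TOKENS = 64  # tokens per page (illustrative)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # physical page pool
        self.tables = {}                    # request id -> list of page ids

    def append_token(self, req_id, pos):
        table = self.tables.setdefault(req_id, [])
        if pos % PAGE_TOKENS == 0:          # crossed a page boundary
            table.append(self.free.pop())   # allocate on demand
        return table[pos // PAGE_TOKENS]    # physical page for this token

    def release(self, req_id):
        # Pages return to the pool immediately when a request finishes
        self.free.extend(self.tables.pop(req_id))

cache = PagedKVCache(num_pages=16)
for pos in range(130):                      # 130 tokens -> 3 pages, not 16
    cache.append_token("req-0", pos)
assert len(cache.tables["req-0"]) == 3
cache.release("req-0")
assert len(cache.free) == 16
```

Note how a 130-token sequence consumes 3 pages rather than a max-length pre-allocation, and how the pages need not be contiguous: the table is an indirection layer, exactly like a virtual memory page table.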
Chunked prefill breaks long input prompts into chunks and interleaves their processing with decode steps from other requests, preventing long prompts from creating latency spikes for concurrent short requests.
Multi-GPU Support: Tensor Parallelism and Pipeline Parallelism
For models too large to fit on a single GPU, TensorRT-LLM supports:
Tensor parallelism splits individual layers across GPUs. Each attention head group and each MLP column is assigned to a different GPU. This requires all-reduce communication after every attention and MLP layer but keeps latency low because every GPU works on every token.
Pipeline parallelism assigns different layers to different GPUs. GPU 0 processes layers 0-19, GPU 1 processes layers 20-39, and so on. This requires less communication (only at pipeline stage boundaries) but introduces pipeline bubbles and is primarily useful for very large models where tensor parallelism alone is insufficient.
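The two layer-splitting strategies behind tensor parallelism can be verified numerically. A numpy sketch with toy shapes, where concatenation and summation stand in for the all-gather and all-reduce collectives:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8)).astype(np.float32)
W = rng.standard_normal((8, 16)).astype(np.float32)

# Column parallelism: each "GPU" owns half the output columns.
y0 = x @ W[:, :8]
y1 = x @ W[:, 8:]
y_col = np.concatenate([y0, y1], axis=1)   # stands in for all-gather

# Row parallelism: each "GPU" owns half the input features; the
# partial products are summed (stands in for all-reduce).
y_row = x[:, :4] @ W[:4, :] + x[:, 4:] @ W[4:, :]

assert np.allclose(y_col, x @ W, atol=1e-4)
assert np.allclose(y_row, x @ W, atol=1e-4)
```

In a transformer block the two are composed: the first MLP projection is column-split (no communication needed), the second is row-split, so only one all-reduce is paid per MLP, and likewise one per attention block.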
TensorRT-LLM Multi-GPU Scaling (Llama-2 70B, H100 NVLink, batch=64, output 128 tokens)
| GPUs | Parallelism | Tokens/s | Latency P50 (ms) | Efficiency |
|---|---|---|---|---|
| 1 (FP8) | None | 2,100 | 3,900 | 100% |
| 2 (FP8) | TP=2 | 3,800 | 2,150 | 90% |
| 4 (FP8) | TP=4 | 6,900 | 1,180 | 82% |
| 8 (FP8) | TP=8 | 11,200 | 730 | 67% |
| 8 (FP8) | TP=4, PP=2 | 10,500 | 780 | 63% |
The FP8 Inference Workflow
FP8 on Hopper (H100/H200) GPUs is the current sweet spot for LLM inference. The E4M3 format (4 exponent bits, 3 mantissa bits) provides enough dynamic range and precision for most transformer weights and activations while doubling Tensor Core throughput compared to FP16.
Step 1: Quantize the Model
There are two approaches:
Post-Training Quantization (PTQ) quantizes a pre-trained FP16 model by calibrating scale factors on a representative dataset. NVIDIA’s ammo (now modelopt) toolkit handles this:
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint
from transformers import AutoModelForCausalLM

# Load your HuggingFace model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")

# Define FP8 quantization config
quant_config = mtq.FP8_DEFAULT_CFG
# This applies per-tensor FP8 quantization to:
#   - Linear layer weights (E4M3)
#   - Linear layer activations (E4M3)
#   - Attention BMM inputs (E4M3)

# Calibrate on representative data
def calibrate_loop(model):
    for batch in calibration_dataloader:
        model(batch["input_ids"].cuda())

mtq.quantize(model, quant_config, forward_loop=calibrate_loop)

# Export quantized checkpoint for TensorRT-LLM
export_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",
    dtype="bfloat16",  # Non-quantized layers stay in BF16
    export_dir="./llama-70b-fp8-ckpt",
    inference_tensor_parallel=4
)
Quantization-Aware Training (QAT) inserts fake-quantization nodes during fine-tuning so the model learns to be robust to FP8 rounding. This produces slightly better quality than PTQ but requires a training run.
Step 2: Build the TensorRT-LLM Engine
# Build the engine from the quantized checkpoint
trtllm-build \
--checkpoint_dir ./llama-70b-fp8-ckpt \
--output_dir ./llama-70b-fp8-engine \
--gemm_plugin fp8 \
--gpt_attention_plugin float16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_seq_len 4096 \
--tp_size 4 \
--workers 4 \
--use_paged_context_fmha enable \
--multiple_profiles enable
The --gemm_plugin fp8 flag tells TensorRT to use FP8 Tensor Core kernels for all GEMM operations. The --gpt_attention_plugin float16 keeps the attention softmax in FP16 to avoid the limited dynamic range of FP8 for probability distributions.
Step 3: Serve with the Triton Inference Server
# In production, TensorRT-LLM engines are typically deployed
# behind Triton Inference Server with the TRT-LLM backend:
#
# model_repository/
# llama-70b/
# config.pbtxt
# 1/
# # Engine files placed here
#
# Triton handles HTTP/gRPC endpoints, request queuing,
# and the in-flight batching scheduler.
For FP8 PTQ, the calibration dataset matters more than you might expect. Use 512-1024 samples that are representative of your production traffic. Calibration with only English text and then serving multilingual requests can cause quality degradation in non-English languages. Also, calibrate with the sequence lengths you will actually serve — the activation distributions shift with sequence length.
TensorRT-LLM vs vLLM vs SGLang
The choice between TensorRT-LLM and the open-source alternatives is not straightforward. Each has genuine strengths.
vLLM
vLLM pioneered PagedAttention and made continuous batching accessible to the open-source community. Its strengths:
- Broad model support: Community-driven model implementations cover nearly every architecture on HuggingFace, including new models within days of release
- Simple deployment: pip install vllm && python -m vllm.entrypoints.openai.api_server gets you running in minutes
- Flexibility: Custom sampling strategies, speculative decoding, LoRA serving, and prefix caching are all first-class features
- Active community: Bugs get fixed quickly, new features appear rapidly
Its limitations relative to TensorRT-LLM: vLLM’s CUDA kernels are good but not as aggressively optimized. vLLM relies on torch.compile and manually written Triton kernels rather than TensorRT’s exhaustive auto-tuning. For a given model on a given GPU, TensorRT-LLM typically produces 15-40% higher throughput.
SGLang
SGLang focuses on structured generation and complex LLM programs (chains of calls, branching logic, constrained decoding). Its strengths:
- RadixAttention: Efficient prefix sharing across requests, which is particularly valuable for chat applications and few-shot prompting where many requests share the same system prompt
- Constrained decoding: Native support for JSON schema and regex-constrained generation with minimal overhead via a compressed finite state machine
- Frontend language: A Python DSL for expressing multi-turn LLM programs that enables automatic optimization of the execution plan
SGLang’s raw throughput on simple generation tasks is competitive with vLLM but generally behind TensorRT-LLM.
When TensorRT-LLM Wins
Framework Comparison: Llama-2 70B, 4xH100 NVLink, Various Workloads
| Workload | TRT-LLM (tok/s) | vLLM (tok/s) | SGLang (tok/s) | TRT-LLM Advantage |
|---|---|---|---|---|
| Batch=1, latency-optimized | 68 | 52 | 54 | +31% |
| Batch=64, throughput-optimized | 6,900 | 5,100 | 5,300 | +35% |
| Batch=256, throughput-saturated | 12,400 | 10,800 | 11,100 | +15% |
| Long context (32k tokens) | 1,200 | 1,050 | 1,100 | +14% |
| Shared prefix (90% overlap) | 8,100 | 6,800 | 9,200 | -12% vs SGLang |
TensorRT-LLM is the right choice when:
- Latency is the primary constraint. If you need the absolute lowest time-to-first-token and time-per-output-token for a single model, TensorRT-LLM’s kernel auto-tuning and aggressive fusion produce measurably faster results. Real-time applications — voice assistants, interactive coding tools, trading systems — benefit directly.
- You are deploying a single, stable model. TensorRT engines are built for a specific model with specific maximum shapes. If your model changes rarely and your input distribution is predictable, the upfront build cost is amortized over millions of requests.
- You are on NVIDIA hardware and want maximum utilization. TensorRT is tuned specifically for each NVIDIA GPU generation. On H100s with FP8, the throughput advantage over vLLM can exceed 35%.
- You need production-grade reliability at scale. TensorRT-LLM plus Triton Inference Server is the stack NVIDIA supports for enterprise deployments. If you are operating hundreds of GPUs, the monitoring, health checks, and operational tooling around Triton matter.
When TensorRT-LLM Loses
- Rapid model iteration. Building a TensorRT-LLM engine for a 70B model takes 30-60 minutes. If you are experimenting with different models, different quantizations, or different LoRA adapters daily, the build overhead is painful. vLLM loads a HuggingFace checkpoint in minutes.
- Multi-model and multi-LoRA serving. If you need to serve dozens of fine-tuned variants of the same base model, vLLM’s LoRA support lets you swap adapters at request time with a single base model in memory. TensorRT-LLM requires a separate engine for each LoRA (though NVIDIA is working on runtime LoRA support).
- Cutting-edge model architectures. When a new architecture drops (Mamba, RWKV, a novel attention pattern), vLLM and SGLang support it within days thanks to community contributions. TensorRT-LLM support may lag by weeks or months because adding a new architecture requires writing optimized C++ kernels.
- Structured generation workflows. If your application requires heavy constrained decoding, multi-step LLM programs, or complex prefix sharing patterns, SGLang’s RadixAttention and constrained decoding engine will outperform TensorRT-LLM.
- Non-NVIDIA hardware. TensorRT is NVIDIA-only. If you need to target AMD, Intel, or cloud TPUs, vLLM (with ROCm support) or other frameworks are your only option.
TensorRT-LLM’s build process is not just slow — it is also fragile. The engine is tied to a specific TensorRT version, a specific CUDA toolkit version, and specific maximum input/output lengths. Upgrading TensorRT or changing your max sequence length requires a full rebuild. Plan your CI/CD pipeline accordingly.
Deep Dive: How Layer Fusion Works for Transformers
To understand the performance difference between TensorRT-LLM and framework-based solutions, it helps to trace a single transformer layer’s execution in detail.
The Unfused Baseline
In a standard PyTorch execution of a Llama decoder layer:
- RMSNorm: Read hidden_states from HBM (8192 * batch * 2 bytes for FP16), compute variance, normalize, write back. Two global memory passes.
- Q projection: Read normalized hidden_states + weights, compute GEMM, write Q to HBM.
- K projection: Read normalized hidden_states + weights, compute GEMM, write K to HBM.
- V projection: Read normalized hidden_states + weights, compute GEMM, write V to HBM.
- RoPE: Read Q and K from HBM, apply rotary positional embeddings, write back.
- KV cache update: Read K, V, write to cache.
- Attention: Read Q, K (from cache), compute QK^T, softmax, read V, compute output, write to HBM.
- Output projection: Read attention output + weights, GEMM, write to HBM.
- Residual add: Read output projection result + original input, add, write to HBM.
- Second RMSNorm: Read, normalize, write.
- Gate projection: Read + GEMM + write.
- Up projection: Read + GEMM + write.
- SiLU + element-wise multiply: Read gate output + up output, compute, write.
- Down projection: Read + GEMM + write.
- Second residual add: Read + add + write.
That is 15 separate kernel launches and roughly 30 global memory reads/writes of the hidden state tensor.
The Fused TensorRT-LLM Execution
After TensorRT’s optimization:
- Fused RMSNorm + QKV GEMM + RoPE: A single kernel reads hidden_states, normalizes in registers, computes the three projections with a single wide GEMM (fused Q/K/V), applies RoPE to Q and K, and writes Q, K, V directly to the appropriate memory locations (K and V go directly to the paged cache).
- Fused Flash Attention + Output Projection: A single kernel reads Q from registers/shared memory, streams K/V from the paged cache, computes attention with online softmax (never materializing the full attention matrix), applies the output projection, and writes the result.
- Fused Residual + RMSNorm + GatedMLP + Residual: A single kernel reads the attention output, adds the first residual, normalizes, computes gate and up projections with fused GEMM, applies SiLU, computes the element-wise multiply, computes the down projection, adds the second residual, and writes the final output.
That is 3 kernel launches and roughly 4-6 global memory round-trips instead of 30.
[Chart: Memory Bandwidth Utilization, Unfused vs Fused Transformer Layer (H100, Llama-70B, batch=1), effective bandwidth in GB/s]

Precision Calibration in Practice
The INT8 SmoothQuant Workflow
SmoothQuant addresses the challenge of quantizing transformer activations, which often have outlier channels with much larger magnitudes than the rest. Instead of naively quantizing and clipping these outliers, SmoothQuant migrates the quantization difficulty from activations to weights by applying a mathematically equivalent per-channel scaling:
Given Y = XW, SmoothQuant transforms this to Y = (X · diag(s)⁻¹) · (diag(s) · W), where s is a per-channel smoothing factor, typically sⱼ = max|Xⱼ|^α / max|Wⱼ|^(1−α) with α ≈ 0.5. The smoothed activations have a more uniform distribution that quantizes cleanly to INT8, and the smoothed weights absorb the scaling offline.
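The per-channel scaling trick is easy to check numerically. A toy numpy sketch with alpha = 0.5 and illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
X[:, 3] = [30.0, -45.0, 12.0, 60.0]   # an outlier activation channel
W = rng.standard_normal((8, 16))

alpha = 0.5
# Per-channel smoothing factor: migrate difficulty from X to W
s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))

X_s = X / s            # smoothed activations (applied at runtime)
W_s = s[:, None] * W   # smoothed weights (absorbed offline)

assert np.allclose(X_s @ W_s, X @ W)        # mathematically equivalent
assert np.abs(X_s).max() < np.abs(X).max()  # outlier channel is tamed
```

The product is unchanged, but the activation tensor that actually gets quantized no longer has one channel dominating the range.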
The FP8 Calibration Nuances
FP8 E4M3 has a dynamic range of approximately ±448 with a minimum subnormal of about 0.002 (2⁻⁹). This is narrower than FP16 (±65,504) but wider than INT8 (−128 to 127 before scaling). The key calibration decisions:
Per-tensor vs per-channel scaling: Per-tensor scaling applies a single scale factor to the entire weight matrix or activation tensor. Per-channel scaling applies different factors per output channel. Per-channel is more accurate but requires the GEMM kernel to handle non-uniform scaling, which adds overhead. TensorRT-LLM supports both; per-tensor is the default and sufficient for most models.
Static vs dynamic scaling for activations: Static scaling uses a fixed scale factor determined during calibration. Dynamic scaling computes the scale factor at runtime from the actual activation values. Dynamic is more robust to distribution shifts but requires an extra reduction kernel per GEMM. For LLM inference where the activation distributions are reasonably stable, static scaling is preferred.
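A small sketch of the difference (numpy, synthetic data; 448 is the largest finite E4M3 value):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def static_scale(calib_batches):
    # Calibration-time: one fixed scale from the observed amax
    amax = max(np.abs(b).max() for b in calib_batches)
    return amax / FP8_E4M3_MAX

def dynamic_scale(batch):
    # Runtime: recomputed per tensor, robust to distribution shift
    return np.abs(batch).max() / FP8_E4M3_MAX

rng = np.random.default_rng(0)
calib = [rng.standard_normal(1024) for _ in range(8)]
s = static_scale(calib)

# A runtime distribution shift the calibration set never saw:
shifted = 10.0 * rng.standard_normal(1024)
# The static scale now maps values beyond the representable range (they clip):
assert np.abs(shifted / s).max() > FP8_E4M3_MAX
# The dynamic scale adapts and stays in range:
assert np.abs(shifted / dynamic_scale(shifted)).max() <= FP8_E4M3_MAX + 1e-6
```

This is the trade the text describes: dynamic scaling buys robustness at the cost of an extra amax reduction before every GEMM.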
Which layers to quantize: Not all layers benefit equally from FP8. The first and last layers of the model, and the attention softmax, are often kept in higher precision. TensorRT-LLM’s default FP8 config quantizes all linear layers and attention BMMs while keeping softmax, layer norms, and the final LM head in FP16/BF16.
FP8 Calibration Strategy Impact (Llama-2 13B, H100)
| Strategy | Calibration Time | Tokens/s | MMLU Delta | Notes |
|---|---|---|---|---|
| FP16 baseline | N/A | 185 | 0.0% | Reference |
| FP8, per-tensor static | 15 min | 345 | -0.2% | Recommended default |
| FP8, per-channel static | 15 min | 330 | -0.1% | Slight throughput cost |
| FP8, per-tensor dynamic | N/A | 320 | -0.1% | No calibration needed |
| FP8, mixed (keep LM head FP16) | 15 min | 340 | -0.1% | Best quality/speed |
Advanced Topics
Custom Plugin Development
When TensorRT’s built-in fusion patterns do not cover a novel operation — perhaps a new attention variant, a custom normalization scheme, or a routing mechanism for mixture-of-experts — you need to write a TensorRT plugin. This is a C++ class that implements the IPluginV2DynamicExt interface:
class MoERoutingPlugin : public nvinfer1::IPluginV2DynamicExt {
public:
    // Called during build to specify output shapes
    nvinfer1::DimsExprs getOutputDimensions(
        int32_t outputIndex,
        const nvinfer1::DimsExprs* inputs,
        int32_t nbInputs,
        nvinfer1::IExprBuilder& exprBuilder
    ) noexcept override;

    // Called during build to configure the plugin
    void configurePlugin(
        const nvinfer1::DynamicPluginTensorDesc* in,
        int32_t nbInputs,
        const nvinfer1::DynamicPluginTensorDesc* out,
        int32_t nbOutputs
    ) noexcept override;

    // The actual GPU kernel dispatch
    int32_t enqueue(
        const nvinfer1::PluginTensorDesc* inputDesc,
        const nvinfer1::PluginTensorDesc* outputDesc,
        const void* const* inputs,
        void* const* outputs,
        void* workspace,
        cudaStream_t stream
    ) noexcept override;

    // Workspace memory requirements
    size_t getWorkspaceSize(
        const nvinfer1::PluginTensorDesc* inputs,
        int32_t nbInputs,
        const nvinfer1::PluginTensorDesc* outputs,
        int32_t nbOutputs
    ) const noexcept override;
};
The enqueue method is where you launch your custom CUDA kernel. The challenge is that your plugin participates in TensorRT’s optimization pipeline: TensorRT may choose to run your plugin in FP16 or FP8 (if you report support), and it will profile your plugin alongside its built-in kernels. Your plugin needs to be competitive, or TensorRT will work around it.
Speculative Decoding in TensorRT-LLM
Speculative decoding uses a small “draft” model to generate multiple candidate tokens, then verifies them in parallel with the large “target” model. If the draft model guesses correctly (which it does 60-80% of the time for typical text), you get multiple tokens for the cost of one target model forward pass.
TensorRT-LLM supports speculative decoding with both separate draft models and self-speculative decoding (using the target model’s early layers as the draft). The key optimization is that the verification step is a single forward pass with a batch dimension equal to the speculation length, which amortizes the memory bandwidth cost of loading the model weights.
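The accept/verify loop can be sketched with toy stand-in models. Note this is greedy-matching pseudologic, not TensorRT-LLM's API, and the stand-ins score one token per call; a real implementation verifies all draft positions in a single batched target forward pass:

```python
def speculative_step(prefix, draft, target, k=5):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. Target model checks each proposed position; accept until
    #    the first disagreement (greedy acceptance rule).
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # 3. Always emit one target token at the mismatch point (or after
    #    all k hits), so every step yields between 1 and k+1 tokens.
    accepted.append(target(ctx))
    return accepted

# Hypothetical stand-ins: the draft agrees with the target except at
# every 4th position.
draft = lambda ctx: len(ctx) % 3
target = lambda ctx: len(ctx) % 3 if len(ctx) % 4 else (len(ctx) % 3) + 1
out = speculative_step([0], draft, target, k=5)
assert 1 <= len(out) <= 6
```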
# Configuring speculative decoding in TensorRT-LLM
from tensorrt_llm.runtime import ModelRunnerCpp

runner = ModelRunnerCpp.from_dir(
    engine_dir="./llama-70b-fp8-engine",
    rank=0,
    # Speculative decoding config
    max_draft_len=5,   # Speculate up to 5 tokens
    is_medusa=False,   # Using standard draft model
)
Speculative Decoding Impact (Llama-2 70B target, Llama-2 7B draft, H100)
| Scenario | Standard (tok/s) | Speculative (tok/s) | Acceptance Rate | Speedup |
|---|---|---|---|---|
| Code generation | 38 | 68 | 78% | 1.79x |
| English prose | 38 | 62 | 72% | 1.63x |
| Translation (en-zh) | 38 | 52 | 58% | 1.37x |
| Creative writing | 38 | 48 | 52% | 1.26x |
Memory Planning and KV Cache Budgeting
A critical operational concern with TensorRT-LLM is memory planning. The engine itself consumes a fixed amount of GPU memory (model weights), but the KV cache grows with the number of concurrent requests and their sequence lengths. The relationship is:

KV cache bytes per token = 2 (K and V) × num_layers × num_kv_heads × head_dim × bytes_per_element

For Llama-2 70B in FP16 with GQA (8 KV heads, 128 head dim, 80 layers):

2 × 80 × 8 × 128 × 2 bytes ≈ 320 KB per token

On a single H100 with 80 GB, after loading INT4-quantized model weights (~35-42 GB; FP8 weights for a 70B model would consume ~70 GB on their own) and reserving memory for activations (~5 GB), you have roughly 40 GB for the KV cache. That supports:

40 GB ÷ 320 KB/token ≈ 131,000 cacheable tokens

With a max sequence length of 4096, that is roughly 30 concurrent requests. With paged attention, unused pages are freed immediately, so the actual concurrency can be higher for shorter sequences.
For large models at high concurrency, KV cache memory — not compute — limits your throughput. This is why quantizing the KV cache to INT8 or even INT4 (with minimal quality loss) is becoming standard practice. TensorRT-LLM supports KV cache quantization separately from weight quantization, allowing FP8 weights with INT8 KV cache.
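The budgeting arithmetic above, as a small calculator (numbers from the text; the helper name is mine):

```python
# KV-cache budget arithmetic for Llama-2 70B with GQA
# (80 layers, 8 KV heads, 128 head dim, FP16 cache).
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 covers both the K and the V tensors
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()
assert per_token == 327_680            # ~320 KB per token

budget_gb = 40                         # free HBM after weights + activations
tokens = budget_gb * 1024**3 / per_token
print(f"{tokens:,.0f} cacheable tokens")
print(f"{tokens / 4096:.0f} concurrent 4K-token requests")
```

Swapping dtype_bytes to 1 (INT8 KV cache) doubles the token budget, which is the quantization lever described above.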
Practical Deployment Recommendations
Choosing Your Configuration
The decision tree for TensorRT-LLM deployment:
- What GPU do you have? H100/H200: use FP8. A100: use INT8 SmoothQuant or INT4 AWQ. Older GPUs: use FP16 with INT8 weight-only quantization.
- What is your latency budget? If sub-100ms TTFT matters, use tensor parallelism across the minimum number of GPUs needed. If throughput is all that matters, maximize batch size before adding GPUs.
- What is your sequence length distribution? If most requests are short (fewer than 512 tokens), use smaller page sizes in the paged KV cache. If you serve long-context workloads (greater than 8K tokens), ensure you have enough memory headroom and consider KV cache quantization.
- How often does your model change? If rarely (monthly), TensorRT-LLM is ideal. If daily, the build overhead may be prohibitive — consider vLLM for development and TensorRT-LLM for production, with a build pipeline that produces engines overnight.
Recommended Configurations by Use Case
| Use Case | Framework | Precision | Parallelism | Rationale |
|---|---|---|---|---|
| Chatbot (latency-sensitive) | TRT-LLM | FP8 | TP across min GPUs | Lowest TTFT |
| Batch processing | TRT-LLM | FP8 or INT4 | Max batch size | Highest throughput |
| Multi-model serving | vLLM | AWQ INT4 | Per-model TP | LoRA swap flexibility |
| Structured output (JSON) | SGLang | FP16/INT4 | TP | Constrained decoding |
| Research/prototyping | vLLM | FP16 | Single GPU | Fast iteration |
| Edge deployment | TRT-LLM | INT4 | Single GPU | Minimum memory |
Monitoring and Troubleshooting
Key metrics to track in a TensorRT-LLM deployment:
- Time to first token (TTFT): Dominated by prefill time. If this spikes, check for long input sequences or insufficient tensor parallelism.
- Inter-token latency (ITL): Should be stable across the generation. If it varies, check for KV cache memory pressure causing page swapping.
- KV cache utilization: Monitor the fraction of allocated KV cache pages in use. If consistently above 90%, you are at risk of request rejection.
- Batch size utilization: The actual number of requests being processed per iteration. Low utilization means your request rate is too low to fill the batch, or your scheduling is suboptimal.
- GPU SM occupancy: Should be above 80% during decode steps. Low occupancy suggests the batch size is too small to saturate the GPU.
Conclusion
TensorRT-LLM sits at one end of the inference optimization spectrum: maximum performance, maximum complexity, minimum flexibility. Its graph optimization pipeline — layer fusion, kernel auto-tuning, precision calibration — genuinely produces faster inference than any open-source alternative on NVIDIA hardware. The FP8 workflow on Hopper GPUs delivers near-FP16 quality at nearly double the throughput, making it the default choice for latency-sensitive production deployments.
But “fastest” is not always “best.” The engine build process is slow and brittle, model coverage lags the open-source community, and operational complexity is higher. For many teams, vLLM’s simplicity and rapid model support make it the better overall choice, even at 20-30% lower throughput.
The practical recommendation: use vLLM for development and experimentation, build TensorRT-LLM engines for production models that are stable, and consider SGLang when structured generation is central to your application. Monitor the ecosystem closely — the performance gap is narrowing as torch.compile, Triton kernels, and other open-source optimization efforts mature.