Static activation quantization uses fixed scale factors computed from calibration data—same scale for every input, every token, every request. Dynamic quantization recomputes scales at runtime for each input tensor. The accuracy difference can be dramatic: on OPT-13B, static per-tensor INT8 adds 2.1 perplexity points, while dynamic per-token INT8 adds 0.3 points—7x lower quality loss. The throughput cost is the absmax reduction needed to compute scales on the fly, which adds 2-5% overhead on A100 but 8-12% on older architectures without fast reduction primitives. This is why vLLM uses dynamic per-token quantization by default for activations despite the compute overhead—the quality win justifies the throughput loss.
This post covers the quantization parameter problem, the calibration methods for static quantization (minmax, percentile, MSE-optimal), the runtime overhead of dynamic quantization, SmoothQuant as a hybrid approach that bridges the gap, and production decision criteria with benchmarks on Llama and Mistral models.
The Quantization Parameter Problem
Uniform Affine Quantization Review
For INT8 quantization, each floating-point tensor is mapped to 8-bit integers using a scale $s$ and zero-point $z$:

$$x_{\text{int8}} = \operatorname{clamp}\left(\operatorname{round}\left(\frac{x}{s}\right) + z,\ -128,\ 127\right)$$

The scale is determined by the tensor's range:

$$s = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}} = \frac{x_{\max} - x_{\min}}{255}$$

For symmetric quantization ($z = 0$), which is more common in INT8 GEMM kernels:

$$s = \frac{\max(|x|)}{127}$$

The question is: how do you determine $s$ (and $z$) for activations?
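A quick numeric round trip makes the mapping concrete. This is a toy sketch: the 3.2 abs-max is an assumed value, not taken from any real tensor.

```python
# Toy round trip through symmetric INT8 quantization (assumed abs-max of 3.2).
def quantize(x, scale):
    return max(-128, min(127, round(x / scale)))

def dequantize(q, scale):
    return q * scale

scale = 3.2 / 127             # max(|x|) / 127, ~0.0252
q = quantize(1.0, scale)      # 1.0 / 0.0252 = 39.7 -> rounds to 40
x_hat = dequantize(q, scale)  # 40 * 0.0252 = ~1.008
print(q, round(x_hat, 3))     # 40 1.008
```

The round-trip error is bounded by half the scale, which is why a scale inflated by outliers directly translates into larger error on every small value.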
```python
# The fundamental difference:

# STATIC: scales are constants, computed once from calibration data
import torch

class StaticQuantLinear:
    def __init__(self, weight_int8, weight_scale, act_scale):
        self.weight_int8 = weight_int8    # [out, in] INT8
        self.weight_scale = weight_scale  # [out] or scalar
        self.act_scale = act_scale        # scalar (fixed at calibration time)

    def forward(self, x_fp16):
        # Quantize the activation using the pre-computed scale
        x_int8 = torch.round(x_fp16 / self.act_scale).clamp(-128, 127).to(torch.int8)
        # INT8 GEMM (cast to int32 for the reference matmul;
        # real kernels use INT8 tensor cores): INT8 x INT8 -> INT32
        out_int32 = torch.matmul(x_int8.int(), self.weight_int8.int().T)
        # Dequantize
        out_fp16 = out_int32.float() * (self.act_scale * self.weight_scale)
        return out_fp16.half()

# DYNAMIC: scales are computed per-input at runtime
class DynamicQuantLinear:
    def __init__(self, weight_int8, weight_scale):
        self.weight_int8 = weight_int8
        self.weight_scale = weight_scale
        # No act_scale stored - computed at runtime

    def forward(self, x_fp16):
        # Compute the scale from the actual input (runtime overhead)
        act_scale = x_fp16.abs().max() / 127.0
        x_int8 = torch.round(x_fp16 / act_scale).clamp(-128, 127).to(torch.int8)
        out_int32 = torch.matmul(x_int8.int(), self.weight_int8.int().T)
        out_fp16 = out_int32.float() * (act_scale * self.weight_scale)
        return out_fp16.half()
```
Static Quantization: Offline Calibration
Calibration Methods
Static quantization runs a calibration dataset through the model and records activation statistics. The choice of how to compute the scale from these statistics has a significant impact on quality.
```python
# Method 1: Min-Max (simplest, often worst)
# Use the global min/max across all calibration samples
class MinMaxCalibrator:
    def __init__(self):
        self.running_min = float('inf')
        self.running_max = float('-inf')

    def observe(self, tensor):
        self.running_min = min(self.running_min, tensor.min().item())
        self.running_max = max(self.running_max, tensor.max().item())

    def compute_scale(self):
        # Symmetric quantization
        abs_max = max(abs(self.running_min), abs(self.running_max))
        return abs_max / 127.0

# Problem: a single outlier in any calibration sample determines the range
# for ALL future inputs. Outliers waste dynamic range.
```
```python
# Method 2: Percentile (clip outliers)
class PercentileCalibrator:
    def __init__(self, percentile=99.99):
        self.percentile = percentile
        self.all_values = []

    def observe(self, tensor):
        self.all_values.append(tensor.flatten().float().cpu())

    def compute_scale(self):
        all_vals = torch.cat(self.all_values)
        abs_max = torch.quantile(all_vals.abs(), self.percentile / 100.0)
        return (abs_max / 127.0).item()

# Clips the top 0.01% of values; those outliers saturate to +/-127.
# Works well when outliers are rare and the clipping error is small
# relative to the range-expansion error of accommodating them.
```
```python
# Method 3: MSE minimization (find the scale that minimizes quantization error)
class MSECalibrator:
    def __init__(self, n_bins=2048):
        self.n_bins = n_bins
        self.histograms = []

    def observe(self, tensor):
        # Build a histogram of absolute values
        hist = torch.histc(tensor.abs().float(), bins=self.n_bins, min=0,
                           max=tensor.abs().max().item())
        self.histograms.append((hist, tensor.abs().max().item()))

    def compute_scale(self):
        # Try different clipping thresholds and pick the one that
        # minimizes the mean squared quantization error.
        # (Sketch: aggregated_hist, global_abs_max, and
        # compute_quantization_mse are elided below.)
        best_scale = None
        best_mse = float('inf')
        # ... merge histograms across samples ...
        for clip_ratio in [i / 100.0 for i in range(80, 101)]:
            trial_max = global_abs_max * clip_ratio
            trial_scale = trial_max / 127.0
            # Compute the MSE this quantization would incur
            mse = compute_quantization_mse(aggregated_hist, trial_scale)
            if mse < best_mse:
                best_mse = mse
                best_scale = trial_scale
        return best_scale
```
```python
# Method 4: Entropy / KL-divergence (TensorRT's default)
class EntropyCalibrator:
    """
    Find the clipping threshold that minimizes the KL divergence
    between the original distribution and the quantized distribution.
    """
    def __init__(self, n_bins=8192):
        self.n_bins = n_bins
        self.histogram = None
        self.global_abs_max = 0.0

    def observe(self, tensor):
        vals = tensor.abs().float()
        self.global_abs_max = max(self.global_abs_max, vals.max().item())
        # (Real implementations fix the histogram range up front so bin
        # widths stay consistent across observations.)
        hist = torch.histc(vals, bins=self.n_bins, min=0, max=self.global_abs_max)
        if self.histogram is None:
            self.histogram = hist
        else:
            self.histogram += hist

    def compute_scale(self):
        # Try each candidate threshold (in bins) for the quantized range
        # and keep the one that minimizes KL(original || quantized)
        reference = self.histogram.clone()
        reference /= reference.sum()  # normalize to a distribution
        best_threshold_bin = self.n_bins
        best_kl = float('inf')
        for num_quantized_bins in range(128, self.n_bins + 1):
            # Quantize the first num_quantized_bins histogram bins down to
            # 128 levels and compute the KL divergence (helper elided)
            kl = compute_kl_divergence(reference, num_quantized_bins, 128)
            if kl < best_kl:
                best_kl = kl
                best_threshold_bin = num_quantized_bins
        # Convert the bin index back to a threshold value
        threshold = best_threshold_bin / self.n_bins * self.global_abs_max
        return threshold / 127.0
```
Static Calibration Method Comparison (Llama-2-7B W8A8)
| Calibration Method | WikiText-2 PPL | Calibration Time | Notes |
|---|---|---|---|
| FP16 baseline | 5.47 | N/A | Reference |
| Min-Max | 5.89 (+0.42) | 2 min | Outlier-sensitive |
| Percentile (99.99%) | 5.62 (+0.15) | 5 min | Good general choice |
| MSE minimization | 5.58 (+0.11) | 15 min | Best for uniform-ish distributions |
| KL-divergence (entropy) | 5.55 (+0.08) | 20 min | TensorRT default, best overall |
| Dynamic (per-token) | 5.49 (+0.02) | N/A | Near-lossless, runtime cost |
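Any of these calibrators plugs into a model via forward pre-hooks. Here is a minimal sketch using a toy two-layer model and random stand-in calibration batches; a real pipeline would stream representative deployment text through the actual model instead.

```python
# Sketch: collecting per-layer static activation scales with forward pre-hooks.
import torch

class MinMaxCalibrator:
    def __init__(self):
        self.abs_max = 0.0
    def observe(self, t):
        self.abs_max = max(self.abs_max, t.abs().max().item())
    def compute_scale(self):
        return self.abs_max / 127.0

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 64))
calibrators, hooks = {}, []
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        calibrators[name] = MinMaxCalibrator()
        # A pre-hook sees the *input* activation of each linear layer
        hooks.append(mod.register_forward_pre_hook(
            lambda m, args, c=calibrators[name]: c.observe(args[0])))

with torch.no_grad():
    for _ in range(8):            # 8 toy "calibration" batches
        model(torch.randn(4, 64))

for h in hooks:
    h.remove()

act_scales = {n: c.compute_scale() for n, c in calibrators.items()}
print(act_scales)  # one static scale per linear layer, frozen at deploy time
```

The same skeleton works for the percentile, MSE, and entropy calibrators; only `observe`/`compute_scale` change.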
The Calibration Dataset Problem
Static calibration has a fundamental fragility: the scales are only optimal for inputs similar to the calibration data. If the deployment distribution differs, the scales may be badly wrong.
```python
# Demonstration: calibration distribution mismatch.
# Calibrate on English Wikipedia text:
model_wiki_calibrated = static_quantize(model, calib_data=wiki_samples)

# Evaluate perplexity on different distributions:
eval_results = {
    "WikiText-2 (in-distribution)": 5.55,     # Good: matches calibration
    "Code (Python)": 8.12,                    # Worse: different activation ranges
    "Mathematical proofs": 9.45,              # Much worse: many outliers in math tokens
    "Chinese text": 7.89,                     # Worse: different token embeddings
    "Mixed (real deployment traffic)": 6.82,  # Moderate: average of distributions
}

# With dynamic quantization:
eval_results_dynamic = {
    "WikiText-2": 5.49,                       # Slightly better
    "Code (Python)": 5.52,                    # Much better: adapts to code activations
    "Mathematical proofs": 5.68,              # Much better: handles math outliers
    "Chinese text": 5.51,                     # Much better: adapts to CJK embeddings
    "Mixed (real deployment traffic)": 5.52,  # Consistent across distributions
}
```
If your deployment serves heterogeneous traffic (multiple languages, code, math, structured data), dynamic quantization may be necessary to avoid quality regressions on out-of-calibration inputs. The calibration dataset must represent the full deployment distribution for static quantization to work well.
Dynamic Quantization: Per-Token Online Scaling
Granularity Options
Dynamic quantization can operate at different granularities, each with a different accuracy-overhead trade-off:
```python
# Per-tensor dynamic: one scale for the entire activation tensor.
# Cheapest but worst accuracy (same outlier problem as static).
def dynamic_per_tensor(x):
    scale = x.abs().max() / 127.0
    return torch.round(x / scale).clamp(-128, 127).to(torch.int8), scale

# Per-token dynamic: one scale per token (row) in the activation matrix.
# Good accuracy-overhead balance. Most common in practice.
def dynamic_per_token(x):
    # x shape: [batch * seq_len, hidden_dim]; one scale per row
    scales = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.round(x / scales).clamp(-128, 127).to(torch.int8)
    return x_int8, scales  # scales shape: [batch * seq_len, 1]

# Per-group dynamic: one scale per group of elements within each token.
# Best accuracy but highest overhead.
def dynamic_per_group(x, group_size=128):
    # x shape: [batch * seq_len, hidden_dim]
    B, D = x.shape
    x_reshaped = x.view(B, D // group_size, group_size)
    scales = x_reshaped.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.round(x_reshaped / scales).clamp(-128, 127).to(torch.int8)
    return x_int8.view(B, D), scales.view(B, D // group_size)
```
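To see why per-token scaling wins, here is a toy round-trip experiment: random activations with one outlier-heavy row, quantized both ways. The data and magnitudes are illustrative, not from a real model.

```python
# Toy demonstration: a per-token scale isolates an outlier to its own row,
# while a per-tensor scale lets it crush the resolution of every row.
import torch

torch.manual_seed(0)
x = torch.randn(8, 4096)
x[3] *= 50.0   # one token with large-magnitude activations

def roundtrip_per_tensor(x):
    s = x.abs().max() / 127.0
    return torch.round(x / s).clamp(-128, 127) * s

def roundtrip_per_token(x):
    s = x.abs().amax(dim=-1, keepdim=True) / 127.0
    return torch.round(x / s).clamp(-128, 127) * s

err_tensor = (x - roundtrip_per_tensor(x)).abs().mean()
err_token = (x - roundtrip_per_token(x)).abs().mean()
print(f"per-tensor MAE: {err_tensor:.4f}, per-token MAE: {err_token:.4f}")
# Per-token error is far lower: only row 3 pays for its own outlier.
```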
Runtime Overhead
The cost of dynamic quantization is the amax reduction plus the quantization (divide, round, clamp) applied to every activation tensor at every layer:
```cuda
// Per-token dynamic quantization kernel.
// This runs at every linear layer, before the INT8 GEMM.
__global__ void quantize_per_token_int8(
    const half* __restrict__ input,   // [M, K]
    int8_t* __restrict__ output,      // [M, K]
    float* __restrict__ scales,       // [M]
    int M, int K
) {
    int row = blockIdx.x;
    if (row >= M) return;

    // Step 1: find the max absolute value in this row (reduction)
    float local_max = 0.0f;
    for (int col = threadIdx.x; col < K; col += blockDim.x) {
        float val = __half2float(input[row * K + col]);
        local_max = fmaxf(local_max, fabsf(val));
    }
    // Warp-level reduction to find the row max
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        local_max = fmaxf(local_max, __shfl_down_sync(0xffffffff, local_max, offset));
    }
    // Block-level reduction (across warps)
    __shared__ float warp_maxes[32];
    int warp_id = threadIdx.x / warpSize;
    int lane_id = threadIdx.x % warpSize;
    if (lane_id == 0) warp_maxes[warp_id] = local_max;
    __syncthreads();
    if (warp_id == 0) {
        local_max = (lane_id < blockDim.x / warpSize) ? warp_maxes[lane_id] : 0.0f;
        for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
            local_max = fmaxf(local_max, __shfl_down_sync(0xffffffff, local_max, offset));
        }
    }
    // Thread 0 (lane 0 of warp 0) now holds the row max; publish the scale
    __shared__ float row_scale;
    if (threadIdx.x == 0) {
        row_scale = local_max / 127.0f;
        scales[row] = row_scale;
    }
    __syncthreads();

    // Step 2: quantize each element using the shared row scale
    // (every thread reads row_scale; local_max is only valid in warp 0)
    float inv_scale = (row_scale > 0.0f) ? (1.0f / row_scale) : 0.0f;
    for (int col = threadIdx.x; col < K; col += blockDim.x) {
        float val = __half2float(input[row * K + col]);
        int q = __float2int_rn(val * inv_scale);
        q = max(-128, min(127, q));
        output[row * K + col] = (int8_t)q;
    }
}
```
```cuda
// Cost analysis for hidden_dim = 4096:
//   Step 1 (amax):     4096 FP16 reads                      = 8 KB per row
//   Step 2 (quantize): 4096 FP16 reads + 4096 INT8 writes   = 8 + 4 = 12 KB per row
//   Total: ~20 KB memory traffic per row
// At 3.35 TB/s (H100 HBM): 20 KB / 3.35e12 = ~6 ns per row
// For batch_size=256, seq_len=1: 256 rows * 6 ns = ~1.5 us per layer
// Model with 32 layers, 7 linear ops per layer: 32 * 7 * 1.5 us = ~336 us total
// This is ~0.3 ms overhead on top of a ~5-15 ms forward pass
```
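The back-of-envelope model is easy to re-derive in a few lines. This sanity check assumes the same 3.35 TB/s H100 HBM figure; keeping full precision on the ~6.1 ns/row term lands slightly above the ~336 us quoted above, which rounds 6.1 ns down to 6 ns.

```python
# Reproduce the cost model: memory traffic for per-token INT8 quantization
# of a [rows, 4096] FP16 activation, at H100 HBM bandwidth.
hidden = 4096
read_fp16 = hidden * 2      # step 1: amax pass reads the FP16 row (8 KB)
reread_fp16 = hidden * 2    # step 2: quantize pass reads it again (8 KB)
write_int8 = hidden * 1     # step 2: writes the INT8 row (4 KB)
bytes_per_row = read_fp16 + reread_fp16 + write_int8   # ~20 KB

bw = 3.35e12                # H100 HBM, bytes/s
t_row = bytes_per_row / bw  # ~6.1 ns
rows, layers, linears = 256, 32, 7
total = rows * t_row * layers * linears
print(f"{t_row * 1e9:.1f} ns/row, {total * 1e6:.1f} us total")
```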
Dynamic Quantization Overhead (Llama-7B, H100, Single Forward Pass)
[Chart: per-layer and total dynamic quantization overhead, in microseconds]
SmoothQuant: Making Static Quantization Work
SmoothQuant addresses the activation outlier problem that makes static quantization difficult. The key observation: weight distributions are smooth (easy to quantize), but activation distributions have outliers in specific channels that persist across inputs.
```python
# SmoothQuant: migrate quantization difficulty from activations to weights
#   Y = (X * diag(s)^{-1}) * (diag(s) * W)
# The s vector scales down outlier activation channels and scales up
# the corresponding weight channels to compensate.
import torch

def smooth_quant(model, activation_scales, alpha=0.5):
    """
    activation_scales: per-channel max absolute activation value,
                       shape [in_features], keyed by linear-layer name
    alpha: migration strength (0 = difficulty stays on activations,
           1 = all difficulty migrated onto the weights)
    """
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Linear):
            continue
        act_scales = activation_scales[name]                   # [in_features]
        weight_scales = module.weight.abs().max(dim=0).values  # [in_features]
        # Smoothing factor: s_j = max(|X_j|)^alpha / max(|W_:,j|)^(1-alpha)
        s = (act_scales.pow(alpha) / weight_scales.pow(1 - alpha)).clamp(min=1e-5)
        # Multiply weights by s; the 1/s on the activation side is folded
        # into the preceding LayerNorm so it costs nothing at runtime
        module.weight.data *= s.unsqueeze(0)
        prev_layernorm = find_preceding_layernorm(model, name)  # helper elided
        if prev_layernorm is not None:
            prev_layernorm.weight.data /= s
            if prev_layernorm.bias is not None:
                prev_layernorm.bias.data /= s
    return model
```
After SmoothQuant, the activation distribution is much more uniform, and static per-tensor quantization approaches dynamic per-token quality:
```python
# Before SmoothQuant:
# Activation channel magnitudes (example): [0.5, 0.3, 45.2, 0.4, 0.6, 38.1, ...]
# Per-tensor scale = 45.2 / 127 = 0.356
# Channel with magnitude 0.3 quantizes to: round(0.3 / 0.356) = round(0.84) = 1
# Dequantized value: 1 * 0.356 = 0.356 -> ~19% relative error

# After SmoothQuant (alpha=0.5):
# Activation channel magnitudes: [1.2, 0.8, 4.8, 1.0, 1.4, 4.2, ...]
# Per-tensor scale = 4.8 / 127 = 0.0378
# Channel with magnitude 0.8 quantizes to: round(0.8 / 0.0378) = round(21.2) = 21
# Dequantized value: 21 * 0.0378 = 0.794 -> ~0.8% relative error
```
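The same arithmetic is runnable; re-computing with unrounded scales gives slightly different decimals than hand-rounded figures, but the conclusion is identical: a ~20x reduction in relative error for small channels.

```python
# Verify the worked example: quantization error for a small channel under a
# shared per-tensor scale, before and after smoothing.
def rel_error(value, scale):
    q = max(-128, min(127, round(value / scale)))
    return abs(q * scale - value) / value

before = rel_error(0.3, 45.2 / 127)  # outlier channel sets the scale
after = rel_error(0.8, 4.8 / 127)    # smoothed, compressed range
print(f"before: {before:.1%}, after: {after:.1%}")  # before: 18.6%, after: 0.8%
```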
SmoothQuant Effect on Static Quantization (Llama-2-7B W8A8)
| Configuration | PPL | Delta from FP16 | Throughput (H100) |
|---|---|---|---|
| FP16 baseline | 5.47 | 0 | 4200 tok/s |
| Static W8A8 (no smoothing) | 5.89 | +0.42 | 6800 tok/s |
| Static W8A8 + SmoothQuant (alpha=0.5) | 5.55 | +0.08 | 6750 tok/s |
| Static W8A8 + SmoothQuant (alpha=0.75) | 5.52 | +0.05 | 6750 tok/s |
| Dynamic W8A8 (per-token) | 5.49 | +0.02 | 6400 tok/s |
SmoothQuant’s alpha hyperparameter controls the trade-off. Higher alpha migrates more difficulty to weights (better for activation outlier models like OPT and BLOOM). Lower alpha keeps more on activations (better when weights already have outlier channels). For Llama-family models, alpha between 0.5 and 0.75 is typically optimal.
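How alpha redistributes the difficulty is easy to see on toy numbers. The per-channel maxima below are illustrative assumptions, not measurements from any model.

```python
# Sketch: effect of SmoothQuant's alpha on the smoothed activation range.
# Smoothing factor s_j = act_j**alpha / w_j**(1 - alpha);
# the smoothed activation magnitude is act_j / s_j.
act = [0.5, 45.2, 38.1]   # per-channel activation maxima (two outlier channels)
w = [0.8, 0.9, 0.7]       # per-channel weight maxima (toy values)

spreads = {}
for alpha in (0.25, 0.5, 0.75):
    s = [a**alpha / b**(1 - alpha) for a, b in zip(act, w)]
    smoothed = [a / sj for a, sj in zip(act, s)]
    spreads[alpha] = max(smoothed) / min(smoothed)
    print(f"alpha={alpha}: smoothed activation spread = {spreads[alpha]:.1f}x")
# Higher alpha compresses the activation spread (pushing difficulty onto the
# weights), which is why outlier-heavy models like OPT prefer alpha near 0.85
# while Llama-family models do well at 0.5-0.75.
```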
Hardware Kernel Support
Not all INT8 kernels support all quantization modes. The kernel determines whether you can use static or dynamic quantization.
cuBLAS INT8 GEMM (cublasLtMatmul):
- Supports per-tensor and per-column (per-channel) scales
- Static or dynamic: both work (you provide the scale)
- A_int8 [M,K] * B_int8 [K,N] -> C_int32 [M,N]
- Dequantization: C_fp = C_int32 * scale_A * scale_B
- Per-token: scale_A is [M,1], per-channel: scale_B is [1,N]
TensorRT INT8:
- Per-tensor static scales embedded in the engine
- No per-token dynamic support in the fused INT8 path
- Must use calibration (entropy or percentile)
FasterTransformer / TensorRT-LLM:
- SmoothQuant kernels support per-token dynamic activation scales
- Weight scales are per-channel (static)
- Fused attention + linear kernels with inline quantization
vLLM (W8A8):
- Uses cutlass INT8 GEMM with per-token activation scales
- Supports both dynamic per-token and static per-tensor
- SmoothQuant integration for static mode
llama.cpp:
- CPU INT8: per-block quantization (block_size=32)
- Dynamic within each block (scale computed per 32 elements)
- Hybrid approach: small blocks approximate per-token behavior
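llama.cpp's per-block scheme can be sketched in a few lines. This is a simplified Q8_0-style round trip (one float scale per 32 elements), not the actual ggml storage layout:

```python
# Simplified Q8_0-style block quantization: one FP scale per 32 elements.
# Small blocks adapt to local magnitude, approximating per-token behavior.
import torch

def q8_0_roundtrip(x, block_size=32):
    blocks = x.view(-1, block_size)
    scales = blocks.abs().amax(dim=-1, keepdim=True) / 127.0
    scales = scales.clamp(min=1e-12)  # avoid div-by-zero on all-zero blocks
    q = torch.round(blocks / scales).clamp(-128, 127)
    return (q * scales).view_as(x)

torch.manual_seed(0)
x = torch.randn(4096)
x[100] = 80.0  # a single outlier
err_block = (x - q8_0_roundtrip(x)).abs().mean()
# Compare against one scale for the whole tensor:
s = x.abs().max() / 127.0
err_tensor = (x - torch.round(x / s).clamp(-128, 127) * s).abs().mean()
print(f"per-block MAE: {err_block:.4f}, per-tensor MAE: {err_tensor:.4f}")
# The outlier only degrades its own 32-element block.
```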
Kernel Performance Comparison
```python
# Benchmark: static vs dynamic INT8 GEMM performance
# Matrix: M=256 (batch), K=4096 (input), N=4096 (output)
# Hardware: H100 SXM

# Static per-tensor:
#   Scale computation: 0 (pre-computed)
#   GEMM time: 42 us
#   Dequant: included in epilogue (scale * output)
#   Total: 42 us

# Dynamic per-token:
#   Scale computation: 8 us (amax reduction over K=4096 for 256 rows)
#   GEMM time: 42 us (same GEMM kernel, different scale application)
#   Dequant: included in epilogue (per-row scale * output)
#   Total: 50 us

# Dynamic per-token with fused kernel:
#   Scale computation: fused into the GEMM prologue
#   GEMM time: 45 us (slightly slower due to fused quantization)
#   Total: 45 us

# Dynamic quantization costs ~19% per GEMM call unfused (50 vs 42 us)
# and ~7% fused (45 vs 42 us).
# At the model level: ~3-8% total latency overhead.
```
The Per-Token + Per-Channel Regime
The dominant production configuration for W8A8 INT8 inference is per-token dynamic activation scaling with per-channel static weight scaling. This is the best accuracy-performance trade-off:
```python
# W8A8 per-token x per-channel: the production standard
def w8a8_per_token_per_channel(x_fp16, weight_int8, weight_scales):
    """
    x_fp16: [M, K] activation tensor (FP16)
    weight_int8: [N, K] quantized weights (INT8)
    weight_scales: [N] per-output-channel weight scales (FP32)
    """
    # Dynamic per-token activation quantization: one scale per row of x
    act_absmax = x_fp16.abs().amax(dim=-1, keepdim=True)  # [M, 1]
    act_scales = act_absmax / 127.0                       # [M, 1]
    x_int8 = (x_fp16 / act_scales).round().clamp(-128, 127).to(torch.int8)
    # INT8 GEMM: x_int8 [M,K] x weight_int8^T [K,N] -> out_int32 [M,N]
    out_int32 = torch.matmul(x_int8.int(), weight_int8.int().T)
    # Dequantize with per-token x per-channel scales:
    #   out_fp = out_int32 * act_scales[M,1] * weight_scales[1,N]
    out_fp = out_int32.float() * act_scales * weight_scales.unsqueeze(0)
    return out_fp.half()

# Why this works:
# - Per-token handles activation outliers (tokens with high norms)
# - Per-channel handles weight outlier channels
# - The GEMM itself is pure INT8 (maximum hardware utilization)
# - Only the epilogue (dequantization) uses FP32 (cheap)
```
Per-token x per-channel is the sweet spot because the per-token scale computation is embarrassingly parallel (one reduction per row) and the per-channel weight scales are pre-computed constants. The GEMM kernel can fuse the dequantization into its epilogue with minimal overhead. This is what vLLM, TensorRT-LLM, and FasterTransformer all use for their W8A8 paths.
Decision Framework
```python
# Decision tree for static vs dynamic quantization
def choose_quantization_mode(
    model_family,
    has_activation_outliers,
    calibration_data_available,
    calibration_matches_deployment,
    latency_sensitivity,
    hardware,
):
    # Rule 1: If no calibration data, dynamic is the only option
    if not calibration_data_available:
        return "dynamic_per_token"
    # Rule 2: If calibration does not match the deployment distribution
    if not calibration_matches_deployment:
        return "dynamic_per_token"
    # Rule 3: If the model has severe activation outliers (OPT, BLOOM)
    if has_activation_outliers == "severe":
        if hardware in ["H100", "A100"]:
            return "static_with_smoothquant"  # Best throughput
        else:
            return "dynamic_per_token"        # Fallback
    # Rule 4: If latency is critical and calibration is representative
    if latency_sensitivity == "extreme":
        return "static_per_tensor_with_entropy_calibration"
    # Rule 5: Default for most LLM deployments
    return "dynamic_per_token_per_channel"
```
Static vs Dynamic: Full Decision Matrix
| Scenario | Recommended Mode | Reason |
|---|---|---|
| Llama-2 on H100, batch serving | Static + SmoothQuant | Stable distributions, max throughput |
| Multilingual chatbot, mixed traffic | Dynamic per-token | Input distribution varies widely |
| Code generation (Codestral) | Dynamic per-token | Code activations differ from text calibration |
| TensorRT engine, fixed-size batches | Static (entropy calibration) | TensorRT prefers static, engine is fixed |
| Edge deployment, CPU inference | Static per-group (block_q8_0) | No dynamic support in most CPU kernels |
| OPT-175B with massive outliers | Static + SmoothQuant alpha=0.85 | SmoothQuant tames the outliers |
| Fine-tuned model, no calibration data | Dynamic per-token | Cannot trust calibration for fine-tuned weights |
Practical Implementation Comparison
Static Path (TensorRT-LLM with SmoothQuant)
```python
# TensorRT-LLM static quantization pipeline
from tensorrt_llm.quantization import quantize_and_export

# Step 1: Collect activation statistics
calib_config = {
    "algorithm": "smoothquant",
    "smoothquant_alpha": 0.5,
    "calibration_dataset": "cnn_dailymail",
    "num_calibration_samples": 512,
    "calibration_sequence_length": 2048,
}

# Step 2: Apply SmoothQuant + compute static scales
quantize_and_export(
    model_dir="meta-llama/Llama-2-7b-hf",
    output_dir="llama-7b-sq-int8",
    quant_config={
        "quant_algo": "W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN",
        "calib_dataset": calib_config["calibration_dataset"],
        "calib_samples": calib_config["num_calibration_samples"],
        "smoothquant_alpha": calib_config["smoothquant_alpha"],
    },
)

# Step 3: Build the engine. SmoothQuant factors and weight scales are baked
# into the engine binary. Note the PER_TOKEN plugin still computes activation
# scales at runtime; the PER_TENSOR variant is the fully static path.
```
Dynamic Path (vLLM with Dynamic Quantization)
```python
# vLLM dynamic quantization - no calibration step needed
from vllm import LLM, SamplingParams

# Just load the model with a quantization config;
# vLLM handles dynamic per-token scaling internally
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="int8",  # W8A8 with dynamic per-token activation scales
    dtype="float16",
    max_model_len=4096,
)

# Or with a pre-quantized model (AutoGPTQ, etc.):
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
)
```
End-to-End Latency: Static vs Dynamic (Llama-2-7B, H100)
[Chart: end-to-end latency in ms (lower is better)]
Summary
Static quantization with SmoothQuant is the throughput-optimal choice when calibration data representative of the deployment distribution is available. Dynamic per-token quantization is the accuracy-optimal and safest choice, costing only 3-8% throughput overhead. The per-token activation scale combined with per-channel weight scale has become the default production configuration for W8A8 inference because it handles activation outliers gracefully without requiring offline calibration. Use static when you control the input distribution and need every microsecond of latency. Use dynamic when the input distribution is unpredictable or when you cannot afford calibration pipeline maintenance.