Part 22 of 30 in the Quantization Masterclass series

Static activation quantization uses fixed scale factors computed from calibration data: the same scale for every input, every token, every request. Dynamic quantization recomputes scales at runtime for each input tensor. The accuracy difference can be dramatic: on OPT-13B, static per-tensor INT8 adds 2.1 perplexity points, while dynamic per-token INT8 adds 0.3 points, a 7x lower quality loss. The throughput cost of dynamic quantization is the max-reduction needed to compute scales on the fly, which adds 2-5% overhead on A100 but 8-12% on older architectures without fast reduction primitives. This is why vLLM uses dynamic per-token quantization by default for activations despite the compute overhead: the quality win justifies the throughput loss.

This post covers the quantization parameter problem, the calibration methods for static quantization (minmax, percentile, MSE-optimal), the runtime overhead of dynamic quantization, SmoothQuant as a hybrid approach that bridges the gap, and production decision criteria with benchmarks on Llama and Mistral models.

The Quantization Parameter Problem

Uniform Affine Quantization Review

For INT8 quantization, each floating-point tensor is mapped to 8-bit integers using a scale $s$ and zero-point $z$:

$$x_q = \text{clamp}\left(\text{round}\left(\frac{x}{s}\right) + z,\; -128,\; 127\right)$$

$$x_{dq} = s \cdot (x_q - z)$$

The scale is determined by the tensor's range:

$$s = \frac{x_{max} - x_{min}}{2^b - 1}$$

For symmetric quantization (zero-point $= 0$), which is more common in INT8 GEMM kernels:

$$s = \frac{\max(|x|)}{2^{b-1} - 1}$$

The question is: how do you determine $x_{max}$ and $x_{min}$ for activations?
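Before answering that, the symmetric formulas above can be sanity-checked directly (a minimal PyTorch sketch of the zero-point-free path; the tensor values are made up):

```python
import torch

def quantize_symmetric_int8(x: torch.Tensor):
    # s = max(|x|) / (2^(b-1) - 1), with b = 8
    scale = x.abs().max() / 127.0
    x_q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return x_q, scale

def dequantize(x_q: torch.Tensor, scale: torch.Tensor):
    # x_dq = s * x_q  (zero-point is 0 in the symmetric case)
    return x_q.float() * scale

x = torch.tensor([0.5, -1.0, 0.25, 1.27])
x_q, scale = quantize_symmetric_int8(x)
x_dq = dequantize(x_q, scale)
# Round-trip error is bounded by half the quantization step
assert (x - x_dq).abs().max() <= scale / 2 + 1e-6
```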

import torch

# The fundamental difference:

# STATIC: scales are constants, computed once from calibration data
class StaticQuantLinear:
    def __init__(self, weight_int8, weight_scale, act_scale):
        self.weight_int8 = weight_int8    # [out, in] INT8
        self.weight_scale = weight_scale  # [out] or scalar
        self.act_scale = act_scale        # scalar (fixed at calibration time)

    def forward(self, x_fp16):
        # Quantize activation using pre-computed scale
        x_int8 = torch.round(x_fp16 / self.act_scale).clamp(-128, 127).to(torch.int8)
        # INT8 GEMM
        # Emulate INT8 x INT8 -> INT32 (torch.matmul needs integer-widened inputs)
        out_int32 = torch.matmul(x_int8.int(), self.weight_int8.int().T)
        # Dequantize
        out_fp16 = out_int32.float() * (self.act_scale * self.weight_scale)
        return out_fp16.half()

# DYNAMIC: scales are computed per-input at runtime
class DynamicQuantLinear:
    def __init__(self, weight_int8, weight_scale):
        self.weight_int8 = weight_int8
        self.weight_scale = weight_scale
        # No act_scale stored - computed at runtime

    def forward(self, x_fp16):
        # Compute scale from the actual input (runtime overhead)
        act_scale = x_fp16.abs().max() / 127.0
        x_int8 = torch.round(x_fp16 / act_scale).clamp(-128, 127).to(torch.int8)
        out_int32 = torch.matmul(x_int8.int(), self.weight_int8.int().T)
        out_fp16 = out_int32.float() * (act_scale * self.weight_scale)
        return out_fp16.half()

Static Quantization: Offline Calibration

Calibration Methods

Static quantization runs a calibration dataset through the model and records activation statistics. The choice of how to compute the scale from these statistics has a significant impact on quality.

# Method 1: Min-Max (simplest, often worst)
# Use the global min/max across all calibration samples
class MinMaxCalibrator:
    def __init__(self):
        self.running_min = float('inf')
        self.running_max = float('-inf')

    def observe(self, tensor):
        self.running_min = min(self.running_min, tensor.min().item())
        self.running_max = max(self.running_max, tensor.max().item())

    def compute_scale(self):
        # Symmetric quantization
        abs_max = max(abs(self.running_min), abs(self.running_max))
        return abs_max / 127.0

# Problem: a single outlier in any calibration sample determines the range
# for ALL future inputs. Outliers waste dynamic range.

# Method 2: Percentile (clip outliers)
class PercentileCalibrator:
    def __init__(self, percentile=99.99):
        self.percentile = percentile
        self.all_values = []

    def observe(self, tensor):
        self.all_values.append(tensor.flatten().float().cpu())

    def compute_scale(self):
        all_vals = torch.cat(self.all_values)
        abs_max = torch.quantile(all_vals.abs(), self.percentile / 100.0)
        return (abs_max / 127.0).item()

# Clips the top 0.01% of values. Those outliers are saturated to +/-127.
# Works well when outliers are rare and the clipping error is small
# relative to the range expansion error of accommodating them.
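That tradeoff is easy to demonstrate on synthetic data (a sketch with made-up values): a single extreme outlier ruins the min-max scale for the bulk of the distribution, while the percentile scale preserves resolution for the bulk at the cost of saturating the outlier.

```python
import torch

torch.manual_seed(0)
acts = torch.randn(100_000)   # bulk of the distribution, std = 1
acts[0] = 50.0                # a single extreme outlier

# Min-max: the lone outlier dictates the scale for every value
scale_minmax = acts.abs().max() / 127.0
# Percentile (99.99%): clip the outlier, keep resolution for the bulk
scale_pct = torch.quantile(acts.abs(), 0.9999) / 127.0

def mse_after_quant(x, scale):
    x_q = torch.round(x / scale).clamp(-128, 127)
    return ((x_q * scale - x) ** 2).mean().item()

bulk = acts[1:]
# The percentile scale is far more accurate on the bulk; whether it wins on
# total MSE depends on how extreme the clipped outlier is, which is exactly
# the balance that MSE calibration (Method 3) optimizes directly
assert mse_after_quant(bulk, scale_pct) < mse_after_quant(bulk, scale_minmax)
```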

# Method 3: MSE minimization (find scale that minimizes quantization error)
class MSECalibrator:
    def __init__(self, n_bins=2048):
        self.n_bins = n_bins
        self.histograms = []

    def observe(self, tensor):
        # Build histogram of absolute values
        hist = torch.histc(tensor.abs().float(), bins=self.n_bins, min=0,
                           max=tensor.abs().max().item())
        self.histograms.append((hist, tensor.abs().max().item()))

    def compute_scale(self):
        # Try different clipping thresholds and pick the one
        # that minimizes the mean squared quantization error
        best_scale = None
        best_mse = float('inf')

        # Aggregate histograms across samples into `aggregated_hist` and
        # track the global max as `global_abs_max` (merge step elided)
        # ... merge histograms across samples ...

        for clip_ratio in [i / 100.0 for i in range(80, 101)]:
            trial_max = global_abs_max * clip_ratio
            trial_scale = trial_max / 127.0

            # Compute MSE of this quantization (compute_quantization_mse is
            # a helper, not shown, that integrates rounding + clipping error
            # over the aggregated histogram)
            mse = compute_quantization_mse(aggregated_hist, trial_scale)
            if mse < best_mse:
                best_mse = mse
                best_scale = trial_scale

        return best_scale

# Method 4: Entropy / KL-divergence (TensorRT's default)
class EntropyCalibrator:
    """
    Find the clipping threshold that minimizes the KL divergence
    between the original distribution and the quantized distribution.
    """
    def __init__(self, n_bins=8192):
        self.n_bins = n_bins
        self.histogram = None
        self.abs_max = 0.0

    def observe(self, tensor):
        vals = tensor.abs().float()
        self.abs_max = max(self.abs_max, vals.max().item())
        # Simplification: summing histograms assumes a shared bin range;
        # real calibrators rescale earlier histograms when the max grows
        hist = torch.histc(vals, bins=self.n_bins, min=0, max=self.abs_max)
        if self.histogram is None:
            self.histogram = hist
        else:
            self.histogram += hist

    def compute_scale(self):
        # Try each possible number of bins as the quantized range
        # and find the one that minimizes KL(original || quantized)
        reference = self.histogram.clone()
        reference /= reference.sum()  # normalize to distribution

        best_threshold_bin = self.n_bins
        best_kl = float('inf')

        for num_quantized_bins in range(128, self.n_bins + 1):
            # Quantize the histogram to 128 levels up to num_quantized_bins
            # and compute KL divergence (compute_kl_divergence is a helper,
            # not shown)
            kl = compute_kl_divergence(reference, num_quantized_bins, 128)
            if kl < best_kl:
                best_kl = kl
                best_threshold_bin = num_quantized_bins

        # Convert bin index to threshold
        threshold = best_threshold_bin / self.n_bins * self.abs_max
        return threshold / 127.0
Static Calibration Method Comparison (Llama-2-7B W8A8)

| Calibration Method | WikiText-2 PPL | Calibration Time | Notes |
| --- | --- | --- | --- |
| FP16 baseline | 5.47 | N/A | Reference |
| Min-Max | 5.89 (+0.42) | 2 min | Outlier-sensitive |
| Percentile (99.99%) | 5.62 (+0.15) | 5 min | Good general choice |
| MSE minimization | 5.58 (+0.11) | 15 min | Best for uniform-ish distributions |
| KL-divergence (entropy) | 5.55 (+0.08) | 20 min | TensorRT default, best overall |
| Dynamic (per-token) | 5.49 (+0.02) | N/A | Near-lossless, runtime cost |

The Calibration Dataset Problem

Static calibration has a fundamental fragility: the scales are only optimal for inputs similar to the calibration data. If the deployment distribution differs, the scales may be badly wrong.

# Demonstration: calibration distribution mismatch

# Calibrate on English Wikipedia text
model_wiki_calibrated = static_quantize(model, calib_data=wiki_samples)

# Evaluate on different distributions:
eval_results = {
    "WikiText-2 (in-distribution)":  5.55,   # Good: matches calibration
    "Code (Python)":                 8.12,   # Worse: different activation ranges
    "Mathematical proofs":           9.45,   # Much worse: many outliers in math tokens
    "Chinese text":                  7.89,   # Worse: different token embeddings
    "Mixed (real deployment traffic)": 6.82, # Moderate: average of distributions
}

# With dynamic quantization:
eval_results_dynamic = {
    "WikiText-2":                    5.49,   # Slightly better
    "Code (Python)":                 5.52,   # Much better: adapts to code activations
    "Mathematical proofs":           5.68,   # Much better: handles math outliers
    "Chinese text":                  5.51,   # Much better: adapts to CJK embeddings
    "Mixed (real deployment traffic)": 5.52, # Consistent across distributions
}
⚠️ Warning

If your deployment serves heterogeneous traffic (multiple languages, code, math, structured data), dynamic quantization may be necessary to avoid quality regressions on out-of-calibration inputs. The calibration dataset must represent the full deployment distribution for static quantization to work well.
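The failure mode is easy to reproduce synthetically: freeze a scale on a narrow "calibration" distribution, then feed a wider "deployment" distribution through it (an illustrative sketch with made-up distributions, not the Llama numbers above):

```python
import torch

torch.manual_seed(0)
calib = torch.randn(10_000)          # "calibration" traffic, std = 1
deploy = torch.randn(10_000) * 4.0   # "deployment" traffic, std = 4

# Static: scale frozen from the calibration set
static_scale = calib.abs().max() / 127.0

def quantize_dequantize(x, scale):
    return torch.round(x / scale).clamp(-128, 127) * scale

# A large fraction of deployment values saturate at +/- 127 * static_scale
sat_frac = (deploy.abs() > 127 * static_scale).float().mean()

# Dynamic: scale recomputed from the actual input tensor
dynamic_scale = deploy.abs().max() / 127.0
err_static = (quantize_dequantize(deploy, static_scale) - deploy).abs().mean()
err_dynamic = (quantize_dequantize(deploy, dynamic_scale) - deploy).abs().mean()
assert err_dynamic < err_static
```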

Dynamic Quantization: Per-Token Online Scaling

Granularity Options

Dynamic quantization can operate at different granularities, each with a different accuracy-overhead trade-off:

# Per-tensor dynamic: one scale for the entire activation tensor
# Cheapest but worst accuracy (same problem as static with outliers)
def dynamic_per_tensor(x):
    scale = x.abs().max() / 127.0
    return torch.round(x / scale).clamp(-128, 127).to(torch.int8), scale

# Per-token dynamic: one scale per token (row) in the activation matrix
# Good accuracy-overhead balance. Most common in practice.
def dynamic_per_token(x):
    # x shape: [batch * seq_len, hidden_dim]
    # One scale per row
    scales = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.round(x / scales).clamp(-128, 127).to(torch.int8)
    return x_int8, scales  # scales shape: [batch * seq_len, 1]

# Per-group dynamic: one scale per group of elements within each token
# Best accuracy but highest overhead
def dynamic_per_group(x, group_size=128):
    # x shape: [batch * seq_len, hidden_dim]
    B, D = x.shape
    x_reshaped = x.view(B, D // group_size, group_size)
    scales = x_reshaped.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.round(x_reshaped / scales).clamp(-128, 127).to(torch.int8)
    return x_int8.view(B, D), scales.view(B, D // group_size)
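To make the accuracy difference concrete, compare per-tensor and per-token reconstruction error on a batch where one token has an unusually large norm (a sketch with synthetic data, mirroring `dynamic_per_token` above):

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 512)   # 8 tokens, hidden_dim 512
x[0] *= 30.0              # token 0 is a high-norm outlier token

# Per-tensor: one scale, dominated by the outlier token
s_tensor = x.abs().max() / 127.0
dq_tensor = torch.round(x / s_tensor).clamp(-128, 127) * s_tensor

# Per-token: one scale per row, as in dynamic_per_token above
s_token = x.abs().amax(dim=-1, keepdim=True) / 127.0
dq_token = torch.round(x / s_token).clamp(-128, 127) * s_token

# Reconstruction error on the normal tokens
err_tensor = (dq_tensor[1:] - x[1:]).abs().mean()
err_token = (dq_token[1:] - x[1:]).abs().mean()
assert err_token < err_tensor / 5   # per-token is far more accurate here
```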

Runtime Overhead

The cost of dynamic quantization is the amax reduction plus the quantization (divide, round, clamp) applied to every activation tensor at every layer:

// Per-token dynamic quantization kernel
// This runs at every linear layer, before the INT8 GEMM

__global__ void quantize_per_token_int8(
    const half* __restrict__ input,   // [M, K]
    int8_t* __restrict__ output,       // [M, K]
    float* __restrict__ scales,        // [M]
    int M, int K
) {
    int row = blockIdx.x;
    if (row >= M) return;

    // Step 1: Find max absolute value in this row (reduction)
    float local_max = 0.0f;
    for (int col = threadIdx.x; col < K; col += blockDim.x) {
        float val = __half2float(input[row * K + col]);
        local_max = fmaxf(local_max, fabsf(val));
    }

    // Warp-level reduction to find row max
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        local_max = fmaxf(local_max, __shfl_down_sync(0xffffffff, local_max, offset));
    }

    // Block-level reduction (across warps)
    __shared__ float warp_maxes[32];
    int warp_id = threadIdx.x / warpSize;
    int lane_id = threadIdx.x % warpSize;
    if (lane_id == 0) warp_maxes[warp_id] = local_max;
    __syncthreads();

    if (warp_id == 0) {
        local_max = (lane_id < blockDim.x / warpSize) ? warp_maxes[lane_id] : 0.0f;
        for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
            local_max = fmaxf(local_max, __shfl_down_sync(0xffffffff, local_max, offset));
        }
    }

    __shared__ float row_scale;
    if (threadIdx.x == 0) {
        row_scale = local_max / 127.0f;
        scales[row] = row_scale;
    }
    __syncthreads();

    // Step 2: Quantize each element using the block-wide scale
    // (read row_scale from shared memory; every thread sees the same value)
    float inv_scale = (row_scale > 0.0f) ? (1.0f / row_scale) : 0.0f;

    for (int col = threadIdx.x; col < K; col += blockDim.x) {
        float val = __half2float(input[row * K + col]);
        int q = __float2int_rn(val * inv_scale);
        q = max(-128, min(127, q));
        output[row * K + col] = (int8_t)q;
    }
}

// Cost analysis for hidden_dim=4096:
// Step 1 (amax): 4096 reads + log2 reductions = ~4096 * 2 bytes = 8 KB per row
// Step 2 (quantize): 4096 reads + 4096 writes = 8 + 4 = 12 KB per row
// Total: ~20 KB memory traffic per row
// At 3.35 TB/s (H100 HBM): 20 KB / 3.35e12 = ~6 ns per row
// For batch_size=256, seq_len=1: 256 rows * 6 ns = 1.5 us per layer
// Model with 32 layers, 7 linear ops per layer: 32 * 7 * 1.5 us = 336 us total
// This is ~0.3 ms overhead on top of a ~5-15 ms forward pass
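The back-of-envelope in the comments can be packaged as a tiny cost model (pure Python; the bandwidth and shape numbers are the nominal ones used above):

```python
# Nominal numbers from the analysis above
hidden = 4096                 # hidden_dim
hbm_bytes_per_s = 3.35e12     # H100 HBM3 bandwidth

step1_bytes = hidden * 2                 # amax pass: read FP16 row (8 KB)
step2_bytes = hidden * 2 + hidden * 1    # quantize pass: re-read FP16, write INT8 (12 KB)
bytes_per_row = step1_bytes + step2_bytes

ns_per_row = bytes_per_row / hbm_bytes_per_s * 1e9   # ~6 ns
rows, layers, linears_per_layer = 256, 32, 7
total_us = rows * ns_per_row * layers * linears_per_layer / 1e3

assert 250 < total_us < 450   # ~0.35 ms, consistent with the estimate above
```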

Dynamic Quantization Overhead (Llama-7B, H100, Single Forward Pass)

| Scenario | Overhead | Share of step time |
| --- | --- | --- |
| BS=1, decode | 45 us | 0.5% of 9 ms step |
| BS=32, decode | 180 us | 1.2% of 15 ms step |
| BS=1, prefill 2048 | 850 us | 2.1% of 40 ms step |
| BS=32, prefill 2048 | 2,400 us | 3.5% of 68 ms step |

SmoothQuant: Making Static Quantization Work

SmoothQuant addresses the activation outlier problem that makes static quantization difficult. The key observation: weight distributions are smooth (easy to quantize), but activation distributions have outliers in specific channels that persist across inputs.

# SmoothQuant: migrate quantization difficulty from activations to weights
# Y = (X * diag(s)^{-1}) * (diag(s) * W)
# The s vector scales down outlier activation channels and scales up
# the corresponding weight channels to compensate.

def smooth_quant(model, activation_scales, alpha=0.5):
    """
    activation_scales: per-channel max absolute activation value
                       shape [hidden_dim] per linear layer
    alpha: migration strength (0 = all on weights, 1 = all on activations)
    """
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Linear):
            continue

        act_scales = activation_scales[name]  # [in_features]
        weight_scales = module.weight.abs().max(dim=0).values  # [in_features]

        # Smoothing factor: s_j = max(|X_j|)^alpha / max(|W_:,j|)^(1-alpha)
        s = (act_scales.pow(alpha) / weight_scales.pow(1 - alpha)).clamp(min=1e-5)

        # Apply: divide activations by s (folded into preceding LayerNorm)
        # Multiply weights by s
        module.weight.data *= s.unsqueeze(0)

        # Fold 1/s into the preceding LayerNorm's weight and bias
        # (find_preceding_layernorm is a model-specific helper, not shown)
        prev_layernorm = find_preceding_layernorm(model, name)
        if prev_layernorm is not None:
            prev_layernorm.weight.data /= s
            if prev_layernorm.bias is not None:
                prev_layernorm.bias.data /= s

    return model
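The identity behind the migration, Y = (X * diag(s)^{-1}) * (diag(s) * W), can be verified numerically in a few lines (a sketch with synthetic shapes; s computed with the alpha = 0.5 formula above):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 16)
X[:, 3] *= 40.0                       # one outlier activation channel
W = torch.randn(16, 8)                # [in_features, out_features] here

act_scales = X.abs().amax(dim=0)      # per-input-channel activation max
w_scales = W.abs().amax(dim=1)        # per-input-channel weight max
alpha = 0.5
s = (act_scales.pow(alpha) / w_scales.pow(1 - alpha)).clamp(min=1e-5)

X_smooth = X / s                      # in practice folded into the LayerNorm
W_smooth = W * s.unsqueeze(1)         # absorbed into the weight matrix

# Exact equivalence in floating point: smoothing changes nothing pre-quantization
assert torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3)
# The outlier channel's dynamic range has been migrated into the weights
assert X_smooth.abs().max() < X.abs().max()
```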

After SmoothQuant, the activation distribution is much more uniform, and static per-tensor quantization approaches dynamic per-token quality:

# Before SmoothQuant:
# Activation channel magnitudes (example): [0.5, 0.3, 45.2, 0.4, 0.6, 38.1, ...]
# Per-tensor scale = 45.2 / 127 = 0.356
# Channel with magnitude 0.3 quantizes to: round(0.3 / 0.356) = round(0.84) = 1
# Dequantized: 1 * 0.356 = 0.356 vs true 0.3 -> 18.7% relative error

# After SmoothQuant (alpha=0.5):
# Activation channel magnitudes: [1.2, 0.8, 4.8, 1.0, 1.4, 4.2, ...]
# Per-tensor scale = 4.8 / 127 = 0.038
# Channel with magnitude 0.8 quantizes to: round(0.8 / 0.038) = round(21.1) = 21
# Dequantized: 21 * 0.038 = 0.798 vs true 0.8 -> 0.25% relative error
SmoothQuant Effect on Static Quantization (Llama-2-7B W8A8)

| Configuration | PPL | Delta from FP16 | Throughput (H100) |
| --- | --- | --- | --- |
| FP16 baseline | 5.47 | 0 | 4200 tok/s |
| Static W8A8 (no smoothing) | 5.89 | +0.42 | 6800 tok/s |
| Static W8A8 + SmoothQuant (alpha=0.5) | 5.55 | +0.08 | 6750 tok/s |
| Dynamic W8A8 (per-token) | 5.49 | +0.02 | 6400 tok/s |
| Static W8A8 + SmoothQuant (alpha=0.75) | 5.52 | +0.05 | 6750 tok/s |
ℹ️ Note

SmoothQuant’s alpha hyperparameter controls the trade-off. Higher alpha migrates more difficulty to weights (better for activation outlier models like OPT and BLOOM). Lower alpha keeps more on activations (better when weights already have outlier channels). For Llama-family models, alpha between 0.5 and 0.75 is typically optimal.

Hardware Kernel Support

Not all INT8 kernels support all quantization modes. The kernel determines whether you can use static or dynamic quantization.

cuBLAS INT8 GEMM (cublasLtMatmul):
  - Supports per-tensor and per-column (per-channel) scales
  - Static or dynamic: both work (you provide the scale)
  - A_int8 [M,K] * B_int8 [K,N] -> C_int32 [M,N]
  - Dequantization: C_fp = C_int32 * scale_A * scale_B
  - Per-token: scale_A is [M,1], per-channel: scale_B is [1,N]

TensorRT INT8:
  - Per-tensor static scales embedded in the engine
  - No per-token dynamic support in the fused INT8 path
  - Must use calibration (entropy or percentile)

FasterTransformer / TensorRT-LLM:
  - SmoothQuant kernels support per-token dynamic activation scales
  - Weight scales are per-channel (static)
  - Fused attention + linear kernels with inline quantization

vLLM (W8A8):
  - Uses cutlass INT8 GEMM with per-token activation scales
  - Supports both dynamic per-token and static per-tensor
  - SmoothQuant integration for static mode

llama.cpp:
  - CPU INT8: per-block quantization (block_size=32)
  - Dynamic within each block (scale computed per 32 elements)
  - Hybrid approach: small blocks approximate per-token behavior
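That per-block scheme can be sketched in a few lines (a simplified Q8_0-style layout; llama.cpp's real format packs each scale together with its 32 int8 values into one struct):

```python
import torch

def quantize_q8_0_like(x: torch.Tensor, block_size: int = 32):
    # x: 1-D tensor whose length is a multiple of block_size
    blocks = x.view(-1, block_size)
    scales = (blocks.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-12)
    q = torch.round(blocks / scales).clamp(-128, 127).to(torch.int8)
    return q, scales

def dequantize_q8_0_like(q, scales):
    return (q.float() * scales).flatten()

x = torch.randn(4096)
q, scales = quantize_q8_0_like(x)
x_dq = dequantize_q8_0_like(q, scales)
# Per-block error is bounded by half the local (block) scale
assert (x - x_dq).abs().max() <= scales.max() / 2 + 1e-6
```

Because each scale covers only 32 contiguous values, the block scale tracks the local data much as a per-token scale tracks each row, which is why the post calls it a hybrid of static layout and dynamic behavior.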

Kernel Performance Comparison

# Benchmark: static vs dynamic INT8 GEMM performance
# Matrix: M=256 (batch), K=4096 (input), N=4096 (output)
# Hardware: H100 SXM

# Static per-tensor:
#   Scale computation: 0 (pre-computed)
#   GEMM time: 42 us
#   Dequant: included in epilogue (scale * output)
#   Total: 42 us

# Dynamic per-token:
#   Scale computation: 8 us (amax reduction over K=4096 for 256 rows)
#   GEMM time: 42 us (same GEMM kernel, different scale application)
#   Dequant: included in epilogue (per-row scale * output)
#   Total: 50 us

# Dynamic per-token with fused kernel:
#   Scale computation: fused into the GEMM prologue
#   GEMM time: 45 us (slightly slower due to fused quantization)
#   Total: 45 us

# The overhead of dynamic quantization is typically 5-15% per GEMM call
# At the model level: ~3-8% total latency overhead

The Per-Token + Per-Channel Regime

The dominant production configuration for W8A8 INT8 inference is per-token dynamic activation scaling with per-channel static weight scaling. This is the best accuracy-performance trade-off:

# W8A8 per-token x per-channel: the production standard

def w8a8_per_token_per_channel(x_fp16, weight_int8, weight_scales):
    """
    x_fp16: [M, K] activation tensor (FP16)
    weight_int8: [N, K] quantized weights (INT8)
    weight_scales: [N] per-output-channel weight scales (FP32)
    """
    # Dynamic per-token activation quantization
    # One scale per row of x
    act_absmax = x_fp16.abs().amax(dim=-1, keepdim=True)  # [M, 1]
    act_scales = act_absmax / 127.0                         # [M, 1]
    x_int8 = (x_fp16 / act_scales).round().clamp(-128, 127).to(torch.int8)

    # INT8 GEMM: x_int8 [M,K] x weight_int8^T [K,N] -> out_int32 [M,N]
    out_int32 = torch.matmul(x_int8.int(), weight_int8.int().T)

    # Dequantize with per-token x per-channel scales
    # out_fp = out_int32 * act_scales[M,1] * weight_scales[1,N]
    out_fp = out_int32.float() * act_scales * weight_scales.unsqueeze(0)

    return out_fp.half()

# Why this works:
# - Per-token handles activation outliers (tokens with high norms)
# - Per-channel handles weight outlier channels
# - The GEMM itself is pure INT8 (maximum hardware utilization)
# - Only the epilogue (dequantization) uses FP32 (cheap)
Performance

Per-token x per-channel is the sweet spot because the per-token scale computation is embarrassingly parallel (one reduction per row) and the per-channel weight scales are pre-computed constants. The GEMM kernel can fuse the dequantization into its epilogue with minimal overhead. This is what vLLM, TensorRT-LLM, and FasterTransformer all use for their W8A8 paths.

Decision Framework

# Decision tree for static vs dynamic quantization

def choose_quantization_mode(
    model_family,
    has_activation_outliers,
    calibration_data_available,
    calibration_matches_deployment,
    latency_sensitivity,
    hardware
):
    # Rule 1: If no calibration data, dynamic is the only option
    if not calibration_data_available:
        return "dynamic_per_token"

    # Rule 2: If calibration does not match deployment distribution
    if not calibration_matches_deployment:
        return "dynamic_per_token"

    # Rule 3: If model has severe activation outliers (OPT, BLOOM)
    if has_activation_outliers == "severe":
        if hardware in ["H100", "A100"]:
            return "static_with_smoothquant"  # Best throughput
        else:
            return "dynamic_per_token"  # Fallback

    # Rule 4: If latency is critical and calibration is representative
    if latency_sensitivity == "extreme":
        return "static_per_tensor_with_entropy_calibration"

    # Rule 5: Default for most LLM deployments
    return "dynamic_per_token_per_channel"
Static vs Dynamic: Full Decision Matrix

| Scenario | Recommended Mode | Reason |
| --- | --- | --- |
| Llama-2 on H100, batch serving | Static + SmoothQuant | Stable distributions, max throughput |
| Multilingual chatbot, mixed traffic | Dynamic per-token | Input distribution varies widely |
| Code generation (Codestral) | Dynamic per-token | Code activations differ from text calibration |
| TensorRT engine, fixed-size batches | Static (entropy calibration) | TensorRT prefers static, engine is fixed |
| Edge deployment, CPU inference | Static per-group (block_q8_0) | No dynamic support in most CPU kernels |
| OPT-175B with massive outliers | Static + SmoothQuant alpha=0.85 | SmoothQuant tames the outliers |
| Fine-tuned model, no calibration data | Dynamic per-token | Cannot trust calibration for fine-tuned weights |

Practical Implementation Comparison

Static Path (TensorRT-LLM with SmoothQuant)

# TensorRT-LLM static quantization pipeline
from tensorrt_llm.quantization import quantize_and_export

# Step 1: Collect activation statistics
calib_config = {
    "algorithm": "smoothquant",
    "smoothquant_alpha": 0.5,
    "calibration_dataset": "cnn_dailymail",
    "num_calibration_samples": 512,
    "calibration_sequence_length": 2048,
}

# Step 2: Apply SmoothQuant + compute static scales
quantize_and_export(
    model_dir="meta-llama/Llama-2-7b-hf",
    output_dir="llama-7b-sq-int8",
    quant_config={
        "quant_algo": "W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN",
        "calib_dataset": calib_config["calibration_dataset"],
        "calib_samples": calib_config["num_calibration_samples"],
        "smoothquant_alpha": calib_config["smoothquant_alpha"],
    }
)

# Step 3: Build engine with static INT8
# Scales are baked into the engine binary
# No runtime scale computation needed

Dynamic Path (vLLM with Dynamic Quantization)

# vLLM dynamic quantization - no calibration step needed
from vllm import LLM, SamplingParams

# Just load the model with quantization config
# vLLM handles dynamic per-token scaling internally
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="int8",            # W8A8 with dynamic per-token
    dtype="float16",
    max_model_len=4096,
)

# Or with a pre-quantized model (AutoGPTQ, etc):
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
)

End-to-End Latency: Static vs Dynamic (Llama-2-7B, H100, decode BS=1)

| Configuration | Latency (ms, lower is better) | Note |
| --- | --- | --- |
| FP16 baseline | 9.2 | Reference |
| Static W8A8 + SmoothQuant | 6.1 | 34% faster than FP16 |
| Dynamic W8A8 per-token | 6.5 | 29% faster than FP16 |
| Static W8A8 (no smoothing) | 6.0 | Fastest, but +0.42 PPL |
| Dynamic W8A8 per-tensor | 6.3 | Worse accuracy than per-token |

Summary

Static quantization with SmoothQuant is the throughput-optimal choice when calibration data representative of the deployment distribution is available. Dynamic per-token quantization is the accuracy-optimal and safest choice, costing only 3-8% throughput overhead. The per-token activation scale combined with per-channel weight scale has become the default production configuration for W8A8 inference because it handles activation outliers gracefully without requiring offline calibration. Use static when you control the input distribution and need every microsecond of latency. Use dynamic when the input distribution is unpredictable or when you cannot afford calibration pipeline maintenance.