Large language models are enormous. A 70-billion-parameter model stored in FP16 occupies 140 GB of GPU memory — more than a single A100 80 GB card can hold. Quantization is the single most impactful technique for making these models practical: it shrinks their memory footprint, accelerates inference, and in many cases costs almost nothing in output quality. This post is a deep, technical treatment of how quantization works for LLM inference, what methods exist, which hardware supports what, and how to make the right choices for production.
Why Quantization Matters for LLMs
LLM inference has two distinct phases: prefill (processing the prompt) and decode (generating tokens one at a time). Prefill is compute-bound — you are doing a large batched matrix multiplication over all prompt tokens simultaneously. Decode is memory-bandwidth-bound — each generated token requires reading the entire model’s weights from GPU memory, but only performs a single matrix-vector product per layer.
This memory-bandwidth bottleneck is the reason quantization matters so much. During decode, the GPU spends most of its time waiting for weights to arrive from HBM. Every byte you shave off each parameter translates almost linearly into faster token generation.
The core insight: During autoregressive decode, inference throughput is determined by how fast you can stream weights from memory, not by how fast you can multiply numbers. Quantization reduces the bytes per parameter, directly increasing throughput.
The arithmetic is straightforward:
Memory Footprint by Precision — Llama 2 70B
| Precision | Bits/Param | Model Size | GPUs Needed (80 GB) | Relative Decode Throughput |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | 4x A100 | 0.5x |
| BF16/FP16 | 16 | 140 GB | 2x A100 | 1.0x (baseline) |
| INT8 / FP8 | 8 | 70 GB | 1x A100 | ~2.0x |
| INT4 (packed) | 4 | 35 GB | 1x A100 (half) | ~3.5-4.0x |
Going from FP16 to INT8 halves the memory, which means the model fits on half the GPUs. Going to INT4 quarters it. For a 70B model, the difference between needing two A100-80GB cards and needing one is enormous in terms of cost per token. At cloud GPU prices of roughly $2-3/hr per A100, halving GPU count halves your inference cost.
But memory savings are only half the story. Because decode is bandwidth-bound, moving half the bytes means tokens come out roughly twice as fast. On an A100 with ~2 TB/s HBM bandwidth, a 140 GB FP16 model takes ~70 ms just to read all weights once (one decode step). At INT8, that drops to ~35 ms. At INT4, ~17.5 ms. These are theoretical lower bounds, but real-world measurements track closely.
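The bandwidth arithmetic above is easy to verify. A quick sketch (the 70B parameter count and ~2 TB/s A100 bandwidth are the same assumptions as in the text):

```python
# Theoretical lower bound on per-token decode latency: every decode step
# must stream all weight bytes from HBM once.

def decode_latency_ms(num_params: float, bits_per_param: int,
                      bandwidth_gb_s: float = 2000.0) -> float:
    """Time (ms) to read all weights once at the given precision."""
    model_bytes = num_params * bits_per_param / 8
    return model_bytes / (bandwidth_gb_s * 1e9) * 1e3

for bits in (16, 8, 4):
    print(bits, round(decode_latency_ms(70e9, bits), 1))
# 16 -> 70.0 ms, 8 -> 35.0 ms, 4 -> 17.5 ms
```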
Theoretical Decode Latency vs. Precision (70B Model, A100 80GB)
| Metric | FP16 (140 GB) | INT8 (70 GB) | INT4 (35 GB) |
|---|---|---|---|
| Time to Read Weights (ms) | ~70 | ~35 | ~17.5 |
Quantization Fundamentals
Before diving into specific LLM methods, we need a solid understanding of the underlying math.
Linear (Uniform) Quantization
The most common form of quantization is linear quantization, which maps a continuous range of floating-point values to a discrete set of integers. The mapping is defined by two parameters: a scale factor $s$ and a zero point $z$:

$$q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right) + z,\ q_{\min},\ q_{\max}\right)$$

To dequantize (reconstruct the approximate floating-point value):

$$\hat{x} = s \cdot (q - z)$$

The scale $s$ determines how much real-valued range each integer step covers. The zero point $z$ shifts the integer range so that the real value 0.0 maps exactly to an integer. This matters because neural networks are full of zeros (ReLU outputs, padding, sparse activations) and you want zero to be represented without error.
import torch

def linear_quantize(tensor: torch.Tensor, num_bits: int = 8, symmetric: bool = True):
    """Linear quantization of a tensor to num_bits precision."""
    if symmetric:
        # Symmetric: zero point is 0, range is [-qmax - 1, qmax]
        qmax = (1 << (num_bits - 1)) - 1  # e.g., 127 for INT8
        scale = tensor.abs().max() / qmax
        zero_point = 0
    else:
        # Asymmetric: full range [qmin, qmax] maps to [tensor.min(), tensor.max()]
        qmin = 0
        qmax = (1 << num_bits) - 1  # e.g., 255 for UINT8
        scale = (tensor.max() - tensor.min()) / (qmax - qmin)
        zero_point = int(torch.round(-tensor.min() / scale))
    quantized = torch.clamp(
        torch.round(tensor / scale) + zero_point,
        -(1 << (num_bits - 1)) if symmetric else 0,
        (1 << (num_bits - 1)) - 1 if symmetric else (1 << num_bits) - 1,
    )
    # Asymmetric INT8 values span [0, 255], which overflows int8 -> use uint8
    if num_bits <= 8:
        dtype = torch.int8 if symmetric else torch.uint8
    else:
        dtype = torch.int16
    return quantized.to(dtype), scale, zero_point

def linear_dequantize(quantized: torch.Tensor, scale: float, zero_point: int):
    """Dequantize back to floating point."""
    return scale * (quantized.float() - zero_point)
Symmetric vs. Asymmetric Quantization
Symmetric quantization sets $z = 0$ and maps the range $[-\alpha, \alpha]$ to $[-q_{\max}, q_{\max}]$, where $\alpha = \max_i |x_i|$. This is simpler: dequantization is just $\hat{x} = s \cdot q$, with no zero-point subtraction needed. The downside: if the distribution is skewed (e.g., all positive values from a ReLU), you waste half the integer range.
Asymmetric quantization uses the full integer range by mapping $[x_{\min}, x_{\max}]$ to $[q_{\min}, q_{\max}]$ (e.g., $[0, 255]$ for UINT8). This is more accurate for skewed distributions but adds the cost of storing and computing with the zero point.
In practice for LLMs: Weights tend to be roughly symmetric around zero, so symmetric quantization works well for weights. Activations can be skewed (especially post-ReLU or post-GeLU), so asymmetric quantization sometimes helps for activations — though many production systems use symmetric for both to keep kernels simple.
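A toy check of this tradeoff, on a synthetic all-positive (ReLU-like) distribution: asymmetric INT8 uses the full [0, 255] range, while symmetric INT8 wastes the negative half of [-128, 127], roughly doubling the step size.

```python
import torch

torch.manual_seed(0)
x = torch.rand(10_000) * 6.0                 # skewed: all values in [0, 6)

# Symmetric: scale must cover [-max|x|, max|x|], half the range goes unused
s_sym = x.abs().max() / 127
x_sym = torch.round(x / s_sym).clamp(-128, 127) * s_sym

# Asymmetric: [min, max] maps onto the full [0, 255] range via a zero point
s_asym = (x.max() - x.min()) / 255
z = torch.round(-x.min() / s_asym)
x_asym = ((torch.round(x / s_asym) + z).clamp(0, 255) - z) * s_asym

mse_sym = (x - x_sym).pow(2).mean()
mse_asym = (x - x_asym).pow(2).mean()        # roughly 4x lower (half the step size)
```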
Granularity: Per-Tensor vs. Per-Channel vs. Per-Group
The scale factor and zero point can be computed at different granularities, with a direct tradeoff between accuracy and storage overhead:
Per-tensor: One scale and zero point for the entire weight matrix. Simplest, lowest overhead, but worst accuracy — a single outlier in a million-element tensor sets the scale for everything.
Per-channel (per-row/per-column): One scale per output channel of a linear layer. For a weight matrix $W \in \mathbb{R}^{m \times n}$ ($m$ output channels, $n$ input features), this means $m$ scales. This is the standard for INT8 weight quantization and works well because different output neurons can have very different magnitude distributions.
Per-group: Divide each row into groups of $g$ elements (commonly $g = 128$) and compute a scale per group. For a weight matrix with $n$ columns, this means $n/g$ scales per row, or $m \cdot n / g$ total. Per-group quantization is the standard for INT4 weight quantization (GPTQ, AWQ) because at 4-bit precision, per-channel is not fine-grained enough.
Quantization Granularity Tradeoffs
| Granularity | Scales Stored (m x n matrix) | Overhead at INT4 | Typical Use |
|---|---|---|---|
| Per-tensor | 1 | Negligible | INT8 activations |
| Per-channel | m | ~0.4% (FP16 scales) | INT8 weights |
| Per-group (g=128) | m * n / 128 | ~3.1% (FP16 scales) | INT4 weights (GPTQ, AWQ) |
| Per-group (g=32) | m * n / 32 | ~12.5% (FP16 scales) | Ultra-low-bit (2-3 bit) |
The overhead column matters: with per-group quantization at group size 128 and FP16 scales, each group of 128 4-bit values (64 bytes) also needs a 2-byte scale, adding about 3.1% overhead. At group size 32, the overhead climbs to 12.5%. This is why group size 128 is the sweet spot for INT4.
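A minimal sketch of per-group symmetric INT4 quantization as described above (function names are mine; real implementations pack two 4-bit values per byte, which this sketch skips for clarity):

```python
import torch

def quantize_per_group_int4(W: torch.Tensor, group_size: int = 128):
    """W: (out_features, in_features); in_features must divide by group_size."""
    m, n = W.shape
    groups = W.reshape(m, n // group_size, group_size)
    # Symmetric INT4: integer range [-8, 7], so qmax = 7
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7
    q = torch.round(groups / scales).clamp(-8, 7).to(torch.int8).reshape(m, n)
    return q, scales.squeeze(-1).half()      # scales: (m, n // group_size), FP16

def dequantize_per_group_int4(q: torch.Tensor, scales: torch.Tensor,
                              group_size: int = 128):
    m, n = q.shape
    groups = q.reshape(m, n // group_size, group_size).float()
    return (groups * scales.float().unsqueeze(-1)).reshape(m, n)
```

The overhead figure from the table falls out directly: each group of 128 INT4 values occupies 64 bytes, plus one 2-byte FP16 scale, giving 2/64 = 3.125%.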
Why LLMs Are Hard to Quantize: The Outlier Problem
Naive quantization of LLM activations often fails catastrophically, and the reason was identified by Dettmers et al. (2022) and later by the SmoothQuant paper (Xiao et al., 2023): LLMs develop massive activation outliers.
In transformer models, certain hidden-state channels consistently produce values 10-100x larger than the rest. For example, in a layer where most activation values fall in $[-1, 1]$, a few channels might hit magnitudes of 50-100. If you quantize per-tensor, the scale must accommodate those outliers, meaning the vast majority of "normal" values get crushed into a tiny fraction of the integer range, destroying information.
These outliers appear in specific channels (feature dimensions) and are consistent across tokens, which means they are a property of the model weights, not the input data. This insight is what drives SmoothQuant and related methods: if the outliers are in known channels, you can mathematically redistribute the quantization difficulty.
Key insight: The difficulty of quantizing LLMs is not about the weights — weights are usually well-behaved and easy to quantize. The difficulty is in the activations, which develop systematic outlier channels. This is why weight-only quantization (W4A16, W8A16) is so much easier than weight+activation quantization (W8A8, W4A4).
Weight-Only Quantization: W4A16 and W8A16
Weight-only quantization stores the model weights in low precision (INT4 or INT8) but performs the actual matrix multiplication in FP16. At inference time, each weight group is dequantized on-the-fly to FP16 before the GEMM. Because weight loading from HBM is the bottleneck during decode, you still get the bandwidth savings — the dequantization happens in registers or shared memory, which is much faster.
This approach sidesteps the activation outlier problem entirely: activations stay in FP16 throughout.
Round-to-Nearest (RTN)
The simplest baseline: just round each weight to the nearest quantized value. Surprisingly, RTN works reasonably well at INT8. At INT4, it fails badly: a 7B model loses nearly a full perplexity point on WikiText-2 (see the comparison table below), and smaller models degrade much more.
GPTQ: One-Shot Post-Training Quantization
GPTQ (Frantar et al., 2023) is the most widely used INT4 weight quantization method. It builds on the Optimal Brain Quantization (OBQ) framework, which quantizes weights one at a time and adjusts the remaining unquantized weights to compensate for each quantization error.
How GPTQ works, step by step:
1. Collect calibration data: Run a small calibration set (typically 128 samples from C4 or WikiText) through the model and record the input activations to each linear layer.
2. Compute the Hessian approximation: For each linear layer with weight $W$ and calibration inputs $X$, compute $H = 2XX^\top$ (the Hessian of the layer-wise reconstruction error $\lVert WX - \hat{W}X \rVert_2^2$ with respect to $W$).
3. Quantize column by column: Process the weight matrix one column at a time. For each column $j$:
   - Quantize $w_j$ to $\hat{w}_j$ using round-to-nearest with the chosen bit-width and group size.
   - Compute the quantization error: $e = (w_j - \hat{w}_j) / [H^{-1}]_{jj}$.
   - Compensate by adjusting all remaining unquantized columns: $w_k \leftarrow w_k - e \cdot [H^{-1}]_{jk}$ for all $k > j$. This is the key step: it uses the Hessian to distribute the error in a way that minimizes the impact on the layer's output.
4. Repeat for every linear layer in the model, processing layers sequentially (so earlier layers are quantized before later ones).
The Hessian-based error compensation is what makes GPTQ dramatically better than RTN at INT4. The entire process takes minutes on a single GPU, which is why it is called “one-shot” — no iterative training loop is required.
# Simplified GPTQ pseudocode for a single linear layer
def gptq_quantize_layer(W, X_cal, num_bits=4, group_size=128):
    """
    W: weight matrix (out_features x in_features)
    X_cal: calibration inputs (in_features x num_samples)
    """
    H = 2 * X_cal @ X_cal.T  # Hessian approximation
    # Inverse Hessian via a damped Cholesky factorization
    L = torch.linalg.cholesky(H + 1e-6 * torch.eye(H.shape[0]))
    H_inv = torch.cholesky_inverse(L)
    n_cols = W.shape[1]
    for col in range(n_cols):
        # Quantize this column
        w = W[:, col].clone()
        scale = compute_group_scale(w, num_bits, group_size)
        w_q = quantize(w, scale, num_bits)
        error = W[:, col] - dequantize(w_q, scale)
        W[:, col] = dequantize(w_q, scale)
        # Distribute the error onto remaining columns via the inverse Hessian
        if col + 1 < n_cols:
            W[:, col + 1:] -= error.unsqueeze(1) * H_inv[col, col + 1:].unsqueeze(0) / H_inv[col, col]
    return W  # Now contains dequantized quantized weights
AWQ: Activation-Aware Weight Quantization
AWQ (Lin et al., 2024) takes a different approach based on a key observation: not all weights are equally important. Specifically, a small fraction of weight channels (roughly 1%) correspond to large activation magnitudes and are disproportionately important for model quality. Quantization errors in these “salient” channels cause much larger output errors.
AWQ works by applying per-channel scaling to the weights before quantization:
1. Identify salient channels: Run calibration data and measure the average magnitude of each input activation channel.
2. Compute optimal scales: For each input channel $j$, find a scale $s_j$ that minimizes the quantization error when the layer is rewritten with $W_{:,j} \cdot s_j$ and $x_j / s_j$. The math is $y = (W \,\mathrm{diag}(s))(\mathrm{diag}(s)^{-1} x) = Wx$, so the product is preserved exactly, but the weight distribution is "smoothed" before quantization.
3. Quantize the scaled weights: Now quantize $W \,\mathrm{diag}(s)$. The salient channels have been scaled up (making them more robust to rounding) while the less important channels have been scaled down.
The key difference from GPTQ: AWQ does not modify the remaining weights after quantizing each column. Instead, it finds a good pre-scaling that makes the entire weight matrix more quantization-friendly, then quantizes directly. This makes AWQ faster to run and sometimes more robust.
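The scale search can be sketched as a small grid search. This is a toy version under simplifying assumptions of mine: plain weight MSE as the objective (real AWQ measures reconstruction error on calibration activations and applies the scales to fused layer pairs), and symmetric per-channel round-to-nearest as the quantizer.

```python
import torch

def awq_style_scale_search(W: torch.Tensor, act_magnitude: torch.Tensor,
                           n_grid: int = 20, num_bits: int = 4):
    """
    W: (out_features, in_features)
    act_magnitude: mean |activation| per input channel, shape (in_features,)
    Tries s = act_magnitude ** alpha over a grid of alpha values and keeps
    the scaling whose quantize(W * s) / s best reconstructs W.
    """
    qmax = 2 ** (num_bits - 1) - 1

    def rtn(w):  # symmetric per-output-channel round-to-nearest
        s = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        return torch.round(w / s).clamp(-qmax - 1, qmax) * s

    best_err, best_s = float("inf"), torch.ones_like(act_magnitude)
    for i in range(n_grid):
        alpha = i / n_grid
        s = act_magnitude.pow(alpha).clamp(min=1e-4)
        err = (W - rtn(W * s) / s).pow(2).mean().item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err
```

Note that the grid includes alpha = 0 (no scaling), so the search can never do worse than plain round-to-nearest.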
SqueezeLLM: Non-Uniform Quantization
SqueezeLLM (Kim et al., 2024) departs from uniform (linear) quantization entirely. It uses two innovations:
1. Non-uniform quantization: Instead of evenly spaced quantization levels, use k-means clustering to find the optimal placement of quantization levels for each weight group. This better matches the actual weight distribution, which is typically bell-shaped (not uniform).
2. Sparse outlier storage: Extract the most extreme weight values and store them separately in a sparse format (CSR). The remaining weights, with outliers removed, are much easier to quantize uniformly.
The combination allows SqueezeLLM to achieve INT3 quality that rivals other methods at INT4, but the non-uniform dequantization requires lookup tables, which can complicate kernel implementation.
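The clustering idea can be illustrated with plain 1-D Lloyd's k-means. This is a minimal sketch, not SqueezeLLM's actual variant, which weights each value by a Hessian-based sensitivity score:

```python
import torch

def kmeans_codebook(w: torch.Tensor, num_bits: int = 3, iters: int = 25):
    """Plain 1-D Lloyd's k-means over a flat weight tensor.
    Returns 2**num_bits centroids (the lookup table) and per-weight codes."""
    k = 2 ** num_bits
    # Initialize centroids at evenly spaced quantiles of the distribution
    centroids = torch.quantile(w, torch.linspace(0, 1, k))
    for _ in range(iters):
        # Assign each weight to its nearest centroid ...
        codes = (w.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
        # ... then move each centroid to the mean of its cluster
        for j in range(k):
            members = w[codes == j]
            if members.numel() > 0:
                centroids[j] = members.mean()
    return centroids, codes

# Dequantization is a table lookup: w_hat = centroids[codes]
```

On a bell-shaped distribution, the learned levels cluster densely near zero where most weights live, which is exactly why non-uniform quantization beats uniform levels at equal bit-width.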
Comparison: GPTQ vs. AWQ vs. RTN
INT4 Weight Quantization Methods — Llama 2 7B WikiText-2 Perplexity
| Method | Group Size | Perplexity | Quantization Time | Inference Speed |
|---|---|---|---|---|
| FP16 (baseline) | — | 5.47 | — | 1.0x |
| RTN | 128 | 6.29 | < 1 min | 1.0x (same kernels) |
| GPTQ | 128 | 5.63 | ~5 min | 1.0x (same kernels) |
| AWQ | 128 | 5.60 | ~3 min | 1.0x (same kernels) |
| SqueezeLLM | — | 5.54 | ~20 min | 0.8-0.9x (LUT overhead) |
Practical takeaway: GPTQ and AWQ produce comparable quality and use the same inference kernels. AWQ is slightly faster to quantize and often marginally better on quality. Both are far superior to naive RTN at INT4. For most production use cases, AWQ is the recommended default.
At INT4, inference speed for GPTQ and AWQ depends on the dequantization kernel, not the quantization method. Both produce the same format: INT4 weights with FP16 group scales. The actual speedup during inference comes from specialized kernels like Marlin (more on this in the production section).
Weight + Activation Quantization: W8A8 and Beyond
Weight-only quantization is simple and effective, but it leaves performance on the table. If you also quantize the activations, you can use integer GEMM kernels that run natively on tensor cores — avoiding the dequantize-then-FP16-GEMM overhead entirely. This is the domain of W8A8 (INT8 weights, INT8 activations) and FP8 quantization.
SmoothQuant: Making Activations Quantizable
As discussed above, the challenge is activation outliers. SmoothQuant (Xiao et al., 2023) solves this with a mathematically elegant trick: migrate the quantization difficulty from activations to weights using a per-channel scaling transformation.
For a linear layer computing $y = Wx$, SmoothQuant introduces a per-channel smoothing vector $s$, applied as a diagonal scaling matrix $\mathrm{diag}(s)$:

$$y = \left(W \,\mathrm{diag}(s)\right)\left(\mathrm{diag}(s)^{-1} x\right)$$

The scale vector is chosen per-channel to balance the quantization difficulty:

$$s_j = \frac{\max\left(|x_j|\right)^{\alpha}}{\max\left(|W_{:,j}|\right)^{1-\alpha}}$$

where $\alpha$ is a hyperparameter (typically 0.5) that controls how much difficulty is migrated. When $\alpha = 1$, all difficulty goes to the weights; when $\alpha = 0$, all difficulty stays in the activations. The sweet spot is usually $\alpha = 0.5$, which equalizes the quantization ranges.
After smoothing:
- $\mathrm{diag}(s)^{-1} x$ has smaller per-channel outliers (divided by $s_j$), so it is easier to quantize.
- $W \,\mathrm{diag}(s)$ has slightly larger values in the corresponding channels (multiplied by $s_j$), but weights are well-behaved, so this is fine.
The scaling can be fused into the preceding LayerNorm or into the weight matrix directly, so there is zero runtime overhead. The result: both weights and activations can be quantized to INT8 with minimal quality loss, and the entire GEMM runs on INT8 tensor cores.
def smooth_layer(weight, activation_scales, alpha=0.5):
    """
    Apply SmoothQuant transformation to a linear layer.
    weight: (out_features, in_features)
    activation_scales: per-channel max absolute activation values
        from calibration (shape: [in_features])
    """
    weight_scales = weight.abs().max(dim=0).values  # per-input-channel max of weights
    # Compute smoothing factor
    s = (activation_scales.pow(alpha) / weight_scales.pow(1 - alpha)).clamp(min=1e-5)
    # Apply: scale up weights; the matching 1/s is fused into the preceding LayerNorm
    smoothed_weight = weight * s.unsqueeze(0)  # broadcast across output dim
    return smoothed_weight, s  # s is fused into preceding LayerNorm
FP8: The Hopper Generation Native Format
NVIDIA’s H100 (Hopper architecture) introduced native FP8 tensor core support, and it has become the preferred precision for high-throughput LLM inference. FP8 comes in two formats:
E4M3 (4-bit exponent, 3-bit mantissa): Dynamic range of $\pm 448$, with only 3 mantissa bits of precision (versus FP16's 10). Used for the forward pass (inference) because the limited range is sufficient for typical weight and activation distributions.
E5M2 (5-bit exponent, 2-bit mantissa): Dynamic range of $\pm 57344$ (close to FP16's $\pm 65504$), but lower precision. Used for the backward pass (training gradients), where the wider range prevents underflow in small gradients.
FP8 Formats Compared
| Format | Exponent Bits | Mantissa Bits | Dynamic Range | Primary Use |
|---|---|---|---|---|
| FP32 | 8 | 23 | ±3.4e38 | Training (master weights) |
| BF16 | 8 | 7 | ±3.4e38 | Training / Inference |
| FP16 | 5 | 10 | ±65504 | Inference baseline |
| FP8 E5M2 | 5 | 2 | ±57344 | Training (gradients) |
| FP8 E4M3 | 4 | 3 | ±448 | Inference (weights + activations) |
FP8 quantization is much simpler than INT8 because it is a floating-point format — you just need a single scale factor to shift the tensor’s dynamic range into the representable FP8 range. There are no zero points, no group decomposition, and the tensor core GEMM produces FP16 or FP32 output directly. NVIDIA’s Transformer Engine library handles the scaling automatically.
Why FP8 is winning: FP8 E4M3 is essentially “free” quantization on H100. The tensor cores run FP8 GEMM at 2x the FLOPS of FP16 (3958 TFLOPS vs. 1979 TFLOPS on H100 SXM), the quality loss is negligible for most models, and the tooling (Transformer Engine) handles all the scaling automatically. If you are deploying on H100 or newer, FP8 should be your default inference precision.
INT8 GEMM on A100
On A100 (Ampere), there is no native FP8 support, but INT8 tensor cores deliver 624 TOPS vs. 312 TFLOPS for FP16 — a 2x compute throughput advantage. Combined with the 2x memory bandwidth savings from smaller weights, INT8 inference on A100 delivers roughly 2x end-to-end speedup over FP16 for decode-bound workloads.
The cuBLAS library provides cublasLtMatmul with INT8 input types, accumulating in INT32 before converting to the output type. The main implementation challenge is the quantization overhead: computing scales, quantizing activations on-the-fly, and dequantizing the INT32 accumulator. With SmoothQuant or static per-tensor activation scales, this overhead is minimal.
W4A4: The Frontier
W4A4 (4-bit weights and 4-bit activations) represents the frontier of quantization research. The potential is enormous: INT4 GEMM on hypothetical future hardware could deliver 4x the throughput of FP16. Current challenges:
- Activation quantization at INT4 is destructive: Even with SmoothQuant, cramming activations into 16 levels (INT4) loses too much information for most models.
- No mainstream hardware support: As of 2025, no production GPU has INT4 tensor cores.
- Quality gap: W4A4 typically costs 2-5 perplexity points on 7B models, which is unacceptable for most applications.
Research directions like QuIP# (Chee et al., 2024) and AQLM use vector quantization and learned codebooks to push effective precision below 4 bits, but these require specialized kernels and are not yet mainstream.
KV Cache Quantization
The KV (key-value) cache stores the key and value projections for all past tokens during autoregressive generation. For long-context inference, the KV cache can become the dominant memory consumer — dwarfing even the model weights.
Example: Llama 2 70B with 128K context length in FP16:
- Model weights: 140 GB
- KV cache per sequence: $2 \times 80\ \text{layers} \times 8\ \text{KV heads} \times 128\ \text{head dim} \times 2\ \text{bytes} \times 131{,}072\ \text{tokens} \approx 43\ \text{GB}$ — roughly 40 GB per sequence at FP16 (Llama 2 70B uses grouped-query attention with 8 KV heads)
KV cache quantization is orthogonal to weight quantization — you can (and should) apply both independently.
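The sizing arithmetic above generalizes to a one-line calculator (defaults are Llama 2 70B's grouped-query attention configuration: 80 layers, 8 KV heads, head dim 128):

```python
def kv_cache_gb(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """GB of keys + values for one sequence at the given element width."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return seq_len * per_token / 1e9

print(round(kv_cache_gb(128 * 1024), 1))                     # FP16: 42.9 GB
print(round(kv_cache_gb(128 * 1024, bytes_per_elem=1), 1))   # FP8:  21.5 GB
```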
Per-Token KV Quantization
The simplest approach: compute a scale factor for each token’s key and value vectors independently. Because each token’s KV vectors are relatively short (hidden_dim / num_heads per head), per-token quantization is fine-grained enough to work well.
def quantize_kv_cache(key_states, value_states, num_bits=8):
    """
    Quantize KV cache per-token (one scale per token per head).
    key_states, value_states: (batch, num_heads, seq_len, head_dim)
    """
    qmax = 2 ** (num_bits - 1) - 1
    # Compute per-token scales (one scale per token per head)
    k_scales = key_states.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    v_scales = value_states.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    k_quantized = (key_states / k_scales).round().clamp(-qmax - 1, qmax).to(torch.int8)
    v_quantized = (value_states / v_scales).round().clamp(-qmax - 1, qmax).to(torch.int8)
    return k_quantized, v_quantized, k_scales, v_scales
FP8 KV Cache
FP8 (E4M3) KV cache is the simplest and most effective option on H100:
- 2x memory savings compared to FP16 (no additional scale storage needed — FP8 is self-describing).
- Minimal quality loss: Typically <0.1 perplexity points on standard benchmarks.
- No specialized kernels needed: Attention kernels like FlashAttention-3 support FP8 natively.
INT4 KV Cache
INT4 KV cache is more aggressive:
- 4x memory savings compared to FP16 — enabling 4x longer context or 4x more concurrent sequences.
- Noticeable quality impact: 0.2-0.5 perplexity points, with occasional degradation on tasks requiring precise recall of early context.
- Per-token or per-group scales required: Adds storage overhead (FP16 scales).
KV Cache Quantization — Llama 2 70B, 4K Context
| KV Precision | KV Cache Size | Perplexity Delta | Max Context (80 GB budget) |
|---|---|---|---|
| FP16 | 1.0x (baseline) | +0.00 | ~32K tokens |
| FP8 (E4M3) | 0.50x | +0.05 | ~64K tokens |
| INT8 | 0.50x + scales | +0.08 | ~60K tokens |
| INT4 | 0.25x + scales | +0.30 | ~100K+ tokens |
Recommendation: If your hardware supports FP8 (H100, MI300X), use FP8 KV cache — it is nearly free. On older hardware, INT8 KV cache with per-token scales is the safest choice. INT4 KV cache is worth it only when you are severely memory-constrained for long-context workloads.
Hardware Support Matrix
Not all GPUs support all quantization formats. The mapping between quantization methods and hardware is critical for choosing the right approach.
GPU Hardware Quantization Support
| GPU | Architecture | INT8 Tensor Cores | FP8 Tensor Cores | Best Quantization Strategy |
|---|---|---|---|---|
| A100 80GB | Ampere | Yes (624 TOPS) | No | W8A8 (SmoothQuant) or W4A16 (GPTQ/AWQ) |
| H100 80GB | Hopper | Yes | Yes (E4M3/E5M2, 3958 TFLOPS) | FP8 (W8A8) via Transformer Engine |
| H200 141GB | Hopper | Yes | Yes (E4M3/E5M2) | FP8 + larger KV cache budget |
| MI300X 192GB | CDNA3 | Yes | Yes (E4M3/E5M2) | FP8 (W8A8) via ROCm |
| L40S 48GB | Ada Lovelace | Yes | Yes (E4M3/E5M2) | FP8 or INT8 |
| RTX 4090 24GB | Ada Lovelace | Yes | Yes (E4M3/E5M2) | W4A16 for large models (limited VRAM) |
The strategic implications:
- A100: No FP8, so your best options are W8A8 with SmoothQuant (using INT8 tensor cores) or W4A16 with GPTQ/AWQ (dequantize to FP16, use FP16 tensor cores). For decode-bound workloads, W4A16 often wins because the bandwidth savings outweigh the lack of INT4 tensor cores.
- H100/H200: FP8 is the clear winner. The Transformer Engine handles scaling automatically, FP8 tensor cores deliver 2x FLOPS over FP16, and quality loss is negligible. W4A16 is still useful when you need to fit very large models into a limited GPU count.
- MI300X: AMD's CDNA3 supports FP8 (E4M3/E5M2) and INT8. The software ecosystem (ROCm, hipBLAS) is catching up, with vLLM providing good support. The 192 GB HBM3 capacity means fewer quantization compromises are needed.
- Consumer GPUs (RTX 4090): With only 24 GB VRAM, aggressive quantization (W4A16) is essential for running 70B+ models. FP8 is supported, but the small memory makes W4A16 the practical choice.
Tensor Core Throughput by Precision — H100 SXM
| Metric | FP32 | TF32 | BF16/FP16 | FP8 (E4M3) | INT8 |
|---|---|---|---|---|---|
| Peak throughput | 67 TFLOPS | 989 TFLOPS | 1979 TFLOPS | 3958 TFLOPS | 3958 TOPS |
Quality Impact: Real Numbers
Theory is important, but what actually happens to model quality? Here are representative numbers from published papers and community benchmarks.
Perplexity on WikiText-2
WikiText-2 Perplexity — Lower is Better
| Model | FP16 | INT8 (W8A8 SmoothQuant) | INT4 GPTQ (g128) | INT4 AWQ (g128) | INT4 RTN (g128) |
|---|---|---|---|---|---|
| Llama 2 7B | 5.47 | 5.51 | 5.63 | 5.60 | 6.29 |
| Llama 2 13B | 4.88 | 4.90 | 4.97 | 4.95 | 5.20 |
| Llama 2 70B | 3.32 | 3.33 | 3.40 | 3.38 | 3.75 |
Key observations:
- INT8 (W8A8) is nearly lossless: Across all model sizes, INT8 with SmoothQuant adds <0.05 perplexity points. This is within noise range and is effectively free quality-wise.
- INT4 with GPTQ/AWQ is good but not free: The cost is 0.06-0.16 perplexity points on large models, which is acceptable for most applications.
- INT4 RTN is significantly worse: Up to 0.43 perplexity points on 70B, demonstrating why naive quantization fails.
- Larger models quantize better: The quality gap between FP16 and INT4 shrinks as model size increases (+0.06 for 70B vs. +0.13 for 7B with AWQ).
Downstream Task Accuracy
Perplexity is a proxy metric. What about actual task performance?
MMLU (5-shot) — Higher is Better
| Model | FP16 | FP8 (E4M3) | INT8 (SmoothQuant) | INT4 (AWQ g128) |
|---|---|---|---|---|
| Llama 2 7B | 45.3% | 45.2% | 45.0% | 44.5% |
| Llama 2 13B | 54.8% | 54.7% | 54.6% | 54.0% |
| Llama 2 70B | 68.9% | 68.8% | 68.7% | 68.0% |
HumanEval (pass@1) — Higher is Better
| Model | FP16 | FP8 (E4M3) | INT8 (SmoothQuant) | INT4 (AWQ g128) |
|---|---|---|---|---|
| Llama 2 7B | 12.8% | 12.8% | 12.2% | 11.6% |
| Llama 2 13B | 18.3% | 18.3% | 18.0% | 17.1% |
| Llama 2 70B | 29.9% | 29.9% | 29.6% | 28.7% |
Rule of thumb for production: INT8 and FP8 are safe for virtually all use cases — expect <0.5% accuracy drop on downstream tasks. INT4 (GPTQ/AWQ) is safe for 13B+ models with <1-2% accuracy drop. For 7B models and below at INT4, benchmark on your specific task before deploying.
Perplexity vs. Model Size at INT4
The relationship between model size and quantization robustness deserves emphasis:
Perplexity Degradation from INT4 AWQ (g128) vs. Model Size
| Metric | 1.1B | 3B | 7B | 13B | 33B | 70B |
|---|---|---|---|---|---|---|
| Perplexity increase over FP16 | ~+0.9 | — | +0.13 | +0.07 | — | +0.06 |
Below about 3B parameters, INT4 quantization starts causing significant degradation. This is because smaller models have less redundancy — each weight carries more information, and the quantization error budget is tighter.
When NOT to Quantize
Quantization is powerful but not always appropriate. Here are the cases where you should think twice.
Small Models (<3B Parameters)
As the chart above shows, small models suffer disproportionately from quantization. A 1.1B model at INT4 may lose nearly a full perplexity point — equivalent to using a model trained on half the data. For models this small, the memory savings (from ~2 GB to ~0.5 GB) are rarely worth the quality cost, since they already fit comfortably in a single GPU.
If you must quantize a small model, stick to INT8 or FP8, which remain nearly lossless even at small scales.
Training and Fine-Tuning
Integer quantization is not used during training. The standard for training is mixed precision with BF16 (for compute) and FP32 (for master weights and optimizer states). The reason: training requires gradients, and gradients need:
- High dynamic range: Gradients can span many orders of magnitude.
- Precise accumulation: Small gradient updates must not be rounded away.
- Symmetric handling of positive and negative values: Integer formats are awkward here.
The one exception is FP8 training on H100, where E5M2 is used for gradient GEMM and E4M3 for forward-pass GEMM, with FP32 master weights. This is a form of quantized training, but it uses floating-point quantization, not integer.
Quantization-aware training (QAT) — where you simulate quantization during training and let the model adapt — can improve quantized model quality, but it requires a full training run and is rarely used for LLMs due to the enormous compute cost. Post-training quantization (PTQ) methods like GPTQ and AWQ are the standard.
Activations with Extreme Outliers
Some model architectures (particularly older ones without proper normalization, or models with very deep residual paths) develop activation outliers so extreme that even SmoothQuant cannot fully tame them. In these cases, W8A8 quantization may cause unacceptable quality loss, and you should fall back to weight-only quantization (W4A16 or W8A16) where activations remain in FP16.
Fortunately, most modern LLM architectures (Llama, Mistral, Qwen, etc.) work well with SmoothQuant or FP8 quantization.
When Latency Is Not the Goal
If you are running batch inference (large batches, high utilization), the workload may be compute-bound rather than memory-bandwidth-bound. In this regime, the decode bottleneck shifts from reading weights to performing the matrix multiplications, and quantization provides less benefit. FP8 or W8A8 with INT8 tensor cores still helps (2x compute throughput), but W4A16 provides less benefit because you still run the FP16 GEMM — you just save memory.
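A rough roofline check makes the regime boundary concrete. This sketch assumes a weight-streaming FP16 GEMM during decode, where each 2-byte weight element is read once and used for 2 × batch FLOPs; the A100 figures (312 TFLOPS FP16, ~2 TB/s HBM) come from the text.

```python
def is_memory_bound(batch_size: int, peak_tflops: float = 312.0,
                    bandwidth_tb_s: float = 2.0) -> bool:
    """Memory-bound when arithmetic intensity (FLOPs per byte moved)
    falls below the GPU's compute-to-bandwidth ratio (the roofline ridge)."""
    intensity = 2 * batch_size / 2          # FLOPs per byte of FP16 weights
    ridge = peak_tflops / bandwidth_tb_s    # FLOPs per byte at the ridge point
    return intensity < ridge

print(is_memory_bound(1))     # decode at batch 1: memory-bound
print(is_memory_bound(256))   # large batches: compute-bound
```

Under these numbers the crossover sits around batch 156, which is why single-stream decode benefits so much from quantization while large-batch serving benefits mainly from faster tensor cores.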
Production Deployment
Theory and benchmarks are great, but what does it look like to actually deploy quantized models? The two dominant inference frameworks — vLLM and TensorRT-LLM — each handle quantization differently.
vLLM
vLLM is the most popular open-source LLM inference framework, with built-in support for multiple quantization methods:
GPTQ/AWQ (W4A16): Load pre-quantized models directly. vLLM includes optimized dequantization kernels and the Marlin kernel for W4A16 GEMM.
FP8 (W8A8): Native support on H100. vLLM can quantize models to FP8 on-the-fly or load pre-quantized FP8 checkpoints. Uses cuBLAS FP8 GEMM.
SmoothQuant INT8 (W8A8): Supported via pre-quantized models from frameworks like AutoSmoothQuant.
FP8 KV Cache: Enabled with a flag (--kv-cache-dtype fp8_e4m3), providing 2x KV cache memory savings with negligible quality impact.
# Launch vLLM with AWQ quantized model
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--dtype float16 \
--tensor-parallel-size 2 \
--max-model-len 4096
# Launch with FP8 on H100
python -m vllm.entrypoints.openai.api_server \
--model neuralmagic/Meta-Llama-3-70B-FP8 \
--dtype float16 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 2
The Marlin Kernel: W4A16 Done Right
Marlin (IST Austria, 2024) is a GPU kernel specifically designed for W4A16 GEMM (4-bit quantized weights, FP16 activations and output). It achieves near-ideal (4x) speedup over FP16 GEMM on A100 by:
1. Fusing dequantization with GEMM: Instead of dequantizing the entire weight matrix to FP16 and then calling cuBLAS, Marlin dequantizes small tiles of weights in shared memory immediately before the tensor core GEMM.
2. Asynchronous memory access: Overlaps global memory loads with compute, keeping the tensor cores fed.
3. Optimal tile sizes: Carefully tuned for the A100 memory hierarchy (L2 cache, shared memory, register file).
The result: Marlin achieves ~3.8x speedup over FP16 cuBLAS for decode (batch size 1) on A100, close to the theoretical 4x from reading 4x fewer bytes. This kernel is integrated into vLLM and is the reason W4A16 is so effective in practice.
Marlin W4A16 Performance vs. FP16 cuBLAS — A100 80GB
| Matrix Size (M x N x K) | FP16 (us) | W4A16 Marlin (us) | Speedup |
|---|---|---|---|
| 1 x 4096 x 4096 | 15.2 | 4.1 | 3.71x |
| 1 x 8192 x 8192 | 52.3 | 14.0 | 3.74x |
| 1 x 11008 x 4096 | 22.4 | 6.1 | 3.67x |
| 16 x 4096 x 4096 | 16.8 | 5.9 | 2.85x |
| 64 x 4096 x 4096 | 19.1 | 11.2 | 1.71x |
Notice how the speedup decreases at larger batch sizes (M=16, M=64): as the workload shifts from memory-bound to compute-bound, the bandwidth savings from smaller weights matter less. At batch size 1, Marlin is nearly at the theoretical limit. This is exactly the decode scenario.
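This crossover can be reproduced with a toy roofline model: each GEMM is bounded by the slower of weight streaming and tensor-core math. The bandwidth and FLOPS figures below are nominal A100 specs, and the model ignores activation traffic and kernel overheads, so it predicts the falloff at slightly larger M than the measured table shows:

```python
# Toy roofline: decode-GEMM time = max(weight-streaming time, math time).
# Assumed A100 figures: ~2.0 TB/s HBM bandwidth, ~312 TFLOPS FP16 tensor cores.
def gemm_time_us(M, N, K, bytes_per_weight, bw_bytes_per_s=2.0e12, flops_per_s=312e12):
    mem_us = (N * K * bytes_per_weight) / bw_bytes_per_s * 1e6  # stream W once
    math_us = (2 * M * N * K) / flops_per_s * 1e6               # 2 FLOPs per MAC
    return max(mem_us, math_us)

for M in (1, 16, 64, 256):
    fp16 = gemm_time_us(M, 4096, 4096, bytes_per_weight=2)
    w4 = gemm_time_us(M, 4096, 4096, bytes_per_weight=0.5)
    print(f"M={M:3d}: predicted W4A16 speedup {fp16 / w4:.2f}x")
```

The model predicts the full 4x at M=1 and a falloff once M grows past roughly 40, where the 4-bit GEMM itself becomes compute-bound and the bandwidth advantage stops mattering.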
TensorRT-LLM
NVIDIA’s TensorRT-LLM provides tightly optimized quantization support with deeper hardware integration:
- FP8: First-class support with automatic scaling via Transformer Engine. The quantization is applied during the TRT engine build process.
- SmoothQuant INT8 (W8A8): Built-in SmoothQuant implementation with per-channel weight scales and per-tensor activation scales.
- INT4 AWQ/GPTQ: Supported with optimized CUTLASS kernels.
- INT4 KV cache: Supported alongside FP8 KV cache.
TensorRT-LLM typically achieves 10-20% higher throughput than vLLM for a given quantization method due to deeper kernel fusion (e.g., fusing quantization into the preceding operation, fusing bias addition with dequantization), but at the cost of less flexibility and longer model compilation times.
```python
# TensorRT-LLM: build a quantized engine (simplified)
from tensorrt_llm import Builder
from tensorrt_llm.quantization import QuantMode

builder = Builder()
quant_mode = QuantMode.use_smooth_quant(per_token=True, per_channel=True)
# or: QuantMode.use_weight_only(use_int4_weights=True)
# or: QuantMode.from_description(quantize_weights=True, quantize_activations=True,
#                                per_token=True, per_channel=True, use_fp8=True)

engine = builder.build_engine(
    model_config=config,
    quant_mode=quant_mode,
    max_batch_size=64,
    max_input_len=2048,
    max_output_len=512,
)
```
End-to-End Deployment Decision Tree
Choosing the right quantization strategy depends on your hardware, model size, and quality requirements. Here is a practical decision tree:
On H100/H200:
- Start with FP8 (W8A8) + FP8 KV cache. This is the default and nearly free.
- If the model does not fit in memory at FP8, use W4A16 (AWQ) + FP8 KV cache.
- If you need maximum throughput and have verified quality, combine W4A16 with INT4 KV cache.
On A100:
- Start with W4A16 (AWQ) + Marlin kernel for best single-stream decode latency.
- If quality is paramount or batch sizes are large, use W8A8 (SmoothQuant) with INT8 tensor cores.
- For very long context, add INT8 KV cache quantization.
On consumer GPUs (RTX 3090/4090):
- W4A16 (AWQ or GPTQ) is the only practical option for 70B+ models.
- For 7B-13B models, W4A16 or even W8A16 may fit.
- Use INT4 KV cache if you need long context.
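The branches above collapse into a small lookup function. This is a purely illustrative sketch — the helper name and return strings are invented for this post, not part of any framework's API:

```python
# The deployment decision tree as a tiny helper (names/strings illustrative).
def pick_quantization(gpu: str, fits_in_fp8: bool = True,
                      quality_critical: bool = False,
                      long_context: bool = False) -> str:
    if gpu in ("H100", "H200"):
        if not fits_in_fp8:
            return "W4A16 (AWQ) + FP8 KV cache"
        return "FP8 (W8A8) + FP8 KV cache"  # the default, nearly free
    if gpu == "A100":
        if quality_critical:
            return "W8A8 (SmoothQuant), FP16 KV cache"
        base = "W4A16 (AWQ, Marlin kernel)"
        return base + (" + INT8 KV cache" if long_context else "")
    # Consumer GPUs (RTX 3090/4090): 4-bit weight-only is the practical choice
    return "W4A16 (AWQ/GPTQ)" + (" + INT4 KV cache" if long_context else "")

print(pick_quantization("A100", long_context=True))
```

In practice you would key this off detected compute capability and free VRAM rather than a GPU name string, but the branching logic is the same.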
End-to-End Throughput Comparison — Llama 2 70B: the headline metric is decode tokens/sec at batch=1, comparing FP16 on 2x A100 against W8A8 SmoothQuant, W4A16 AWQ (Marlin), and W4A16 AWQ + INT8 KV cache on a single A100 80GB.
Putting It All Together
Quantization for LLM inference is a mature and essential technique in 2025. The landscape has settled into clear best practices:
Summary: Recommended Quantization by Scenario
| Scenario | Weight Precision | Activation Precision | KV Cache | Method |
|---|---|---|---|---|
| H100, quality-critical | FP8 (E4M3) | FP8 (E4M3) | FP8 | Transformer Engine auto-quant |
| H100, max throughput | INT4 | FP16 | FP8 | AWQ + Marlin + FP8 KV |
| A100, balanced | INT4 | FP16 | INT8 | AWQ + Marlin + INT8 KV |
| A100, quality-critical | INT8 | INT8 | FP16 | SmoothQuant W8A8 |
| Consumer GPU, large model | INT4 | FP16 | INT4 | AWQ/GPTQ + INT4 KV |
| Small model (< 3B) | FP16 or FP8 | FP16 or FP8 | FP16 | Minimal or no quantization |
The key principles to remember:
- LLM decode is memory-bandwidth-bound. Every byte saved in weights and KV cache translates nearly linearly to faster inference. This is the fundamental reason quantization works so well.
- INT8 and FP8 are nearly lossless. For any model above 7B parameters, INT8/FP8 quantization costs <0.1 perplexity points and <0.5% on downstream tasks. There is almost no reason not to use it.
- INT4 weight-only quantization is mature and practical. AWQ and GPTQ with group size 128 produce high-quality INT4 models for 7B+ parameter models. Combined with the Marlin kernel, W4A16 delivers near-4x decode speedup on A100.
- KV cache quantization is the next frontier. As context lengths grow to 128K+ tokens, KV cache memory dominates. FP8 KV cache is nearly free; INT4 KV cache enables dramatically longer contexts at a small quality cost.
- Hardware determines strategy. FP8 on H100 is the easy button. INT8 SmoothQuant or W4A16 on A100 requires more careful choice. Always check your hardware's tensor core support before choosing a quantization format.
- Always benchmark on your specific task. Perplexity is a useful proxy, but production quality depends on your application. A 0.1 perplexity point difference might be invisible on summarization but noticeable on code generation. Quantize, evaluate, then deploy.
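For reference, perplexity — the proxy metric used throughout — is just the exponentiated mean negative log-likelihood per token:

```python
import math

# Perplexity from per-token log-likelihoods (natural log). A 0.1-point
# difference on a ~5.0 baseline corresponds to roughly a 2% shift in mean NLL.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Compute it over the same held-out text with the FP16 and quantized models, then confirm with task-level metrics before deploying.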
Quantization is no longer an exotic optimization — it is table stakes for cost-effective LLM deployment. The tools are mature, the quality impact is well-understood, and the speedups are real. The only question is which method to use, and this post should help you answer that.