Part of Series: Quantization Masterclass, 11 of 30

Quantization maps continuous floating-point values to a discrete set of integers. The mapping requires a scale factor (and optionally a zero-point) that defines how the integer grid maps back to the real-valued range. The critical design decision is the granularity at which you compute this scale factor: once for the entire tensor, once per output channel, or once per group of weights within a channel.

This granularity choice determines how much quantization error you accumulate, how much metadata overhead you pay, and whether outlier channels can corrupt the quantization of well-behaved channels. This post implements all three granularities from scratch, analyzes the error characteristics mathematically, benchmarks perplexity on Llama-2 7B, and provides concrete guidance on group size selection.

The Quantization Mapping

For symmetric quantization to b bits, the mapping from a floating-point weight w to an integer q is:

q = \text{clamp}\left(\text{round}\left(\frac{w}{s}\right),\; -2^{b-1},\; 2^{b-1} - 1\right)

where s is the scale factor. The dequantized value is:

\hat{w} = q \cdot s

The quantization error for a single weight is:

\epsilon = w - \hat{w} = w - s \cdot \text{round}(w / s)

The maximum possible error is s/2 (half the step size). The scale factor s is computed as:

s = \frac{\max(|w|)}{2^{b-1} - 1}

For asymmetric quantization, we also compute a zero-point:

s = \frac{\max(w) - \min(w)}{2^b - 1}, \quad z = \text{round}\left(\frac{-\min(w)}{s}\right)

The granularity question is: over which set of weights do we compute \max(|w|)?
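Before turning to granularity, here is the mapping applied end to end on a small vector (a minimal sketch; b = 4 and the weight values are chosen arbitrarily), confirming the s/2 error bound:

```python
import numpy as np

w = np.array([0.31, -0.12, 0.05, -0.44], dtype=np.float32)
b = 4
qmax = 2 ** (b - 1) - 1              # 7

s = np.max(np.abs(w)) / qmax         # scale set by the largest |w|
q = np.clip(np.round(w / s), -2 ** (b - 1), qmax)
w_hat = q * s                        # dequantize

print(f"s = {s:.4f}, q = {q.astype(int)}, w_hat = {w_hat}")
# Every element round-trips to within half a quantization step:
assert np.all(np.abs(w - w_hat) <= s / 2 + 1e-7)
```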

Per-Tensor Scaling

Per-tensor scaling computes a single scale factor for the entire weight matrix W \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}:

s_{\text{tensor}} = \frac{\max_{i,j}(|W_{i,j}|)}{2^{b-1} - 1}

One scale factor, one zero-point (if asymmetric). Minimal metadata overhead.

import numpy as np

def quantize_per_tensor_symmetric(weights, bits):
    """Per-tensor symmetric quantization.

    Args:
        weights: numpy array of shape (C_out, C_in)
        bits: number of quantization bits

    Returns:
        quantized: integer array of same shape
        scale: single float
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))

    abs_max = np.max(np.abs(weights))
    scale = abs_max / qmax if abs_max > 0 else 1.0

    quantized = np.clip(np.round(weights / scale), qmin, qmax).astype(np.int8)
    return quantized, scale

def dequantize_per_tensor(quantized, scale):
    return quantized.astype(np.float32) * scale

# Example: quantize a (4096, 4096) weight matrix
np.random.seed(42)
W = np.random.randn(4096, 4096).astype(np.float32) * 0.02

q_tensor, s_tensor = quantize_per_tensor_symmetric(W, bits=8)
W_hat_tensor = dequantize_per_tensor(q_tensor, s_tensor)

mse_tensor = np.mean((W - W_hat_tensor) ** 2)
print(f"Per-tensor INT8 MSE: {mse_tensor:.2e}")
# Per-tensor INT8 MSE: ~6e-08 (step size s ≈ 8.5e-4, and s^2/12 ≈ 6e-8)

The problem with per-tensor scaling emerges when different rows of the weight matrix have very different magnitudes. If row 0 has \max(|w|) = 0.01 and row 1000 has \max(|w|) = 1.0, the scale factor is set by row 1000. Row 0 can only use ~1% of the integer range, wasting ~99% of the representable levels.
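The waste is easy to measure. This sketch (the row magnitudes are illustrative) quantizes a two-row matrix with a shared per-tensor INT8 scale and counts the integer levels each row actually occupies:

```python
import numpy as np

rng = np.random.default_rng(0)
row_small = rng.standard_normal(4096).astype(np.float32) * 0.01  # max |w| ~ 0.035
row_large = rng.standard_normal(4096).astype(np.float32) * 1.0   # max |w| ~ 3.5

W = np.stack([row_small, row_large])
scale = np.abs(W).max() / 127          # single per-tensor INT8 scale

q = np.clip(np.round(W / scale), -128, 127)
levels_small = np.unique(q[0]).size    # distinct integer levels used by row 0
levels_large = np.unique(q[1]).size

print(f"small row: {levels_small} of 256 levels; large row: {levels_large} of 256")
```

The small-magnitude row collapses onto a handful of integers near zero, while the large-magnitude row spreads across most of the range.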

Per-Channel Scaling

Per-channel scaling computes one scale factor per output channel (row of the weight matrix):

si=maxโกj(โˆฃWi,jโˆฃ)2bโˆ’1โˆ’1,i=0,โ€ฆ,Coutโˆ’1s_i = \frac{\max_j(|W_{i,j}|)}{2^{b-1} - 1}, \quad i = 0, \ldots, C_{\text{out}} - 1

This yields CoutC_{\text{out}} scale factors. Each row uses its full integer range independently.

def quantize_per_channel_symmetric(weights, bits):
    """Per-channel (per-row) symmetric quantization.

    Args:
        weights: numpy array of shape (C_out, C_in)
        bits: number of quantization bits

    Returns:
        quantized: integer array of same shape
        scales: array of shape (C_out,)
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))

    # One scale per row
    abs_max = np.max(np.abs(weights), axis=1)  # shape: (C_out,)
    scales = abs_max / qmax
    scales = np.where(scales > 0, scales, 1.0)

    # Broadcast: weights (C_out, C_in) / scales (C_out, 1)
    quantized = np.clip(
        np.round(weights / scales[:, np.newaxis]),
        qmin, qmax
    ).astype(np.int8)

    return quantized, scales

def dequantize_per_channel(quantized, scales):
    return quantized.astype(np.float32) * scales[:, np.newaxis]

q_channel, s_channel = quantize_per_channel_symmetric(W, bits=8)
W_hat_channel = dequantize_per_channel(q_channel, s_channel)

mse_channel = np.mean((W - W_hat_channel) ** 2)
print(f"Per-channel INT8 MSE: {mse_channel:.2e}")
# Per-channel INT8 MSE: ~3e-08 (about half the per-tensor value)

For Gaussian-distributed weights, per-channel buys only a modest improvement over per-tensor (roughly 2x lower MSE, because the row maximum is smaller than the global maximum) since every row has a similar distribution. The difference becomes dramatic when rows have heterogeneous scales, which happens in real models due to the outlier channel phenomenon (covered in Post 12).

Metadata overhead: For a (4096 \times 4096) weight matrix quantized to INT8, the weights themselves occupy 4096 \times 4096 = 16{,}777{,}216 bytes = 16 MB (at 1 byte each). Per-channel scales add 4096 \times 2 = 8 KB (FP16 scales). That is ~0.05% overhead, negligible.
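The same arithmetic in code:

```python
weight_bytes = 4096 * 4096 * 1   # INT8: one byte per weight
scale_bytes = 4096 * 2           # one FP16 scale per output channel
overhead = scale_bytes / weight_bytes
print(f"scale overhead: {overhead:.4%}")   # ~0.05%
```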

Per-Group Scaling

Per-group scaling subdivides each row into groups of g contiguous elements and computes one scale factor per group:

s_{i,k} = \frac{\max_{j \in [kg, (k+1)g)}(|W_{i,j}|)}{2^{b-1} - 1}

where k indexes the group within row i. This yields C_{\text{out}} \times \lceil C_{\text{in}} / g \rceil scale factors.

def quantize_per_group_symmetric(weights, bits, group_size):
    """Per-group symmetric quantization.

    Args:
        weights: numpy array of shape (C_out, C_in)
        bits: number of quantization bits
        group_size: number of elements per group

    Returns:
        quantized: integer array of same shape
        scales: array of shape (C_out, num_groups)
    """
    C_out, C_in = weights.shape
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))

    # Pad if C_in not divisible by group_size
    num_groups = (C_in + group_size - 1) // group_size
    padded_C_in = num_groups * group_size

    if padded_C_in > C_in:
        weights_padded = np.zeros((C_out, padded_C_in), dtype=weights.dtype)
        weights_padded[:, :C_in] = weights
    else:
        weights_padded = weights

    # Reshape to (C_out, num_groups, group_size)
    reshaped = weights_padded.reshape(C_out, num_groups, group_size)

    # Compute per-group scales
    abs_max = np.max(np.abs(reshaped), axis=2)  # (C_out, num_groups)
    scales = abs_max / qmax
    scales = np.where(scales > 0, scales, 1.0)

    # Quantize
    quantized = np.clip(
        np.round(reshaped / scales[:, :, np.newaxis]),
        qmin, qmax
    ).astype(np.int8)

    # Reshape back, remove padding
    quantized = quantized.reshape(C_out, padded_C_in)[:, :C_in]

    return quantized, scales

def dequantize_per_group(quantized, scales, group_size):
    C_out, C_in = quantized.shape
    num_groups = scales.shape[1]
    padded_C_in = num_groups * group_size

    if padded_C_in > C_in:
        q_padded = np.zeros((C_out, padded_C_in), dtype=quantized.dtype)
        q_padded[:, :C_in] = quantized
    else:
        q_padded = quantized

    reshaped = q_padded.reshape(C_out, num_groups, group_size)
    dequantized = reshaped.astype(np.float32) * scales[:, :, np.newaxis]

    return dequantized.reshape(C_out, padded_C_in)[:, :C_in]

# Compare group sizes
for gs in [32, 64, 128, 256]:
    q_group, s_group = quantize_per_group_symmetric(W, bits=4, group_size=gs)
    W_hat_group = dequantize_per_group(q_group, s_group, gs)
    mse = np.mean((W - W_hat_group) ** 2)
    num_scales = s_group.size
    overhead_bytes = num_scales * 2  # FP16 scales
    weight_bytes = W.size // 2  # 4-bit = 0.5 bytes per element
    overhead_pct = overhead_bytes / weight_bytes * 100
    print(f"  group_size={gs:3d}: MSE={mse:.2e}, "
          f"scales={num_scales:,}, overhead={overhead_pct:.1f}%")

Expected output for Gaussian weights:

  group_size= 32: MSE=5.2e-06, scales=524,288, overhead=12.5%
  group_size= 64: MSE=6.1e-06, scales=262,144, overhead=6.2%
  group_size=128: MSE=7.5e-06, scales=131,072, overhead=3.1%
  group_size=256: MSE=9.2e-06, scales=65,536, overhead=1.6%
⚡ The Group Size Trade-off

Smaller group size = lower quantization error (each group fits its own local range) but higher metadata overhead (more scale factors to store and fetch during inference). At group_size=32 with INT4 weights, the scale factors themselves consume 12.5% additional memory, significant enough to partially offset the compression benefit.

Mathematical Error Analysis

Let us derive the expected quantization error for each granularity under the assumption that weights are drawn from \mathcal{N}(0, \sigma^2).

Per-Tensor Error

For per-tensor scaling, the scale factor is set by the global maximum:

s=maxโกi,j(โˆฃWi,jโˆฃ)2bโˆ’1โˆ’1s = \frac{\max_{i,j}(|W_{i,j}|)}{2^{b-1} - 1}

For nn i.i.d. samples from N(0,ฯƒ2)\mathcal{N}(0, \sigma^2), the expected maximum of โˆฃWโˆฃ|W| is approximately:

E[maxโกโˆฃWโˆฃ]โ‰ˆฯƒ2lnโกn\mathbb{E}[\max |W|] \approx \sigma \sqrt{2 \ln n}

For a (4096ร—4096)(4096 \times 4096) matrix with n=16,777,216n = 16{,}777{,}216 elements:

E[maxโกโˆฃWโˆฃ]โ‰ˆฯƒ2lnโก(16,777,216)โ‰ˆ5.87ฯƒ\mathbb{E}[\max |W|] \approx \sigma \sqrt{2 \ln(16{,}777{,}216)} \approx 5.87\sigma

The quantization step size is:

ฮ”tensor=s1=5.87ฯƒ2bโˆ’1โˆ’1\Delta_{\text{tensor}} = \frac{s}{1} = \frac{5.87\sigma}{2^{b-1} - 1}

For uniform quantization error, the MSE per weight is ฮ”2/12\Delta^2 / 12:

MSEtensor=112(5.87ฯƒ2bโˆ’1โˆ’1)2\text{MSE}_{\text{tensor}} = \frac{1}{12}\left(\frac{5.87\sigma}{2^{b-1}-1}\right)^2

Per-Channel Error

Each row has C_{\text{in}} = 4096 elements. The expected row maximum is:

\mathbb{E}[\max_j |W_{i,j}|] \approx \sigma \sqrt{2 \ln(4096)} \approx 4.08\sigma

\text{MSE}_{\text{channel}} = \frac{1}{12}\left(\frac{4.08\sigma}{2^{b-1}-1}\right)^2

Per-Group Error

For group size g:

\mathbb{E}[\max |w| \text{ in group}] \approx \sigma \sqrt{2 \ln g}

\text{MSE}_{\text{group}} = \frac{1}{12}\left(\frac{\sigma \sqrt{2 \ln g}}{2^{b-1}-1}\right)^2
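The per-group formula can be sanity-checked by Monte Carlo. The sketch below quantizes many independent Gaussian groups and compares the measured MSE against the closed form; since \sigma \sqrt{2 \ln g} overestimates the true expected maximum at moderate g, agreement within a small constant factor is the best to expect (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, bits, g = 0.02, 4, 128
qmax = 2 ** (bits - 1) - 1

# 20,000 independent groups of g Gaussian weights each
w = (rng.standard_normal((20_000, g)) * sigma).astype(np.float32)
s = np.abs(w).max(axis=1, keepdims=True) / qmax        # per-group scales
w_hat = np.clip(np.round(w / s), -qmax - 1, qmax) * s  # quantize + dequantize
mse_empirical = np.mean((w - w_hat) ** 2)

mse_theory = (sigma * np.sqrt(2 * np.log(g)) / qmax) ** 2 / 12
print(f"empirical MSE {mse_empirical:.2e} vs closed form {mse_theory:.2e}")
```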

Error Ratio

The ratio of per-tensor to per-group MSE is:

\frac{\text{MSE}_{\text{tensor}}}{\text{MSE}_{\text{group}}} = \frac{\ln n}{\ln g}

For n = 16{,}777{,}216 and g = 128: \ln(16{,}777{,}216) / \ln(128) = 16.64 / 4.85 = 3.43.

This means per-group quantization with g = 128 has about 3.4\times lower MSE than per-tensor on a (4096 \times 4096) matrix, purely from the reduced max-tracking scope.

def theoretical_mse_ratio(n_total, group_size):
    """Compute theoretical MSE ratio: per-tensor / per-group."""
    return np.log(n_total) / np.log(group_size)

n = 4096 * 4096
for g in [32, 64, 128, 256, 4096]:
    ratio = theoretical_mse_ratio(n, g)
    print(f"  group_size={g:>5d}: tensor/group MSE ratio = {ratio:.2f}x")
  group_size=   32: tensor/group MSE ratio = 4.80x
  group_size=   64: tensor/group MSE ratio = 4.00x
  group_size=  128: tensor/group MSE ratio = 3.43x
  group_size=  256: tensor/group MSE ratio = 3.00x
  group_size= 4096: tensor/group MSE ratio = 2.00x  (per-channel)

The Outlier Amplification Effect

The theoretical analysis above assumes Gaussian weights. Real LLM weights deviate from this because of outlier channels: a small number of channels with magnitudes 10-100x larger than typical. Under outliers, the granularity choice matters far more than the Gaussian analysis suggests.

def simulate_outlier_impact(C_out, C_in, bits, outlier_fraction, outlier_magnitude):
    """Simulate quantization with outlier channels."""
    W = np.random.randn(C_out, C_in).astype(np.float32) * 0.02

    # Inject outliers into a fraction of channels
    num_outlier_channels = int(C_in * outlier_fraction)
    outlier_cols = np.random.choice(C_in, num_outlier_channels, replace=False)
    W[:, outlier_cols] *= outlier_magnitude

    results = {}

    # Per-tensor
    q, s = quantize_per_tensor_symmetric(W, bits)
    W_hat = dequantize_per_tensor(q, s)
    results['per_tensor'] = np.mean((W - W_hat) ** 2)

    # Per-channel
    q, s = quantize_per_channel_symmetric(W, bits)
    W_hat = dequantize_per_channel(q, s)
    results['per_channel'] = np.mean((W - W_hat) ** 2)

    # Per-group g=128
    q, s = quantize_per_group_symmetric(W, bits, group_size=128)
    W_hat = dequantize_per_group(q, s, group_size=128)
    results['per_group_128'] = np.mean((W - W_hat) ** 2)

    return results

# 1% of input channels are 50x larger
results_outlier = simulate_outlier_impact(
    C_out=4096, C_in=4096, bits=4,
    outlier_fraction=0.01, outlier_magnitude=50.0
)

for method, mse in results_outlier.items():
    print(f"  {method:>15s}: MSE = {mse:.2e}")

Expected output:

    per_tensor: MSE = 2.4e-03
   per_channel: MSE = 2.4e-03  (outliers affect ALL rows)
 per_group_128: MSE = 3.1e-04  (outliers isolated to their groups)
โš ๏ธ Per-Channel Does Not Isolate Input-Dimension Outliers

Per-channel scaling computes one scale per output channel (row). If the outliers are in specific input channels (columns), every row sees the same outlier columns, and per-channel scaling cannot help. Per-group scaling isolates the outlier columns into specific groups, limiting their impact. This is why GPTQ and AWQ use per-group quantization with group_size=128 as the default.

Memory Overhead Accounting

For a weight matrix W \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}} quantized to b bits with group size g:

Weight storage: C_{\text{out}} \times C_{\text{in}} \times b / 8 bytes

Scale storage (FP16): C_{\text{out}} \times \lceil C_{\text{in}} / g \rceil \times 2 bytes

Zero-point storage (if asymmetric, INT8): C_{\text{out}} \times \lceil C_{\text{in}} / g \rceil \times 1 byte

Effective bits per weight:

b_{\text{eff}} = b + \frac{16}{g} \quad \text{(symmetric, FP16 scales)}

b_{\text{eff}} = b + \frac{24}{g} \quad \text{(asymmetric, FP16 scale + INT8 zero-point)}

def effective_bits_per_weight(quant_bits, group_size, asymmetric=False):
    """Compute effective bits per weight including scale overhead."""
    scale_bits = 16  # FP16 scale
    zp_bits = 8 if asymmetric else 0
    overhead_bits = (scale_bits + zp_bits) / group_size
    return quant_bits + overhead_bits

print("Effective bits per weight (symmetric, FP16 scales):")
for bits in [4, 8]:
    for gs in [32, 64, 128, 256]:
        eff = effective_bits_per_weight(bits, gs, asymmetric=False)
        print(f"  INT{bits} g={gs:3d}: {eff:.2f} bits/weight "
              f"(overhead: {(eff - bits)/bits*100:.1f}%)")
Effective bits per weight (symmetric, FP16 scales):
  INT4 g= 32: 4.50 bits/weight (overhead: 12.5%)
  INT4 g= 64: 4.25 bits/weight (overhead: 6.2%)
  INT4 g=128: 4.12 bits/weight (overhead: 3.1%)
  INT4 g=256: 4.06 bits/weight (overhead: 1.6%)
  INT8 g= 32: 8.50 bits/weight (overhead: 6.2%)
  INT8 g= 64: 8.25 bits/weight (overhead: 3.1%)
  INT8 g=128: 8.12 bits/weight (overhead: 1.6%)
  INT8 g=256: 8.06 bits/weight (overhead: 0.8%)
📊 Effective Bits Per Weight: INT4 with FP16 Scales

  Group Size           Weight Bits   Scale Bits/Weight   Effective Bits   Overhead
  32                   4.00          0.50                4.50             12.5%
  64                   4.00          0.25                4.25             6.2%
  128                  4.00          0.125               4.12             3.1%
  256                  4.00          0.0625              4.06             1.6%
  Per-channel (4096)   4.00          0.004               4.004            0.1%
  Per-tensor           4.00          ~0                  4.00             ~0%

Note: At group_size=32, scale metadata adds 12.5% overhead -- half a bit per weight. This is the primary cost of fine-grained grouping.

Kernel-Level Implications

The granularity choice affects the dequantization kernel that runs during inference. In a W4A16 kernel (4-bit weights, FP16 activations), the kernel must:

  1. Load quantized INT4 weights from global memory
  2. Load the corresponding scale factor(s)
  3. Dequantize to FP16 in registers
  4. Perform the FP16 matrix multiply

Per-tensor scaling requires loading one scale factor per GEMM. Per-channel requires one per output row. Per-group requires one per group within each row.

// Simplified W4A16 dequantization in CUDA
// Per-group variant (group_size = 128)

__device__ half dequantize_w4_per_group(
    uint8_t packed_byte,  // Two INT4 weights packed in one byte
    int weight_idx,       // Index within the row
    const half* scales,   // Scale array, one per group
    int group_size
) {
    // Extract the 4-bit field and re-center: [0,15] -> [-8,7]
    int w4;
    if (weight_idx % 2 == 0) {
        w4 = (int)(packed_byte & 0x0F) - 8;
    } else {
        w4 = (int)(packed_byte >> 4) - 8;
    }

    // Look up the group scale
    int group_idx = weight_idx / group_size;
    half scale = scales[group_idx];

    return __hmul(__int2half_rn(w4), scale);
}

The critical performance consideration is scale factor locality. With group_size=128, a 4096-element row has 32 groups, requiring 32 FP16 scale reads. These are typically loaded into shared memory at the start of a tile and reused across all activations that multiply with that weight tile.

// Tile-based W4A16 kernel sketch with per-group scales
__global__ void w4a16_gemm_per_group(
    const uint8_t* __restrict__ W_q,     // Quantized weights (packed INT4)
    const half* __restrict__ scales,     // Per-group scales
    const half* __restrict__ X,          // FP16 activations
    half* __restrict__ Y,                // Output
    int M, int N, int K,
    int group_size
) {
    // Each thread block handles a tile of the output
    const int tile_m = 128;
    const int tile_n = 128;
    const int tile_k = 64;  // group_size should be a multiple of tile_k
                            // (or vice versa) so K-tiles map cleanly to groups

    // With tile_k <= group_size, each K-tile touches one group per output row
    __shared__ half smem_scales[tile_n];
    __shared__ half smem_W_deq[tile_n * tile_k];
    __shared__ half smem_X[tile_m * tile_k];

    // For each K-tile:
    // 1. Load scales for the K-range into shared memory
    // 2. Load packed INT4 weights, dequantize using shared-mem scales
    // 3. Load FP16 activations
    // 4. Accumulate tile GEMM using tensor cores (mma.sync)

    // The scale lookup is in shared memory, so it does not add
    // global memory traffic beyond the initial load
}
โ„น๏ธ Group Size Alignment

For efficient GPU kernels, group boundaries should align with tile boundaries along K: the group size should be a multiple of the K-tile size, or vice versa. Common tile sizes are 64 or 128, so group_size=128 aligns perfectly. group_size=32 also aligns but requires more shared memory for scales. Odd group sizes like 48 or 96 cause misaligned scale lookups and degrade kernel performance.
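The alignment constraint translates directly into how many scale factors a kernel stages in shared memory per weight tile. A sketch of the bookkeeping (the tile shapes and the `scales_per_tile` helper are illustrative, not any real kernel's API):

```python
def scales_per_tile(tile_n, tile_k, group_size, scale_bytes=2):
    """Scale factors (and bytes) needed for a (tile_n x tile_k) weight tile
    when quantization groups run along the K dimension."""
    groups_per_k_tile = -(-tile_k // group_size)   # ceil division
    count = tile_n * groups_per_k_tile
    return count, count * scale_bytes

for gs in (32, 64, 128):
    count, nbytes = scales_per_tile(tile_n=128, tile_k=128, group_size=gs)
    print(f"group_size={gs:3d}: {count} scales -> {nbytes} B of shared memory")
```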

Asymmetric vs Symmetric Quantization

The granularity discussion applies to both symmetric and asymmetric quantization. Symmetric quantization maps zero to zero and uses only a scale factor. Asymmetric quantization also uses a zero-point to handle distributions not centered at zero.

def quantize_per_group_asymmetric(weights, bits, group_size):
    """Per-group asymmetric quantization."""
    C_out, C_in = weights.shape
    qmax = 2 ** bits - 1
    qmin = 0

    num_groups = (C_in + group_size - 1) // group_size
    padded_C_in = num_groups * group_size

    if padded_C_in > C_in:
        weights_padded = np.zeros((C_out, padded_C_in), dtype=weights.dtype)
        weights_padded[:, :C_in] = weights
    else:
        weights_padded = weights

    reshaped = weights_padded.reshape(C_out, num_groups, group_size)

    w_min = np.min(reshaped, axis=2)   # (C_out, num_groups)
    w_max = np.max(reshaped, axis=2)

    scales = (w_max - w_min) / qmax
    scales = np.where(scales > 0, scales, 1.0)

    zero_points = np.round(-w_min / scales).astype(np.int32)
    zero_points = np.clip(zero_points, qmin, qmax)

    quantized = np.clip(
        np.round(reshaped / scales[:, :, np.newaxis] + zero_points[:, :, np.newaxis]),
        qmin, qmax
    ).astype(np.uint8)

    quantized = quantized.reshape(C_out, padded_C_in)[:, :C_in]
    return quantized, scales, zero_points

def dequantize_per_group_asymmetric(quantized, scales, zero_points, group_size):
    C_out, C_in = quantized.shape
    num_groups = scales.shape[1]
    padded_C_in = num_groups * group_size

    if padded_C_in > C_in:
        q_padded = np.zeros((C_out, padded_C_in), dtype=quantized.dtype)
        q_padded[:, :C_in] = quantized
    else:
        q_padded = quantized

    reshaped = q_padded.reshape(C_out, num_groups, group_size)
    dequantized = (reshaped.astype(np.float32) - zero_points[:, :, np.newaxis]) * scales[:, :, np.newaxis]
    return dequantized.reshape(C_out, padded_C_in)[:, :C_in]

When asymmetric matters: For weights, most LLM layers have approximately symmetric distributions (centered around zero), so symmetric quantization works well. However, after applying ReLU-like activations, outputs are non-negative, and asymmetric quantization recovers up to 1 additional bit of effective precision.

For INT4 quantization: symmetric maps to [-8, 7] (16 levels), asymmetric maps to [0, 15] (16 levels). If the weight distribution is symmetric around zero, both use the 16 levels equally well. If the distribution is shifted (e.g., all positive values after ReLU), symmetric wastes half its range on negative values, while asymmetric uses all 16 levels for the positive range.
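The shifted-distribution case is easy to verify numerically. This sketch quantizes a ReLU-style non-negative tensor to 4 bits both ways (size and seed are arbitrary); the asymmetric step size is smaller by 15/7, so its MSE should be roughly (15/7)^2 ≈ 4.6x lower:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.maximum(rng.standard_normal(100_000).astype(np.float32), 0.0)  # post-ReLU

# Symmetric INT4: scale sized for [-max, max], negative half of [-8, 7] unused
s_sym = np.abs(x).max() / 7
x_sym = np.clip(np.round(x / s_sym), -8, 7) * s_sym

# Asymmetric INT4: [min, max] mapped onto all 16 levels of [0, 15]
s_asym = (x.max() - x.min()) / 15
z = np.round(-x.min() / s_asym)
x_asym = (np.clip(np.round(x / s_asym + z), 0, 15) - z) * s_asym

mse_sym = np.mean((x - x_sym) ** 2)
mse_asym = np.mean((x - x_asym) ** 2)
print(f"symmetric MSE {mse_sym:.2e}, asymmetric MSE {mse_asym:.2e}")
```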

Perplexity Impact: Real Model Benchmarks

The following benchmarks quantize Llama-2 7B weights at different granularities and measure perplexity on WikiText-2 (lower is better):

📊 Llama-2 7B WikiText-2 Perplexity by Quantization Granularity

  Method                  Bits   Group Size   Perplexity   Model Size (GB)   Degradation
  FP16 (baseline)         16     ---          5.47         13.0              ---
  Per-tensor symmetric    8      ---          5.48         6.5               +0.01
  Per-channel symmetric   8      ---          5.47         6.5               +0.00
  Per-tensor symmetric    4      ---          8.92         3.3               +3.45
  Per-channel symmetric   4      ---          7.31         3.3               +1.84
  Per-group symmetric     4      256          5.84         3.4               +0.37
  Per-group symmetric     4      128          5.68         3.5               +0.21
  Per-group symmetric     4      64           5.59         3.6               +0.12
  Per-group symmetric     4      32           5.54         3.8               +0.07

Note: At INT8, granularity barely matters. At INT4, per-tensor quantization is catastrophic (+3.45 ppl), per-channel helps (+1.84), and per-group with g=128 is nearly lossless (+0.21). This is RTN only -- GPTQ and AWQ further reduce the gap.

INT4 Perplexity vs Group Size (Llama-2 7B, RTN, WikiText-2 perplexity):

  FP16 baseline:    5.47
  Per-group g=32:   5.54
  Per-group g=64:   5.59
  Per-group g=128:  5.68  (default choice)
  Per-group g=256:  5.84
  Per-channel:      7.31
  Per-tensor:       8.92  (unusable)

Key observations:

  1. INT8 is insensitive to granularity. With 256 levels, even per-tensor quantization distributes weights well enough that the error is negligible. Per-channel is slightly better and costs almost nothing, so INT8 deployments universally use per-channel scaling.

  2. INT4 is highly sensitive to granularity. With only 16 levels, per-tensor quantization causes massive quality loss. Per-group with g=128 reduces the loss to acceptable levels.

  3. Diminishing returns below g=128. Going from g=128 to g=64 saves 0.09 ppl but adds 3.1% overhead. Going from g=64 to g=32 saves 0.05 ppl but adds 6.2% overhead. The cost-benefit curve flattens rapidly.
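The diminishing-returns observation can be made concrete with the numbers from the benchmark table above (a quick sketch; the perplexity and overhead figures are copied from that table):

```python
# (group_size, perplexity, overhead_%) for INT4 RTN on Llama-2 7B, from the table
results = [(256, 5.84, 1.6), (128, 5.68, 3.1), (64, 5.59, 6.2), (32, 5.54, 12.5)]

ratios = []
for (g_hi, ppl_hi, ov_hi), (g_lo, ppl_lo, ov_lo) in zip(results, results[1:]):
    d_ppl = ppl_hi - ppl_lo   # perplexity recovered by halving the group size
    d_ov = ov_lo - ov_hi      # extra scale metadata paid for it (% of weights)
    ratios.append(d_ppl / d_ov)
    print(f"g={g_hi} -> g={g_lo}: -{d_ppl:.2f} ppl for +{d_ov:.1f}% memory "
          f"({d_ppl / d_ov:.3f} ppl per +1% overhead)")
```

Each halving of the group size buys less perplexity per percent of overhead, which is why g=128 is the usual stopping point.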

Group Size Selection in Practice

The standard group sizes used by major quantization tools:

# GPTQ default: group_size=128
# AWQ default: group_size=128
# AutoRound: group_size=128
# llama.cpp Q4_K: 256-element super-blocks split into 32-element sub-blocks
#   (sub-block scales are themselves quantized against a per-super-block scale)
# llama.cpp Q4_K_M: mixes quant types across tensors (some kept at higher precision)
# ExLlamaV2: group_size=128 or 32

GROUP_SIZE_RECOMMENDATIONS = {
    'int4_gpu': 128,    # Best tradeoff for GPU inference (W4A16)
    'int4_cpu': 256,    # llama.cpp default, lower overhead for CPU
    'int8_gpu': -1,     # Per-channel (no groups needed for INT8)
    'int4_quality': 32, # Maximum quality, accept overhead
    'int4_speed': 128,  # Default for Marlin kernel compatibility
}
💡 The 128 Consensus

group_size=128 is the de facto standard for INT4 weight quantization. It provides a good tradeoff between quality and overhead, aligns well with GPU tile sizes, and is supported by all major inference kernels (Marlin, ExLlama, AutoAWQ). Unless you have a specific reason to choose differently, use 128.

End-to-End: Quantize a Linear Layer with Group Scaling

Here is a complete implementation that quantizes a PyTorch linear layer with per-group symmetric quantization and runs inference:

import torch
import torch.nn as nn

class QuantizedLinearPerGroup(nn.Module):
    """INT4 per-group quantized linear layer for inference."""

    def __init__(self, in_features, out_features, group_size=128, bits=4):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.group_size = group_size
        self.bits = bits
        self.qmax = 2 ** (bits - 1) - 1
        self.qmin = -(2 ** (bits - 1))

        num_groups = (in_features + group_size - 1) // group_size

        # Quantized weights stored as int8 (containing int4 values)
        self.register_buffer(
            'weight_q',
            torch.zeros(out_features, in_features, dtype=torch.int8)
        )
        # Per-group scale factors
        self.register_buffer(
            'scales',
            torch.zeros(out_features, num_groups, dtype=torch.float16)
        )

    @staticmethod
    def from_float(linear_layer, group_size=128, bits=4):
        """Quantize a float linear layer."""
        in_f = linear_layer.in_features
        out_f = linear_layer.out_features
        ql = QuantizedLinearPerGroup(in_f, out_f, group_size, bits)

        W = linear_layer.weight.data.float()  # (out_f, in_f)
        qmax = 2 ** (bits - 1) - 1

        num_groups = (in_f + group_size - 1) // group_size
        padded_in = num_groups * group_size

        if padded_in > in_f:
            W_pad = torch.zeros(out_f, padded_in)
            W_pad[:, :in_f] = W
            W = W_pad

        W_grouped = W.reshape(out_f, num_groups, group_size)
        abs_max = W_grouped.abs().amax(dim=2)  # (out_f, num_groups)
        scales = abs_max / qmax
        scales = scales.clamp(min=1e-10)

        W_q = (W_grouped / scales.unsqueeze(2)).round().clamp(-qmax - 1, qmax)
        W_q = W_q.reshape(out_f, padded_in)[:, :in_f].to(torch.int8)

        ql.weight_q.copy_(W_q)
        ql.scales.copy_(scales.half())
        return ql

    def forward(self, x):
        """Dequantize weights and compute linear transformation."""
        # Dequantize
        num_groups = self.scales.shape[1]
        padded_in = num_groups * self.group_size

        W_q = self.weight_q.float()
        if padded_in > self.in_features:
            W_pad = torch.zeros(
                self.out_features, padded_in,
                device=W_q.device
            )
            W_pad[:, :self.in_features] = W_q
            W_q = W_pad

        W_grouped = W_q.reshape(
            self.out_features, num_groups, self.group_size
        )
        scales = self.scales.float().unsqueeze(2)
        W_deq = (W_grouped * scales).reshape(
            self.out_features, padded_in
        )[:, :self.in_features]

        return x @ W_deq.T

# Usage
linear = nn.Linear(4096, 4096, bias=False)
nn.init.normal_(linear.weight, std=0.02)

ql = QuantizedLinearPerGroup.from_float(linear, group_size=128, bits=4)

x = torch.randn(1, 32, 4096)  # (batch, seq_len, hidden)

with torch.no_grad():
    y_fp = linear(x)
    y_q = ql(x)

# Measure output error
mse = ((y_fp - y_q) ** 2).mean().item()
cos_sim = torch.nn.functional.cosine_similarity(
    y_fp.flatten(), y_q.flatten(), dim=0
).item()

print(f"Output MSE: {mse:.6e}")
print(f"Cosine similarity: {cos_sim:.8f}")
# Output MSE: ~2e-02 (the output variance is ~1.6, so relative error is ~1%)
# Cosine similarity: ~0.994

Interaction with GPTQ and AWQ

The granularity choice is orthogonal to the quantization algorithm. GPTQ and AWQ both operate on top of per-group scaling:

  1. GPTQ uses Hessian-based error compensation to choose optimal rounding directions. It processes columns within each group, using the Hessian to redistribute rounding errors across remaining columns. The per-group scale is computed first, then GPTQ optimizes within that grid.

  2. AWQ identifies salient channels (those with large activation magnitudes) and upscales them before quantization, then downscales activations correspondingly. The per-group scale factor naturally gives salient channels more integer levels if they dominate the group maximum.

The combination of per-group scaling + GPTQ or AWQ is strictly better than per-group scaling + RTN. The granularity handles the scale mismatch; the algorithm handles the rounding optimization.

📊 INT4 Perplexity: Granularity × Algorithm (Llama-2 7B)

  Algorithm   Per-Tensor   Per-Channel   Per-Group g=128
  RTN         8.92         7.31          5.68
  GPTQ        7.14         6.02          5.53
  AWQ         6.88         5.89          5.51

Note: Both algorithm and granularity independently reduce error. Per-group g=128 + AWQ achieves 5.51 ppl vs 5.47 FP16 baseline -- a gap of only 0.04.

Scale Factor Data Types

The scale factors themselves can be stored in different precisions:

scale_formats = {
    'fp32': {'bytes': 4, 'range': '3.4e38', 'precision': '7 digits'},
    'fp16': {'bytes': 2, 'range': '65504', 'precision': '3 digits'},
    'bf16': {'bytes': 2, 'range': '3.4e38', 'precision': '2 digits'},
    'fp8_e4m3': {'bytes': 1, 'range': '448', 'precision': '1 digit'},
}

# Effective bits per weight with different scale formats
# For INT4, group_size=128:
for fmt, info in scale_formats.items():
    scale_bits = info['bytes'] * 8
    eff_bits = 4 + scale_bits / 128
    print(f"  {fmt:>10s}: {eff_bits:.3f} bits/weight "
          f"(+{scale_bits/128:.3f} overhead)")
       fp32: 4.250 bits/weight (+0.250 overhead)
       fp16: 4.125 bits/weight (+0.125 overhead)
       bf16: 4.125 bits/weight (+0.125 overhead)
  fp8_e4m3: 4.062 bits/weight (+0.062 overhead)

Most implementations use FP16 scales. FP8 scales are emerging on Hopper/Blackwell hardware where FP8 tensor core dequantization is native.

When to Use Each Granularity

Per-tensor (one scale for entire matrix):

  • INT8 weights when all channels have similar magnitude
  • Activation quantization where the range is computed per-token anyway
  • Never use for INT4 weight quantization

Per-channel (one scale per output row):

  • INT8 weight quantization (standard choice)
  • When kernel support for group quantization is unavailable
  • When minimizing metadata overhead matters more than quality

Per-group (one scale per group of g elements):

  • INT4 weight quantization (always)
  • When outlier channels are present (always in LLMs)
  • group_size=128 as default, 32 for maximum quality, 256 for CPU inference