Part 17 of 30 in the series Quantization Masterclass

Before GPTQ, the industry consensus was that 4-bit weights were too aggressive for production use: round-to-nearest quantization at INT4 destroyed model quality, and per-group scaling could only get you so far. Then in 2022, Frantar et al. published GPTQ and demonstrated that with the right algorithm, you could compress a 175B-parameter model to 4 bits with near-zero perplexity degradation. The key insight wasn't just better rounding: it was using second-order information (the Hessian) to redistribute quantization error onto not-yet-quantized weights, turning what would be catastrophic rounding noise into a controlled, compensated approximation. GPTQ made 4-bit LLMs not just possible, but practical.

GPTQ is a one-shot weight quantization method based on Hessian information. It quantizes the weight matrix column by column, using the inverse Hessian to optimally redistribute each column's rounding error onto the remaining unquantized columns. This error compensation is what separates GPTQ from round-to-nearest: instead of accepting the rounding error, GPTQ adjusts the remaining weights to compensate for it.

The algorithm descends from Optimal Brain Surgeon (OBS, Hassibi & Stork 1993), which uses second-order (Hessian) information to decide which weight to prune and how to adjust the remaining weights so that the increase in loss is minimized. GPTQ adapts this to quantization: instead of pruning weights to zero, it rounds them to the nearest quantization level and compensates accordingly.

This post derives the algorithm from first principles, implements it from scratch, and covers every practical detail: Cholesky decomposition, lazy batch updates, act_order, and group quantization.

From OBS to GPTQ: The Mathematical Foundation

The Quantization Objective

For a linear layer $Y = XW^T$ with $W \in \mathbb{R}^{N \times K}$, we want to find quantized weights $\hat{W}$ that minimize the output reconstruction error (the factor of $\tfrac{1}{2}$ is the standard OBS convention and does not change the minimizer):

$$\mathcal{L} = \tfrac{1}{2}\|XW^T - X\hat{W}^T\|_F^2 = \tfrac{1}{2}\|X(W - \hat{W})^T\|_F^2$$

Let $\delta_i = w_i - \hat{w}_i$ be the perturbation of row $i$ of $W$. Because each row of $W$ produces one output channel independently, the loss decomposes row by row:

$$\mathcal{L} = \tfrac{1}{2}\sum_i \delta_i^T H \delta_i$$

where $H = X^T X \in \mathbb{R}^{K \times K}$ is the Hessian of the reconstruction error with respect to the weights (the same for all rows).
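The row-wise decomposition is easy to verify numerically. A minimal sketch (the constant factor in front of the loss cancels on both sides of the identity, so it is omitted here):

```python
import torch

torch.manual_seed(0)
n, N, K = 64, 4, 8
X = torch.randn(n, K)
W = torch.randn(N, K)
W_hat = W + 0.01 * torch.randn(N, K)   # perturbed (e.g. quantized) weights

H = X.T @ X                             # Hessian, shared by all rows
Delta = W - W_hat

# ||X (W - W_hat)^T||_F^2 == sum_i delta_i^T H delta_i
lhs = torch.norm(X @ Delta.T, p='fro') ** 2
rhs = sum(Delta[i] @ H @ Delta[i] for i in range(N))

assert torch.allclose(lhs, rhs, rtol=1e-4)
```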

The OBS Update Rule

OBS asks: if we fix weight $w_j$ to a specific value (quantize it), what is the optimal adjustment to the remaining weights to minimize $\mathcal{L}$?

For a row $\mathbf{w}$ of $W$, fixing column $j$ means:

$$w_j \rightarrow \hat{w}_j = \text{Quantize}(w_j)$$

The quantization error for column $j$ is:

$$\delta_j = w_j - \hat{w}_j$$

The optimal compensation for the remaining weights (columns $k \neq j$) is:

$$\delta_{\text{rest}} = -\delta_j \cdot \frac{H^{-1}_{:,j}}{H^{-1}_{jj}}$$

This means: after quantizing column $j$, update every other column $k$ by:

$$w_k \leftarrow w_k - \delta_j \cdot \frac{H^{-1}_{kj}}{H^{-1}_{jj}}$$

The resulting increase in loss from quantizing column $j$ is:

$$\Delta \mathcal{L}_j = \frac{\delta_j^2}{2\, H^{-1}_{jj}}$$

import torch

def obs_single_column_update(w_row, H_inv, col_j, quantize_fn):
    """Apply OBS update for quantizing a single column.

    Args:
        w_row: weight row, shape (K,), modified in-place
        H_inv: inverse Hessian, shape (K, K)
        col_j: column index to quantize
        quantize_fn: function that maps float -> quantized float

    Returns:
        w_q_j: quantized value of column j
        delta_loss: increase in loss from this quantization
    """
    w_j = w_row[col_j].item()
    w_q_j = quantize_fn(w_j)
    delta_j = w_j - w_q_j

    # Update remaining columns
    K = len(w_row)
    for k in range(K):
        if k != col_j:
            w_row[k] -= delta_j * H_inv[k, col_j] / H_inv[col_j, col_j]

    w_row[col_j] = w_q_j

    delta_loss = delta_j ** 2 / (2 * H_inv[col_j, col_j])
    return w_q_j, delta_loss
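A quick self-contained sanity check of the update rule (using a toy 0.25-step quantization grid of my choosing): the compensated loss should match the closed-form $\Delta\mathcal{L}_j$ exactly, and should never exceed the loss of plain uncompensated rounding.

```python
import torch

torch.manual_seed(0)
K = 8
X = torch.randn(128, K)
H = X.T @ X
H_inv = torch.linalg.inv(H)
w = torch.randn(K)

j = 3
w_q = round(w[j].item() * 4) / 4          # snap coordinate j to a 0.25 grid
delta_j = w[j].item() - w_q

# No compensation: only coordinate j is perturbed
loss_plain = 0.5 * delta_j ** 2 * H[j, j]

# OBS compensation of the remaining coordinates
w_comp = w.clone()
w_comp -= delta_j * H_inv[:, j] / H_inv[j, j]
w_comp[j] = w_q
d = w - w_comp
loss_obs = 0.5 * d @ H @ d

# Matches the closed-form delta L, and never exceeds plain rounding
expected = torch.tensor(delta_j ** 2 / (2 * H_inv[j, j].item()))
assert torch.allclose(loss_obs, expected, rtol=1e-3, atol=1e-6)
assert loss_obs <= loss_plain + 1e-6
```

The inequality holds because $1/H^{-1}_{jj} \le H_{jj}$ for any positive definite $H$: spreading the error over the other columns can only help.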

The Hessian: $H = X^T X$

For a linear layer, the Hessian of the reconstruction error is simply $H = X^T X$, where $X$ is the calibration data passed through to this layer's input:

def compute_hessian(X_calibration):
    """Compute the Hessian H = X^T X from calibration data.

    Args:
        X_calibration: shape (num_tokens, K), FP32

    Returns:
        H: shape (K, K), symmetric positive semi-definite
    """
    # X^T X, normalized by number of tokens
    n_tokens = X_calibration.shape[0]
    H = X_calibration.T @ X_calibration  # (K, K)
    H /= n_tokens

    return H

def add_dampening(H, damp_pct=0.01):
    """Add dampening to make H positive definite.

    H_damped = H + damp * I, where damp = damp_pct * mean(diag(H))
    """
    diag_mean = torch.diag(H).mean()
    damp = damp_pct * diag_mean
    H_damped = H + damp * torch.eye(H.shape[0], device=H.device)
    return H_damped
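In practice the calibration set is streamed through the model batch by batch rather than materialized as one big matrix, so implementations typically maintain a running Hessian. A sketch of that pattern (the helper name `accumulate_hessian` is mine, not from any library):

```python
import torch

def accumulate_hessian(H, n_seen, X_batch):
    """Running update of H = X^T X / n over streamed calibration batches.

    Keeps the same normalized Hessian as compute_hessian() without
    ever holding all calibration tokens in memory at once.
    """
    n_new = X_batch.shape[0]
    n_total = n_seen + n_new
    # Rescale the old average, then add the new batch's contribution
    H *= n_seen / n_total
    H += X_batch.T @ X_batch / n_total
    return H, n_total

# Streaming over 4 batches matches the one-shot computation
torch.manual_seed(0)
X = torch.randn(400, 8)
H, n = torch.zeros(8, 8), 0
for chunk in X.chunk(4):
    H, n = accumulate_hessian(H, n, chunk)
assert torch.allclose(H, X.T @ X / 400, atol=1e-4)
```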
ℹ️ Why the Hessian is the Same for All Rows

The Hessian $H = X^T X$ depends only on the input activations $X$, not on the weights $W$. Since every row of $W$ sees the same input $X$, the Hessian is identical for all $N$ output rows. GPTQ computes the Hessian once and reuses it for all rows. This is why GPTQ runs in $O(K^3 + NK^2)$ time rather than $O(NK^3)$.

Column-Wise Processing with Cholesky Decomposition

GPTQ processes columns left to right ($j = 0, 1, \ldots, K-1$). After quantizing column $j$, that column must be removed from the inverse Hessian used for all subsequent compensation. The key insight is that this can be done efficiently using the Cholesky decomposition of $H^{-1}$.

Why Cholesky?

After quantizing column $j$, we need the inverse Hessian of the remaining $(K - j - 1) \times (K - j - 1)$ submatrix. Rather than re-inverting the Hessian at each step ($O(K^3)$ per column), GPTQ precomputes the Cholesky factor of $H^{-1}$ and reads off the needed values:

def gptq_cholesky(H):
    """Compute the Cholesky factor of H^{-1} used by GPTQ.

    H must be positive definite (use dampening).

    Returns H_inv and its upper Cholesky factor U, with H^{-1} = U^T U.
    Note that GPTQ takes the Cholesky of H^{-1}, not of H.
    """
    # Invert H (via its own Cholesky factorization, for stability)
    H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))

    # Upper-triangular Cholesky decomposition: H_inv = U^T @ U
    try:
        U = torch.linalg.cholesky(H_inv, upper=True)
    except RuntimeError:
        # If numerically not positive definite, add a little more dampening
        H_inv += 1e-5 * torch.eye(H_inv.shape[0], device=H_inv.device)
        U = torch.linalg.cholesky(H_inv, upper=True)

    return H_inv, U

Row $j$ of the factor $U$ contains exactly what step $j$ needs: $U_{jj}^2$ equals the diagonal entry $[H_F^{-1}]_{jj}$ of the inverse Hessian restricted to the not-yet-quantized columns $F = \{j, \ldots, K-1\}$, and $U_{j,k}/U_{jj}$ gives the compensation coefficient for each remaining column $k$. No re-inversion is ever needed.
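This relationship can be verified directly: with $H^{-1} = U^T U$ and $U$ upper triangular, the inverse of the trailing block $H[j{:}, j{:}]$ equals $U[j{:}, j{:}]^T\, U[j{:}, j{:}]$, so its leading diagonal entry is $U_{jj}^2$ and its leading row is proportional to $U_{j, j:}$. A numerical check:

```python
import torch

torch.manual_seed(0)
K = 16
X = torch.randn(256, K)
H = X.T @ X + 0.01 * torch.eye(K)             # dampened Hessian
H_inv = torch.linalg.inv(H)
U = torch.linalg.cholesky(H_inv, upper=True)  # H_inv = U^T U

j = 5
# Inverse Hessian of the remaining columns {j, ..., K-1}
H_inv_rest = torch.linalg.inv(H[j:, j:])

# U[j,j]^2 is the diagonal entry the update formula divides by ...
assert torch.allclose(U[j, j] ** 2, H_inv_rest[0, 0], rtol=1e-3, atol=1e-5)
# ... and U[j, j:] / U[j, j] gives the compensation coefficients
assert torch.allclose(U[j, j:] / U[j, j],
                      H_inv_rest[0, :] / H_inv_rest[0, 0],
                      rtol=1e-3, atol=1e-5)
```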

The Full GPTQ Algorithm

def gptq_quantize_weight(
    W,              # (N, K) weight matrix
    H,              # (K, K) Hessian
    bits=4,
    group_size=128,
    damp_pct=0.01,
    act_order=False,
):
    """Full GPTQ quantization of a weight matrix.

    Args:
        W: FP32 weight matrix, shape (N, K)
        H: Hessian X^T X, shape (K, K)
        bits: quantization bits
        group_size: per-group quantization group size (-1 for per-channel)
        damp_pct: dampening percentage for Hessian
        act_order: if True, process columns in descending Hessian diagonal order

    Returns:
        W_q: quantized weight matrix, shape (N, K), dtype int8
        scales: per-group scales, shape (N, num_groups)
        errors: per-row quantization errors
    """
    N, K = W.shape
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))

    # Add dampening
    H = H.clone()
    diag_mean = torch.diag(H).mean()
    H += damp_pct * diag_mean * torch.eye(K, device=H.device)

    # Column ordering
    if act_order:
        # Process columns with largest Hessian diagonal first
        # (most important columns first, for better error compensation)
        perm = torch.argsort(torch.diag(H), descending=True)
        W = W[:, perm]
        H = H[perm][:, perm]
    else:
        perm = torch.arange(K, device=W.device)

    # Upper Cholesky factor U of H^{-1} (H^{-1} = U^T U). Row j of U
    # encodes the inverse Hessian of the not-yet-quantized columns
    # {j, ..., K-1}, which is what the OBS update at step j needs.
    H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(H_inv, upper=True)

    # Work on a copy
    W_work = W.clone()
    W_q = torch.zeros(N, K, dtype=torch.int8, device=W.device)
    scales = torch.zeros(
        N, K // group_size if group_size > 0 else 1,
        device=W.device
    )
    errors = torch.zeros(N, device=W.device)

    # Process columns left to right
    for j in range(K):
        # Determine the group for this column
        if group_size > 0:
            group_idx = j // group_size

            # At the start of a new group, compute group scale
            if j % group_size == 0:
                group_start = j
                group_end = min(j + group_size, K)
                group_weights = W_work[:, group_start:group_end]
                group_max = group_weights.abs().amax(dim=1)
                group_scale = group_max / qmax
                group_scale = group_scale.clamp(min=1e-10)
                scales[:, group_idx] = group_scale
        else:
            if j == 0:
                group_scale = W_work.abs().amax(dim=1) / qmax
                group_scale = group_scale.clamp(min=1e-10)
                scales[:, 0] = group_scale

        # Quantize column j
        w_j = W_work[:, j]  # (N,)
        q_j = (w_j / group_scale).round().clamp(qmin, qmax)
        W_q[:, j] = q_j.to(torch.int8)

        # Dequantized value
        w_hat_j = q_j * group_scale

        # Quantization error
        delta_j = w_j - w_hat_j  # (N,)

        # OBS update via row j of the Cholesky factor:
        # [H_F^{-1}]_{jj} = U[j,j]^2 and [H_F^{-1}]_{jk} = U[j,j] * U[j,k],
        # so the compensation coefficient for column k is U[j,k] / U[j,j]
        u_jj = U[j, j]
        if j < K - 1:
            compensation = delta_j.unsqueeze(1) * (U[j, j+1:] / u_jj).unsqueeze(0)
            W_work[:, j+1:] -= compensation

        # Accumulate error: delta^2 / (2 * [H_F^{-1}]_{jj})
        errors += delta_j ** 2 / (2 * u_jj ** 2)

    # Undo permutation if act_order was used
    if act_order:
        inv_perm = torch.argsort(perm)
        W_q = W_q[:, inv_perm]

    return W_q, scales, errors
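To see the payoff, here is a self-contained toy comparison of GPTQ-style compensated quantization against plain round-to-nearest (a simplified sketch, not the function above: per-channel scales only, with sizes and the random mixing matrix chosen arbitrarily for the demo):

```python
import torch

torch.manual_seed(0)
n, N, K, bits = 512, 16, 64, 4
qmax = 2 ** (bits - 1) - 1

# Correlated activations (random mixing matrix), so the Hessian
# has meaningful off-diagonal structure for compensation to exploit
A = torch.randn(K, K) / K ** 0.5
X = torch.randn(n, K) @ A
W = torch.randn(N, K)
H = X.T @ X / n
H += 0.01 * torch.diag(H).mean() * torch.eye(K)  # dampening

# Fixed per-channel scales, shared by both methods
scale = (W.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-10)

# Round-to-nearest baseline
W_rtn = (W / scale).round().clamp(-qmax - 1, qmax) * scale

# GPTQ-style: sequential quantization with OBS compensation
# via rows of the upper Cholesky factor of H^{-1}
U = torch.linalg.cholesky(
    torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
W_work = W.clone()
W_gptq = torch.zeros_like(W)
for j in range(K):
    w_hat = (W_work[:, j:j+1] / scale).round().clamp(-qmax - 1, qmax) * scale
    W_gptq[:, j:j+1] = w_hat
    err = (W_work[:, j:j+1] - w_hat) / U[j, j]
    W_work[:, j+1:] -= err * U[j, j+1:].unsqueeze(0)

err_rtn = torch.norm(X @ (W - W_rtn).T) ** 2
err_gptq = torch.norm(X @ (W - W_gptq).T) ** 2
print(f"RTN reconstruction error:  {err_rtn.item():.3f}")
print(f"GPTQ reconstruction error: {err_gptq.item():.3f}")
```

On data like this the compensated error comes out well below the RTN error; on real LLM layers, where activations are strongly correlated, the gap is what makes INT4 viable.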

Lazy Batch Updates

The column-by-column update W_work[:, j+1:] -= compensation is expensive: it touches all remaining columns for every quantized column. GPTQ uses “lazy batch” updates to amortize this cost.

Instead of updating the working weights after every column, GPTQ accumulates the updates and applies them in batches:

def gptq_quantize_lazy_batch(
    W, H, bits=4, group_size=128, damp_pct=0.01,
    block_size=128,  # Lazy batch size
):
    """GPTQ with lazy batch updates for efficiency.

    Instead of updating W after every column, accumulate updates
    and apply them every block_size columns. This reduces the
    number of large matrix operations from K to K/block_size.
    """
    N, K = W.shape
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))

    H = H.clone()
    diag_mean = torch.diag(H).mean()
    H += damp_pct * diag_mean * torch.eye(K, device=H.device)

    # Upper Cholesky factor of H^{-1}; row j of U gives the OBS
    # coefficients for the not-yet-quantized columns {j, ..., K-1}
    H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(H_inv, upper=True)

    W_work = W.clone()
    W_q = torch.zeros_like(W)
    scales = torch.zeros(N, K // group_size, device=W.device)

    # Process in blocks
    for block_start in range(0, K, block_size):
        block_end = min(block_start + block_size, K)

        # Block of the Cholesky factor we need
        U_block = U[block_start:block_end, block_start:block_end]

        # Scaled errors delta_j / U[j, j] for this block
        # (to be propagated to the remaining blocks in one matmul)
        block_errors = torch.zeros(N, block_end - block_start, device=W.device)

        for j_local in range(block_end - block_start):
            j = block_start + j_local

            # Group scale
            if group_size > 0 and j % group_size == 0:
                gi = j // group_size
                gs = min(j + group_size, K)
                g_w = W_work[:, j:gs]
                g_max = g_w.abs().amax(dim=1)
                g_scale = (g_max / qmax).clamp(min=1e-10)
                scales[:, gi] = g_scale

            # Quantize column j
            w_j = W_work[:, j]
            q_j = (w_j / g_scale).round().clamp(qmin, qmax)
            W_q[:, j] = q_j
            w_hat_j = q_j * g_scale
            delta_j = w_j - w_hat_j

            # Store delta_j / U[j, j]; multiplying by row j of U later
            # reproduces the OBS update delta_j * U[j, k] / U[j, j]
            block_errors[:, j_local] = delta_j / U_block[j_local, j_local]

            # Update remaining columns WITHIN this block only
            if j_local < block_end - block_start - 1:
                comp = block_errors[:, j_local].unsqueeze(1) * U_block[j_local, j_local+1:].unsqueeze(0)
                W_work[:, j+1:block_end] -= comp

        # Propagate accumulated errors to remaining columns (lazy update)
        if block_end < K:
            # This is the key lazy batch step: a single matmul applies the
            # compensation from every column in the block at once
            W_work[:, block_end:] -= block_errors @ U[block_start:block_end, block_end:]

    return W_q, scales
Lazy Batch: From K Large Matrix Updates to K/B

Lazy batching does not reduce the total FLOP count, which stays $O(NK^2)$; it reduces how many separate large updates express that work. Without it, GPTQ launches K large matrix updates, each touching up to K columns. With batch size B=128, the number of large updates drops from K to K/B. For K=4096 and B=128, that is a 32x reduction in the number of matrix operations, and the resulting improvement in GPU utilization reduces quantization time from hours to minutes.
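The equivalence of eager and lazy application is just "a sum of rank-1 updates equals one matmul". A small sketch with made-up Cholesky rows and errors:

```python
import torch

torch.manual_seed(0)
N, B, R = 8, 4, 12            # rows, block size, columns beyond the block
U_cross = torch.randn(B, R)   # Cholesky rows for the far columns
u_diag = torch.rand(B) + 0.5  # diagonal entries U[j, j]
deltas = torch.randn(N, B)    # per-column quantization errors

# Eager: one rank-1 update per quantized column
W_eager = torch.zeros(N, R)
for j in range(B):
    W_eager -= (deltas[:, j] / u_diag[j]).unsqueeze(1) * U_cross[j].unsqueeze(0)

# Lazy: a single matmul after the block
W_lazy = -(deltas / u_diag) @ U_cross

assert torch.allclose(W_eager, W_lazy, atol=1e-5)
```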

Act-Order: Processing Columns by Importance

The act_order (activation order) option reorders columns so that the most important columns (those with the largest Hessian diagonal, meaning the highest activation magnitudes) are quantized first.

Why does order matter? The OBS compensation mechanism is one-directional: when we quantize column $j$, we compensate by adjusting columns $j+1, j+2, \ldots, K-1$. The earlier a column is processed, the more remaining columns are available for compensation.

def gptq_with_act_order(W, H, bits=4, group_size=128, damp_pct=0.01):
    """GPTQ with act_order (column reordering by importance).

    Process columns in descending order of H[j,j] (diagonal of Hessian).
    Columns with larger H[j,j] have higher activation magnitudes and
    should be quantized first (more compensation available).
    """
    N, K = W.shape

    # Add dampening
    H = H.clone()
    H += damp_pct * torch.diag(H).mean() * torch.eye(K, device=H.device)

    # Column importance = diagonal of Hessian
    col_importance = torch.diag(H)  # (K,)

    # Sort: most important first
    perm = torch.argsort(col_importance, descending=True)
    inv_perm = torch.argsort(perm)

    # Reorder weights and Hessian
    W_perm = W[:, perm].clone()
    H_perm = H[perm][:, perm].clone()

    # Run GPTQ on reordered matrix
    W_q_perm, scales = gptq_quantize_lazy_batch(
        W_perm, H_perm, bits=bits,
        group_size=group_size, damp_pct=0,  # Already dampened
    )

    # Un-permute the quantized weights
    W_q = W_q_perm[:, inv_perm]

    return W_q, scales, perm
⚠️ Act-Order Breaks Marlin Compatibility

When act_order is enabled, the quantized weight columns are in a different order than the original. The dequantization kernel must know the permutation to correctly map between the quantized storage and the linear algebra. Marlin does not support act_order, so models quantized with act_order fall back to slower ExLlama kernels in vLLM. For deployment with Marlin, use act_order=False.

Numerical Stability: Dampening and Cholesky

The Hessian $H = X^T X$ can be singular or nearly singular if the calibration data does not excite all directions in activation space. This causes $H^{-1}$ to have very large values, leading to numerically unstable compensation updates.

def analyze_hessian_conditioning(H):
    """Analyze the conditioning of the Hessian for GPTQ stability."""
    eigenvalues = torch.linalg.eigvalsh(H)  # Sorted ascending

    condition_number = eigenvalues[-1] / eigenvalues[0].clamp(min=1e-20)
    num_near_zero = (eigenvalues < eigenvalues[-1] * 1e-6).sum().item()
    rank = (eigenvalues > eigenvalues[-1] * 1e-10).sum().item()

    print(f"  Condition number: {condition_number:.2e}")
    print(f"  Rank: {rank} / {H.shape[0]}")
    print(f"  Near-zero eigenvalues: {num_near_zero}")
    print(f"  Min eigenvalue: {eigenvalues[0]:.2e}")
    print(f"  Max eigenvalue: {eigenvalues[-1]:.2e}")

    return condition_number

def recommend_dampening(H):
    """Recommend dampening percentage based on Hessian conditioning."""
    cond = analyze_hessian_conditioning(H)

    if cond > 1e10:
        return 0.1, "High dampening needed (poorly conditioned)"
    elif cond > 1e6:
        return 0.01, "Standard dampening (typical)"
    else:
        return 0.001, "Light dampening (well conditioned)"
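A quick demonstration of why dampening matters, using a deliberately rank-deficient Hessian (fewer calibration tokens than input dimensions, so $X^T X$ cannot be full rank):

```python
import torch

torch.manual_seed(0)
# 64 calibration tokens cannot span 128 activation dimensions
X = torch.randn(64, 128)
H = X.T @ X / 64

def cond(H):
    ev = torch.linalg.eigvalsh(H)  # eigenvalues, ascending
    return (ev[-1] / ev[0].clamp(min=1e-20)).item()

# Standard 1% dampening on the mean diagonal
damped = H + 0.01 * torch.diag(H).mean() * torch.eye(128)

print(f"condition number before: {cond(H):.2e}")
print(f"condition number after:  {cond(damped):.2e}")
assert cond(damped) < cond(H)
```

The undamped matrix has (numerically) zero eigenvalues and an astronomical condition number; dampening floors the spectrum at roughly `damp`, making the inversion and Cholesky factorization well behaved.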

Complete GPTQ Implementation

Here is the full GPTQ implementation combining all techniques:

class GPTQQuantizer:
    """Complete GPTQ implementation for a single linear layer."""

    def __init__(self, bits=4, group_size=128, damp_pct=0.01,
                 act_order=False, block_size=128):
        self.bits = bits
        self.group_size = group_size
        self.damp_pct = damp_pct
        self.act_order = act_order
        self.block_size = block_size
        self.qmax = 2 ** (bits - 1) - 1
        self.qmin = -(2 ** (bits - 1))

    def quantize(self, W, H):
        """Quantize weight matrix W using Hessian H.

        Args:
            W: (N, K) float weight matrix
            H: (K, K) Hessian (X^T X / n_tokens)

        Returns:
            W_q: (N, K) quantized weights (int range)
            scales: (N, num_groups) per-group scale factors
            perm: column permutation (identity if act_order=False)
        """
        N, K = W.shape
        device = W.device

        # Dampening
        H = H.clone()
        damp = self.damp_pct * torch.diag(H).mean()
        H += damp * torch.eye(K, device=device)

        # Act-order permutation
        if self.act_order:
            perm = torch.argsort(torch.diag(H), descending=True)
            W = W[:, perm]
            H = H[perm][:, perm]
        else:
            perm = torch.arange(K, device=device)

        # Cholesky of H^{-1}: row j of the upper factor U encodes the
        # inverse Hessian of the not-yet-quantized columns {j, ..., K-1}
        try:
            H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
        except RuntimeError:
            H += 1e-4 * torch.eye(K, device=device)
            H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
        U = torch.linalg.cholesky(H_inv, upper=True)

        # Allocate outputs
        num_groups = K // self.group_size if self.group_size > 0 else 1
        W_q = torch.zeros(N, K, device=device)
        scales = torch.zeros(N, num_groups, device=device)
        W_work = W.clone()

        # Process in lazy batches
        for block_start in range(0, K, self.block_size):
            block_end = min(block_start + self.block_size, K)
            bsize = block_end - block_start

            U_block = U[block_start:block_end, block_start:block_end]
            # Scaled errors delta_j / U[j, j], consumed by the lazy update
            errors = torch.zeros(N, bsize, device=device)

            for j_local in range(bsize):
                j = block_start + j_local

                # Compute group scale at group boundaries
                if self.group_size > 0:
                    if j % self.group_size == 0:
                        gi = j // self.group_size
                        ge = min(j + self.group_size, K)
                        gmax = W_work[:, j:ge].abs().amax(dim=1)
                        g_scale = (gmax / self.qmax).clamp(min=1e-10)
                        scales[:, gi] = g_scale
                elif j == 0:
                    # Per-channel fallback: one scale per output row
                    g_scale = (W_work.abs().amax(dim=1) / self.qmax).clamp(min=1e-10)
                    scales[:, 0] = g_scale

                # Quantize
                w_j = W_work[:, j]
                q_j = (w_j / g_scale).round().clamp(self.qmin, self.qmax)
                W_q[:, j] = q_j
                delta = w_j - q_j * g_scale
                errors[:, j_local] = delta / U_block[j_local, j_local]

                # Intra-block compensation via row j of the Cholesky factor:
                # w_k -= delta * U[j, k] / U[j, j]
                if j_local < bsize - 1:
                    W_work[:, j+1:block_end] -= (
                        errors[:, j_local].unsqueeze(1)
                        * U_block[j_local, j_local+1:].unsqueeze(0)
                    )

            # Inter-block compensation (lazy batch): one matmul per block
            if block_end < K:
                W_work[:, block_end:] -= errors @ U[block_start:block_end, block_end:]

        # Un-permute
        if self.act_order:
            inv_perm = torch.argsort(perm)
            W_q = W_q[:, inv_perm]

        return W_q.to(torch.int8), scales, perm

Quantization Time Complexity

def gptq_time_complexity(N, K, block_size=128):
    """Analyze GPTQ time complexity.

    Step 1: Hessian computation: O(n * K^2) where n = calibration tokens
    Step 2: Cholesky inversion: O(K^3)
    Step 3: Column-wise quantization with lazy batch:
            - Intra-block: O(N * K * B) per block, K/B blocks = O(N * K^2)
            - Inter-block: O(N * B * (K-block)) per block = O(N * K^2)
    Total: O(K^3 + N * K^2)

    For K=4096, N=4096: K^3 = 68B, N*K^2 = 68B -> 136B operations
    """
    hessian_ops = K ** 3  # Cholesky + inversion
    quant_ops = N * K ** 2  # Column-wise updates
    total = hessian_ops + quant_ops

    return {
        'hessian_ops': hessian_ops,
        'quant_ops': quant_ops,
        'total_ops': total,
        'estimated_time_s': total / 1e12,  # assuming ~1 TFLOP/s effective throughput
    }

for K in [4096, 8192, 16384]:
    result = gptq_time_complexity(4096, K)
    print(f"  K={K}: total={result['total_ops']/1e9:.1f}G ops, "
          f"~{result['estimated_time_s']:.1f}s at ~1 TFLOP/s effective")
GPTQ Quantization Time by Model Size (A100 GPU)

| Model | Parameters | Layers | Time (INT4 g128) | Time (act_order) |
| --- | --- | --- | --- | --- |
| Llama-2 7B | 6.7B | 32 | ~8 min | ~12 min |
| Llama-2 13B | 13B | 40 | ~20 min | ~30 min |
| Llama-2 70B | 70B | 80 | ~3 hours | ~4.5 hours |
| Mixtral 8x7B | 47B | 32 x 8 | ~2 hours | ~3 hours |

Note: GPTQ quantization is dominated by the Cholesky inversion and column-wise updates. act_order adds ~50% overhead from the permutation and scale recomputation.

GPTQ vs AWQ: When to Use Which

GPTQ vs AWQ Comparison

| Aspect | GPTQ | AWQ |
| --- | --- | --- |
| Optimization target | Per-column Hessian compensation | Per-channel activation-aware scaling |
| Requires calibration | Yes (~128 samples) | Yes (~128 samples) |
| Quantization time (7B) | ~8 min | ~5 min |
| INT4 g128 perplexity | 5.53 | 5.51 |
| act_order support | Yes (better quality, slower kernel) | No |
| Marlin kernel compatible | Yes (without act_order) | Yes |
| Supports asymmetric | Yes | Yes |
| Memory during quantization | High (stores H^{-1}) | Low (grid search) |

Note: AWQ is slightly faster to quantize and gives marginally better perplexity at INT4. GPTQ with act_order gives the best absolute quality but breaks Marlin compatibility. For production, AWQ is generally preferred.

GPTQ vs AWQ Perplexity Across Bit-widths (Llama-2 7B)

WikiText-2 perplexity (lower is better):

| Configuration | Perplexity |
| --- | --- |
| FP16 baseline | 5.47 |
| GPTQ INT4 act_order | 5.50 (best quantized quality) |
| AWQ INT4 | 5.51 |
| GPTQ INT4 | 5.53 |
| AWQ INT3 | 6.72 |
| GPTQ INT3 | 6.98 |