Before GPTQ, the industry consensus was that 4-bit weights were too aggressive for production use—round-to-nearest quantization at INT4 destroyed model quality, and per-group scaling could only get you so far. Then in 2022, Frantar et al. published GPTQ and demonstrated that with the right algorithm, you could compress models at the scale of the 175B-parameter OPT and BLOOM to 4 bits with near-zero perplexity degradation. The key insight wasn’t just better rounding—it was using second-order information (the Hessian) to redistribute quantization errors across unquantized weights, turning what would be catastrophic rounding noise into a controlled, compensated approximation. GPTQ made 4-bit LLMs not just possible, but practical.
GPTQ is a one-shot weight quantization method based on Hessian information. It quantizes each weight column by column, using the inverse Hessian to optimally redistribute the rounding error of each column to all remaining unquantized columns. This error compensation is what separates GPTQ from round-to-nearest: instead of accepting the rounding error, GPTQ adjusts the remaining weights to compensate for it.
The algorithm descends from Optimal Brain Surgeon (OBS, Hassibi & Stork, 1993), which uses second-order (Hessian) information to determine the optimal weight to prune and how to adjust the remaining weights to minimize the increase in loss. GPTQ adapts this to quantization: instead of pruning weights to zero, it rounds them to the nearest quantization level.
This post derives the algorithm from first principles, implements it from scratch, and covers every practical detail: Cholesky decomposition, lazy batch updates, act_order, and group quantization.
From OBS to GPTQ: The Mathematical Foundation
The Quantization Objective
For a linear layer $Y = XW^\top$ with weights $W \in \mathbb{R}^{N \times K}$ and calibration inputs $X \in \mathbb{R}^{n \times K}$, we want to find quantized weights $\hat{W}$ that minimize the output reconstruction error:

$$\arg\min_{\hat{W}} \; \lVert XW^\top - X\hat{W}^\top \rVert_2^2$$

Let $\delta = \hat{w} - w$ be the perturbation of a single weight row. The loss can be written:

$$\Delta L = \tfrac{1}{2}\, \delta^\top H \,\delta, \qquad H = X^\top X$$

where each row of $W$ is treated independently, and $H$ is the Hessian of the layer output reconstruction error with respect to the weights (same for all rows).

The OBS Update Rule

OBS asks: if we fix weight $w_j$ to a specific value (quantize it), what is the optimal adjustment $\delta$ to the remaining weights to minimize $\Delta L$?

For row $w$ of $W$, fixing column $j$ means imposing the constraint:

$$e_j^\top \delta = q(w_j) - w_j$$

The quantization error for column $j$ is:

$$\delta_j = w_j - q(w_j)$$

The optimal compensation for the remaining weights (columns $k \neq j$) is:

$$\delta_{-j} = -\frac{\delta_j}{[H^{-1}]_{jj}}\, [H^{-1}]_{-j,\,j}$$

This means: after quantizing column $j$, update every other column $k$ by:

$$w_k \leftarrow w_k - \delta_j \,\frac{[H^{-1}]_{kj}}{[H^{-1}]_{jj}}$$

The resulting increase in loss from quantizing column $j$ is:

$$\Delta L_j = \frac{\delta_j^2}{2\,[H^{-1}]_{jj}}$$
```python
import torch


def obs_single_column_update(w_row, H_inv, col_j, quantize_fn):
    """Apply the OBS update for quantizing a single column.

    Args:
        w_row: weight row, shape (K,), modified in-place
        H_inv: inverse Hessian, shape (K, K)
        col_j: column index to quantize
        quantize_fn: function that maps float -> quantized float
    Returns:
        w_q_j: quantized value of column j
        delta_loss: increase in loss from this quantization
    """
    w_j = w_row[col_j].item()
    w_q_j = quantize_fn(w_j)
    delta_j = w_j - w_q_j
    # Redistribute the rounding error onto all remaining columns
    K = len(w_row)
    for k in range(K):
        if k != col_j:
            w_row[k] -= delta_j * H_inv[k, col_j] / H_inv[col_j, col_j]
    w_row[col_j] = w_q_j
    delta_loss = delta_j ** 2 / (2 * H_inv[col_j, col_j])
    return w_q_j, delta_loss
```
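The closed-form loss increase can be checked numerically. Below is a small self-contained sketch on synthetic data (the 0.25-step quantizer is an arbitrary stand-in): after the compensated update, the actual increase in $\tfrac{1}{2}\delta^\top H \delta$ matches $\delta_j^2 / (2[H^{-1}]_{jj})$ exactly, and is never worse than plain round-to-nearest on the same column.

```python
import torch

torch.manual_seed(0)
K = 16
X = torch.randn(256, K, dtype=torch.float64)
H = X.T @ X / 256                       # Hessian from synthetic calibration data
H_inv = torch.linalg.inv(H)

w = torch.randn(K, dtype=torch.float64)
quantize = lambda v: round(v * 4) / 4   # toy quantizer with step 0.25

j = 0
delta = w[j].item() - quantize(w[j].item())

# OBS: applying the compensation to ALL columns also fixes column j itself,
# since delta * H_inv[j, j] / H_inv[j, j] = delta.
w_obs = w - delta * H_inv[:, j] / H_inv[j, j]

# Round-to-nearest: quantize column j, touch nothing else
w_rtn = w.clone()
w_rtn[j] = quantize(w[j].item())

loss = lambda w_hat: 0.5 * (w_hat - w) @ H @ (w_hat - w)
predicted = delta ** 2 / (2 * H_inv[j, j].item())
```

The first assertion is an exact identity (using $H^{-1} H H^{-1} = H^{-1}$); the second follows from $[H^{-1}]_{jj} \geq 1/H_{jj}$ for positive definite $H$.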
The Hessian

The Hessian for a linear layer is simply $H = X^\top X$, where $X$ is the calibration data passed through the layer. It is computed directly from calibration data:
```python
def compute_hessian(X_calibration):
    """Compute the Hessian H = X^T X from calibration data.

    Args:
        X_calibration: shape (num_tokens, K), FP32
    Returns:
        H: shape (K, K), symmetric positive semi-definite
    """
    # X^T X, normalized by the number of tokens
    n_tokens = X_calibration.shape[0]
    H = X_calibration.T @ X_calibration  # (K, K)
    H /= n_tokens
    return H


def add_dampening(H, damp_pct=0.01):
    """Add dampening to make H positive definite.

    H_damped = H + damp * I, where damp = damp_pct * mean(diag(H)).
    """
    diag_mean = torch.diag(H).mean()
    damp = damp_pct * diag_mean
    H_damped = H + damp * torch.eye(H.shape[0], device=H.device)
    return H_damped
```
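In practice the full calibration set rarely sits in memory at once, so real implementations fold batches into a running Hessian estimate. A sketch of that accumulation (the running-average form here is an assumption, chosen to match the normalization used by compute_hessian above):

```python
import torch

def accumulate_hessian(H, n_seen, X_batch):
    """Fold one calibration batch into a running Hessian estimate.

    Maintains H = (1 / n_seen) * sum_i x_i x_i^T incrementally, so the
    result matches compute_hessian() run once on the concatenated data.
    """
    n_batch = X_batch.shape[0]
    n_total = n_seen + n_batch
    H *= n_seen / n_total                # rescale the old average
    H += X_batch.T @ X_batch / n_total   # fold in the new batch
    return H, n_total

# Streaming over two batches matches the one-shot computation
K = 8
X = torch.randn(100, K)
H_full = X.T @ X / 100

H, n = torch.zeros(K, K), 0
H, n = accumulate_hessian(H, n, X[:60])
H, n = accumulate_hessian(H, n, X[60:])
```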
The Hessian depends only on the input activations $X$, not on the weights $W$. Since every row of $W$ sees the same input $X$, the Hessian is identical for all output rows. GPTQ computes the Hessian once and reuses it for all $N$ rows. This is why GPTQ runs in $O(K^3 + N K^2)$ time rather than $O(N K^3)$.
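The row separability is easy to verify numerically: the total reconstruction error decomposes into independent per-row quadratic forms, all sharing the same $H$. A quick check on synthetic data:

```python
import torch

torch.manual_seed(0)
N, K, n = 4, 8, 64
X = torch.randn(n, K, dtype=torch.float64)
W = torch.randn(N, K, dtype=torch.float64)
W_hat = W + 0.01 * torch.randn(N, K, dtype=torch.float64)  # perturbed weights

H = X.T @ X                              # one Hessian, shared by every row
total_err = ((X @ W.T - X @ W_hat.T) ** 2).sum()
per_row = sum(
    (W_hat[r] - W[r]) @ H @ (W_hat[r] - W[r]) for r in range(N)
)
```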
Column-Wise Processing with Cholesky Decomposition
GPTQ processes columns left to right ($j = 1, \dots, K$). After quantizing column $j$, it updates $H^{-1}$ to remove column $j$ from future consideration. The key insight is that this sequence of updates can be performed efficiently using the Cholesky decomposition of $H^{-1}$.
Why Cholesky?
After quantizing column $j$, we need the inverse Hessian of the remaining $(K-j) \times (K-j)$ submatrix. Rather than re-inverting the Hessian at each step ($O(K^3)$ per column), GPTQ precomputes the Cholesky factor of $H^{-1}$ once and reads off the needed values:
```python
def gptq_cholesky(H):
    """Compute the Cholesky factor of H^{-1} for GPTQ.

    Note: GPTQ takes the Cholesky decomposition of H^{-1}, not of H.
    H must be positive definite (apply dampening first).

    Returns:
        H_inv: the inverse Hessian
        U: upper-triangular factor such that H^{-1} = U^T @ U
    """
    H_inv = torch.linalg.inv(H)
    try:
        U = torch.linalg.cholesky(H_inv, upper=True)
    except RuntimeError:
        # If H_inv is not numerically positive definite, add jitter and retry
        H_inv += 1e-5 * torch.eye(H_inv.shape[0], device=H_inv.device)
        U = torch.linalg.cholesky(H_inv, upper=True)
    return H_inv, U
```
The diagonal elements of $H^{-1}$ (restricted to the not-yet-quantized columns) appear along the Cholesky factor and give exactly the $[H^{-1}]_{jj}$ values needed for the update formula; the off-diagonal elements of each row of $U$ give the compensation coefficients.
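For the first column this is easy to verify directly: since $H^{-1} = U^\top U$ with $U$ upper triangular, the first row of $U$ is proportional to the first row of $H^{-1}$, so the compensation ratios $[H^{-1}]_{0k}/[H^{-1}]_{00}$ can be read straight off $U$:

```python
import torch

torch.manual_seed(0)
K = 8
A = torch.randn(64, K, dtype=torch.float64)
H = A.T @ A + 0.01 * torch.eye(K, dtype=torch.float64)  # dampened, PD

H_inv = torch.linalg.inv(H)
U = torch.linalg.cholesky(H_inv, upper=True)

# (U^T U)[0, k] = U[0, 0] * U[0, k] because U is upper triangular, so
# row 0 of U equals row 0 of H_inv up to the scalar U[0, 0].
ratios_from_H_inv = H_inv[0] / H_inv[0, 0]
ratios_from_U = U[0] / U[0, 0]
```

For later columns the same values appear in the rows of $U$ after the leading rows have, in effect, eliminated the already-quantized columns.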
The Full GPTQ Algorithm
```python
def gptq_quantize_weight(
    W,                 # (N, K) weight matrix
    H,                 # (K, K) Hessian
    bits=4,
    group_size=128,
    damp_pct=0.01,
    act_order=False,
):
    """Full GPTQ quantization of a weight matrix.

    Args:
        W: FP32 weight matrix, shape (N, K)
        H: Hessian X^T X, shape (K, K)
        bits: quantization bits
        group_size: per-group quantization group size (-1 for per-channel)
        damp_pct: dampening percentage for the Hessian
        act_order: if True, process columns in descending Hessian diagonal order
    Returns:
        W_q: quantized weight matrix, shape (N, K), dtype int8
        scales: per-group scales, shape (N, num_groups)
        errors: per-row accumulated quantization loss
    """
    N, K = W.shape
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))

    # Add dampening
    H = H.clone()
    diag_mean = torch.diag(H).mean()
    H += damp_pct * diag_mean * torch.eye(K, device=H.device)

    # Column ordering
    if act_order:
        # Process columns with the largest Hessian diagonal first
        # (most important columns get the most compensation headroom)
        perm = torch.argsort(torch.diag(H), descending=True)
        W = W[:, perm]
        H = H[perm][:, perm]
    else:
        perm = torch.arange(K, device=W.device)

    # Compute H^{-1} via Cholesky
    H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))

    # Work on a copy
    W_work = W.clone()
    W_q = torch.zeros(N, K, dtype=torch.int8, device=W.device)
    num_groups = (K + group_size - 1) // group_size if group_size > 0 else 1
    scales = torch.zeros(N, num_groups, device=W.device)
    errors = torch.zeros(N, device=W.device)

    # Process columns left to right
    for j in range(K):
        if group_size > 0:
            group_idx = j // group_size
            # At the start of a new group, compute the group scale
            if j % group_size == 0:
                group_end = min(j + group_size, K)
                group_max = W_work[:, j:group_end].abs().amax(dim=1)
                group_scale = (group_max / qmax).clamp(min=1e-10)
                scales[:, group_idx] = group_scale
        elif j == 0:
            # Per-channel: one scale per output row
            group_scale = (W_work.abs().amax(dim=1) / qmax).clamp(min=1e-10)
            scales[:, 0] = group_scale

        # Quantize column j
        w_j = W_work[:, j]                                   # (N,)
        q_j = (w_j / group_scale).round().clamp(qmin, qmax)
        W_q[:, j] = q_j.to(torch.int8)
        w_hat_j = q_j * group_scale                          # dequantized value
        delta_j = w_j - w_hat_j                              # quantization error

        # OBS update: w[:, k] -= delta_j * H_inv[k, j] / H_inv[j, j]
        h_inv_jj = H_inv[j, j]
        if h_inv_jj > 0:
            if j < K - 1:
                W_work[:, j + 1:] -= (
                    delta_j.unsqueeze(1) * H_inv[j, j + 1:].unsqueeze(0) / h_inv_jj
                )
            # Accumulate the predicted loss increase
            errors += delta_j ** 2 / (2 * h_inv_jj)

    # Undo the permutation if act_order was used.  Note: scales still refer
    # to groups of *permuted* columns; real kernels carry a g_idx map for this.
    if act_order:
        inv_perm = torch.argsort(perm)
        W_q = W_q[:, inv_perm]

    return W_q, scales, errors
```
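A consumer of this output must dequantize with the matching group layout. A minimal sketch, assuming the (W_q, scales) layout produced above with act_order=False; for symmetric per-group quantization the element-wise error is bounded by half a quantization step:

```python
import torch

def dequantize_grouped(W_q, scales, group_size):
    """Reconstruct float weights from integer codes and per-group scales.

    W_q: (N, K) integer codes; scales: (N, num_groups).
    Column j belongs to group j // group_size.
    """
    N, K = W_q.shape
    g_idx = torch.arange(K) // group_size      # (K,) column -> group map
    return W_q.float() * scales[:, g_idx]      # broadcast one scale per column

# Round-trip check with the same symmetric scheme used above
torch.manual_seed(0)
N, K, group_size, bits = 4, 32, 8, 4
qmax = 2 ** (bits - 1) - 1
W = torch.randn(N, K)

num_groups = K // group_size
scales = torch.zeros(N, num_groups)
W_q = torch.zeros(N, K)
for g in range(num_groups):
    sl = slice(g * group_size, (g + 1) * group_size)
    s = (W[:, sl].abs().amax(dim=1) / qmax).clamp(min=1e-10)
    scales[:, g] = s
    W_q[:, sl] = (W[:, sl] / s.unsqueeze(1)).round().clamp(-qmax - 1, qmax)

W_hat = dequantize_grouped(W_q, scales, group_size)
max_err = (W - W_hat).abs().max()
```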
Lazy Batch Updates
The column-by-column update W_work[:, j+1:] -= compensation is expensive: it touches all remaining columns for every quantized column. GPTQ uses “lazy batch” updates to amortize this cost.
Instead of updating the working weights after every column, GPTQ accumulates the updates and applies them in batches:
```python
def gptq_quantize_lazy_batch(
    W, H, bits=4, group_size=128, damp_pct=0.01,
    block_size=128,  # lazy batch size
):
    """GPTQ with lazy batch updates for efficiency.

    Instead of updating all of W after every column, accumulate the
    per-column errors and propagate them past the current block only
    once per block. This reduces the number of large matrix updates
    from K to K / block_size.
    """
    N, K = W.shape
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))

    H = H.clone()
    diag_mean = torch.diag(H).mean()
    H += damp_pct * diag_mean * torch.eye(K, device=H.device)
    H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))

    W_work = W.clone()
    W_q = torch.zeros_like(W)
    num_groups = (K + group_size - 1) // group_size
    scales = torch.zeros(N, num_groups, device=W.device)

    # Process in blocks
    for block_start in range(0, K, block_size):
        block_end = min(block_start + block_size, K)
        bsize = block_end - block_start

        # The slice of H^{-1} needed inside this block
        H_inv_block = H_inv[block_start:block_end, block_start:block_end]
        # Per-column errors, propagated past the block once at the end
        block_errors = torch.zeros(N, bsize, device=W.device)

        for j_local in range(bsize):
            j = block_start + j_local

            # Group scale at group boundaries
            if group_size > 0 and j % group_size == 0:
                gi = j // group_size
                ge = min(j + group_size, K)
                g_max = W_work[:, j:ge].abs().amax(dim=1)
                g_scale = (g_max / qmax).clamp(min=1e-10)
                scales[:, gi] = g_scale

            # Quantize column j
            w_j = W_work[:, j]
            q_j = (w_j / g_scale).round().clamp(qmin, qmax)
            W_q[:, j] = q_j
            delta_j = w_j - q_j * g_scale
            block_errors[:, j_local] = delta_j

            # Update remaining columns WITHIN this block only
            h_jj = H_inv_block[j_local, j_local]
            if h_jj > 0 and j_local < bsize - 1:
                comp = (
                    delta_j.unsqueeze(1)
                    * H_inv_block[j_local, j_local + 1:].unsqueeze(0) / h_jj
                )
                W_work[:, j + 1:block_end] -= comp

        # Propagate the accumulated errors past the block: this is the key
        # lazy batch step, where one matmul replaces bsize rank-1 updates.
        if block_end < K:
            H_inv_cross = H_inv[block_start:block_end, block_end:]  # (bsize, K - block_end)
            H_inv_diag = torch.diag(H_inv_block)                    # (bsize,)
            # Weighted compensation: sum over j in the block of
            # error_j * H_inv[j, k] / H_inv[j, j]
            weighted_errors = block_errors / H_inv_diag.unsqueeze(0)  # (N, bsize)
            W_work[:, block_end:] -= weighted_errors @ H_inv_cross

    return W_q, scales
```
Without lazy batching, GPTQ performs K large matrix updates, each touching up to K columns. With batch size B=128, the number of large updates reduces from K to K/B. For K=4096 and B=128, this is a 32x reduction in the number of matrix operations, reducing quantization time from hours to minutes.
Act-Order: Processing Columns by Importance
The act_order (activation order) option reorders columns so that the most important columns (those with the largest Hessian diagonal, meaning the highest activation magnitudes) are quantized first.
Why does order matter? The OBS compensation mechanism is one-directional: when we quantize column $j$, we compensate by adjusting columns $k > j$. The earlier a column is processed, the more remaining columns are available for compensation.
```python
def gptq_with_act_order(W, H, bits=4, group_size=128, damp_pct=0.01):
    """GPTQ with act_order (column reordering by importance).

    Process columns in descending order of H[j, j] (the Hessian diagonal).
    Columns with larger H[j, j] see higher activation magnitudes and are
    quantized first, while the most compensation capacity is available.
    """
    N, K = W.shape

    # Add dampening
    H = H.clone()
    H += damp_pct * torch.diag(H).mean() * torch.eye(K, device=H.device)

    # Column importance = diagonal of the Hessian
    col_importance = torch.diag(H)  # (K,)
    # Sort: most important first
    perm = torch.argsort(col_importance, descending=True)
    inv_perm = torch.argsort(perm)

    # Reorder weights and Hessian
    W_perm = W[:, perm].clone()
    H_perm = H[perm][:, perm].clone()

    # Run GPTQ on the reordered matrix
    W_q_perm, scales = gptq_quantize_lazy_batch(
        W_perm, H_perm, bits=bits,
        group_size=group_size, damp_pct=0,  # already dampened
    )

    # Un-permute the quantized weights (scales remain in permuted group order)
    W_q = W_q_perm[:, inv_perm]
    return W_q, scales, perm
```
When act_order is enabled, the quantized weight columns are grouped in a different order than the original. The dequantization kernel must know the permutation (typically shipped as a g_idx tensor) to correctly map between the quantized storage and the linear algebra. The original Marlin kernel does not support act_order, so models quantized with act_order fall back to slower ExLlama kernels in vLLM. For deployment with Marlin, use act_order=False.
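The bookkeeping can be sketched as a per-column group map, mirroring the g_idx layout that AutoGPTQ-style checkpoints store (the helper name here is hypothetical): each original column records which permuted group its scale lives in, so dequantization never has to physically reorder columns.

```python
import torch

def make_g_idx(perm, group_size):
    """g_idx[orig_col] = group index of that column in the permuted ordering."""
    K = perm.shape[0]
    g_idx = torch.empty(K, dtype=torch.long)
    g_idx[perm] = torch.arange(K) // group_size
    return g_idx

# Toy check: dequantizing via g_idx matches permute -> dequantize -> unpermute
torch.manual_seed(0)
N, K, group_size = 3, 8, 4
perm = torch.randperm(K)
inv_perm = torch.argsort(perm)

W_q_perm = torch.randint(-8, 8, (N, K)).float()   # codes in permuted order
scales = torch.rand(N, K // group_size) + 0.1     # one scale per permuted group

g_idx = make_g_idx(perm, group_size)
W_q = W_q_perm[:, inv_perm]                        # codes back in original order

via_g_idx = W_q * scales[:, g_idx]
via_permute = (W_q_perm * scales[:, torch.arange(K) // group_size])[:, inv_perm]
```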
Numerical Stability: Dampening and Cholesky
The Hessian can be singular or nearly singular if the calibration data does not excite all directions in activation space. This causes $H^{-1}$ to have very large entries, leading to numerically unstable compensation updates.
```python
def analyze_hessian_conditioning(H):
    """Analyze the conditioning of the Hessian for GPTQ stability."""
    eigenvalues = torch.linalg.eigvalsh(H)  # sorted ascending
    condition_number = eigenvalues[-1] / eigenvalues[0].clamp(min=1e-20)
    num_near_zero = (eigenvalues < eigenvalues[-1] * 1e-6).sum().item()
    rank = (eigenvalues > eigenvalues[-1] * 1e-10).sum().item()
    print(f"  Condition number: {condition_number:.2e}")
    print(f"  Rank: {rank} / {H.shape[0]}")
    print(f"  Near-zero eigenvalues: {num_near_zero}")
    print(f"  Min eigenvalue: {eigenvalues[0]:.2e}")
    print(f"  Max eigenvalue: {eigenvalues[-1]:.2e}")
    return condition_number


def recommend_dampening(H):
    """Recommend a dampening percentage based on Hessian conditioning."""
    cond = analyze_hessian_conditioning(H)
    if cond > 1e10:
        return 0.1, "High dampening needed (poorly conditioned)"
    elif cond > 1e6:
        return 0.01, "Standard dampening (typical)"
    else:
        return 0.001, "Light dampening (well conditioned)"
```
Complete GPTQ Implementation
Here is the full GPTQ implementation combining all techniques:
```python
class GPTQQuantizer:
    """Complete GPTQ implementation for a single linear layer."""

    def __init__(self, bits=4, group_size=128, damp_pct=0.01,
                 act_order=False, block_size=128):
        self.bits = bits
        self.group_size = group_size
        self.damp_pct = damp_pct
        self.act_order = act_order
        self.block_size = block_size
        self.qmax = 2 ** (bits - 1) - 1
        self.qmin = -(2 ** (bits - 1))

    def quantize(self, W, H):
        """Quantize weight matrix W using Hessian H.

        Args:
            W: (N, K) float weight matrix
            H: (K, K) Hessian (X^T X / n_tokens)
        Returns:
            W_q: (N, K) quantized weights (int range)
            scales: (N, num_groups) per-group scale factors
            perm: column permutation (identity if act_order=False)
        """
        N, K = W.shape
        device = W.device

        # Dampening
        H = H.clone()
        damp = self.damp_pct * torch.diag(H).mean()
        H += damp * torch.eye(K, device=device)

        # Act-order permutation
        if self.act_order:
            perm = torch.argsort(torch.diag(H), descending=True)
            W = W[:, perm]
            H = H[perm][:, perm]
        else:
            perm = torch.arange(K, device=device)

        # H^{-1} via Cholesky
        try:
            L = torch.linalg.cholesky(H)
        except RuntimeError:
            H += 1e-4 * torch.eye(K, device=device)
            L = torch.linalg.cholesky(H)
        H_inv = torch.cholesky_inverse(L)

        # Allocate outputs
        num_groups = (
            (K + self.group_size - 1) // self.group_size
            if self.group_size > 0 else 1
        )
        W_q = torch.zeros(N, K, device=device)
        scales = torch.zeros(N, num_groups, device=device)
        W_work = W.clone()

        # Process in lazy batches
        for block_start in range(0, K, self.block_size):
            block_end = min(block_start + self.block_size, K)
            bsize = block_end - block_start
            H_inv_block = H_inv[block_start:block_end, block_start:block_end]
            errors = torch.zeros(N, bsize, device=device)

            for j_local in range(bsize):
                j = block_start + j_local

                # Compute the group scale at group boundaries
                if self.group_size > 0 and j % self.group_size == 0:
                    gi = j // self.group_size
                    ge = min(j + self.group_size, K)
                    gmax = W_work[:, j:ge].abs().amax(dim=1)
                    g_scale = (gmax / self.qmax).clamp(min=1e-10)
                    scales[:, gi] = g_scale

                # Quantize column j
                w_j = W_work[:, j]
                q_j = (w_j / g_scale).round().clamp(self.qmin, self.qmax)
                W_q[:, j] = q_j
                delta = w_j - q_j * g_scale
                errors[:, j_local] = delta

                # Intra-block compensation
                h_jj = H_inv_block[j_local, j_local]
                if h_jj > 1e-10 and j_local < bsize - 1:
                    coeff = H_inv_block[j_local, j_local + 1:] / h_jj
                    W_work[:, j + 1:block_end] -= delta.unsqueeze(1) * coeff.unsqueeze(0)

            # Inter-block compensation (lazy batch)
            if block_end < K:
                H_inv_cross = H_inv[block_start:block_end, block_end:]
                H_inv_diag = torch.diag(H_inv_block).clamp(min=1e-10)
                weighted = errors / H_inv_diag.unsqueeze(0)
                W_work[:, block_end:] -= weighted @ H_inv_cross

        # Un-permute (scales remain keyed to the permuted column groups)
        if self.act_order:
            inv_perm = torch.argsort(perm)
            W_q = W_q[:, inv_perm]

        return W_q.to(torch.int8), scales, perm
```
Quantization Time Complexity
```python
def gptq_time_complexity(N, K, flops_per_s=1e12):
    """Analyze GPTQ time complexity.

    Step 1: Hessian computation: O(n * K^2), n = calibration tokens
    Step 2: Cholesky inversion:   O(K^3)
    Step 3: Column-wise quantization with lazy batching:
            - intra-block: O(N * K * B) per block, K/B blocks = O(N * K^2)
            - inter-block: O(N * B * (K - block)) per block  = O(N * K^2)
    Total: O(K^3 + N * K^2)

    For K = 4096, N = 4096: K^3 ~ 69B, N * K^2 ~ 69B -> ~137B operations.
    """
    hessian_ops = K ** 3    # Cholesky + inversion
    quant_ops = N * K ** 2  # column-wise updates
    total = hessian_ops + quant_ops
    return {
        'hessian_ops': hessian_ops,
        'quant_ops': quant_ops,
        'total_ops': total,
        # Achieved throughput is far below peak; ~1 TFLOP/s effective is a
        # rough figure for these memory-bound updates on a GPU.
        'estimated_time_s': total / flops_per_s,
    }


for K in [4096, 8192, 16384]:
    result = gptq_time_complexity(4096, K)
    print(f"  K={K}: total={result['total_ops'] / 1e9:.1f}G ops, "
          f"~{result['estimated_time_s']:.1f}s on GPU")
```
GPTQ Quantization Time by Model Size (A100 GPU)
| Model | Parameters | Layers | Time (INT4 g128) | Time (act_order) |
|---|---|---|---|---|
| Llama-2 7B | 6.7B | 32 | ~8 min | ~12 min |
| Llama-2 13B | 13B | 40 | ~20 min | ~30 min |
| Llama-2 70B | 70B | 80 | ~3 hours | ~4.5 hours |
| Mixtral 8x7B | 47B | 32 x 8 | ~2 hours | ~3 hours |
GPTQ vs AWQ: When to Use Which
GPTQ vs AWQ Comparison
| Aspect | GPTQ | AWQ |
|---|---|---|
| Optimization target | Per-column Hessian compensation | Per-channel activation-aware scaling |
| Requires calibration | Yes (128 samples) | Yes (128 samples) |
| Quantization time (7B) | ~8 min | ~5 min |
| INT4 g128 perplexity | 5.53 | 5.51 |
| act_order support | Yes (better quality, slower kernel) | No |
| Marlin kernel compatible | Yes (without act_order) | Yes |
| Supports asymmetric | Yes | Yes |
| Memory during quantization | High (stores H^{-1}) | Low (grid search) |