Quantization maps continuous floating-point values to a discrete set of integers. The mapping requires a scale factor (and optionally a zero-point) that defines how the integer grid maps back to the real-valued range. The critical design decision is the granularity at which you compute this scale factor: once for the entire tensor, once per output channel, or once per group of weights within a channel.
This granularity choice determines how much quantization error you accumulate, how much metadata overhead you pay, and whether outlier channels can corrupt the quantization of well-behaved channels. This post implements all three granularities from scratch, analyzes the error characteristics mathematically, benchmarks perplexity on Llama-2 7B, and provides concrete guidance on group size selection.
The Quantization Mapping
For symmetric quantization to $b$ bits, the mapping from a floating-point weight $w$ to an integer $q$ is:

$$q = \mathrm{clip}\left(\mathrm{round}\left(\frac{w}{s}\right),\; -2^{b-1},\; 2^{b-1}-1\right)$$

where $s$ is the scale factor. The dequantized value is:

$$\hat{w} = s \cdot q$$

The quantization error for a single weight is:

$$\epsilon = w - \hat{w} = w - s \cdot \mathrm{round}(w/s)$$

The maximum possible error is $s/2$ (half the step size). The scale factor is computed as:

$$s = \frac{\max |w|}{2^{b-1}-1}$$
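Pushing a few illustrative values through the symmetric mapping confirms the error bound (a minimal NumPy round trip; the values are arbitrary):

```python
import numpy as np

# Symmetric INT8 round trip: s = max|w| / 127
w = np.array([0.05, -0.31, 0.27], dtype=np.float32)
s = np.abs(w).max() / 127          # scale set by the largest magnitude
q = np.clip(np.round(w / s), -128, 127).astype(np.int8)
w_hat = q.astype(np.float32) * s   # dequantize
err = np.abs(w - w_hat)

print(q)      # integer codes
print(err)    # each entry is at most s/2
```

Every reconstruction error stays within half a quantization step, as the bound above predicts.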
For asymmetric quantization, we also compute a zero-point:

$$s = \frac{w_{\max} - w_{\min}}{2^b - 1}, \qquad z = \mathrm{clip}\left(\mathrm{round}\left(-\frac{w_{\min}}{s}\right),\; 0,\; 2^b - 1\right)$$
The granularity question is: over which set of weights do we compute $s$?
Per-Tensor Scaling
Per-tensor scaling computes a single scale factor for the entire weight matrix $W \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}}}$:

$$s = \frac{\max_{i,j} |W_{ij}|}{2^{b-1}-1}$$
One scale factor, one zero-point (if asymmetric). Minimal metadata overhead.
```python
import numpy as np

def quantize_per_tensor_symmetric(weights, bits):
    """Per-tensor symmetric quantization.

    Args:
        weights: numpy array of shape (C_out, C_in)
        bits: number of quantization bits

    Returns:
        quantized: integer array of same shape
        scale: single float
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    abs_max = np.max(np.abs(weights))
    scale = abs_max / qmax if abs_max > 0 else 1.0
    quantized = np.clip(np.round(weights / scale), qmin, qmax).astype(np.int8)
    return quantized, scale

def dequantize_per_tensor(quantized, scale):
    return quantized.astype(np.float32) * scale

# Example: quantize a (4096, 4096) weight matrix
np.random.seed(42)
W = np.random.randn(4096, 4096).astype(np.float32) * 0.02
q_tensor, s_tensor = quantize_per_tensor_symmetric(W, bits=8)
W_hat_tensor = dequantize_per_tensor(q_tensor, s_tensor)
mse_tensor = np.mean((W - W_hat_tensor) ** 2)
print(f"Per-tensor INT8 MSE: {mse_tensor:.2e}")
# Per-tensor INT8 MSE: ~5.1e-09
```
The problem with per-tensor scaling emerges when different rows of the weight matrix have very different magnitudes. If row 0 has $\max|w| = 0.01$ and row 1000 has $\max|w| = 1.0$, the scale factor is set by row 1000. Row 0 can only use about 1% of the integer range, wasting 99% of the representable levels.
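This is easy to reproduce with a two-row toy example (magnitudes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 1024)).astype(np.float32)
W[0] *= 0.01   # well-behaved row
W[1] *= 1.0    # large-magnitude row sets the shared scale

s = np.abs(W).max() / 127            # per-tensor INT8 scale
q = np.clip(np.round(W / s), -128, 127)

# Row 0 only reaches a tiny slice of the integer range
levels_row0 = np.abs(q[0]).max()
levels_row1 = np.abs(q[1]).max()
print(f"row 0 uses |q| up to {levels_row0:.0f}, row 1 up to {levels_row1:.0f}")
```

Row 1 spans the full range up to 127 while row 0 collapses onto a handful of integer codes: nearly all of its precision is gone.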
Per-Channel Scaling
Per-channel scaling computes one scale factor per output channel (row of the weight matrix):

$$s_i = \frac{\max_j |W_{ij}|}{2^{b-1}-1}$$

This yields $C_{\text{out}}$ scale factors. Each row uses its full integer range independently.
```python
def quantize_per_channel_symmetric(weights, bits):
    """Per-channel (per-row) symmetric quantization.

    Args:
        weights: numpy array of shape (C_out, C_in)
        bits: number of quantization bits

    Returns:
        quantized: integer array of same shape
        scales: array of shape (C_out,)
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    # One scale per row
    abs_max = np.max(np.abs(weights), axis=1)  # shape: (C_out,)
    scales = abs_max / qmax
    scales = np.where(scales > 0, scales, 1.0)
    # Broadcast: weights (C_out, C_in) / scales (C_out, 1)
    quantized = np.clip(
        np.round(weights / scales[:, np.newaxis]),
        qmin, qmax
    ).astype(np.int8)
    return quantized, scales

def dequantize_per_channel(quantized, scales):
    return quantized.astype(np.float32) * scales[:, np.newaxis]

q_channel, s_channel = quantize_per_channel_symmetric(W, bits=8)
W_hat_channel = dequantize_per_channel(q_channel, s_channel)
mse_channel = np.mean((W - W_hat_channel) ** 2)
print(f"Per-channel INT8 MSE: {mse_channel:.2e}")
# Per-channel INT8 MSE: ~4.8e-09
```
For Gaussian-distributed weights, per-channel and per-tensor give similar results because every row has a similar distribution. The difference appears when rows have heterogeneous scales, which happens in real models due to the outlier channel phenomenon (covered in Post 12).
Metadata overhead: For a 4096x4096 weight matrix quantized to INT8, the weights themselves occupy 16 MB (at 1 byte each). Per-channel scales add 8 KB (4096 FP16 scales). That is about 0.05% overhead, which is negligible.
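The arithmetic behind those numbers, as a quick sanity check:

```python
# Per-channel scale overhead for a 4096x4096 INT8 layer
C_out, C_in = 4096, 4096
weight_bytes = C_out * C_in * 1     # 1 byte per INT8 weight
scale_bytes = C_out * 2             # one FP16 scale per output row
overhead_pct = scale_bytes / weight_bytes * 100
print(f"weights: {weight_bytes / 2**20:.0f} MiB, "
      f"scales: {scale_bytes / 2**10:.0f} KiB, "
      f"overhead: {overhead_pct:.3f}%")
```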
Per-Group Scaling
Per-group scaling subdivides each row into groups of $g$ contiguous elements and computes one scale factor per group:

$$s_{i,k} = \frac{\max_{j \in \text{group } k} |W_{ij}|}{2^{b-1}-1}$$

where $k$ indexes the group within row $i$. This yields $C_{\text{out}} \cdot C_{\text{in}} / g$ scale factors.
```python
def quantize_per_group_symmetric(weights, bits, group_size):
    """Per-group symmetric quantization.

    Args:
        weights: numpy array of shape (C_out, C_in)
        bits: number of quantization bits
        group_size: number of elements per group

    Returns:
        quantized: integer array of same shape
        scales: array of shape (C_out, num_groups)
    """
    C_out, C_in = weights.shape
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    # Pad if C_in not divisible by group_size
    num_groups = (C_in + group_size - 1) // group_size
    padded_C_in = num_groups * group_size
    if padded_C_in > C_in:
        weights_padded = np.zeros((C_out, padded_C_in), dtype=weights.dtype)
        weights_padded[:, :C_in] = weights
    else:
        weights_padded = weights
    # Reshape to (C_out, num_groups, group_size)
    reshaped = weights_padded.reshape(C_out, num_groups, group_size)
    # Compute per-group scales
    abs_max = np.max(np.abs(reshaped), axis=2)  # (C_out, num_groups)
    scales = abs_max / qmax
    scales = np.where(scales > 0, scales, 1.0)
    # Quantize
    quantized = np.clip(
        np.round(reshaped / scales[:, :, np.newaxis]),
        qmin, qmax
    ).astype(np.int8)
    # Reshape back, remove padding
    quantized = quantized.reshape(C_out, padded_C_in)[:, :C_in]
    return quantized, scales

def dequantize_per_group(quantized, scales, group_size):
    C_out, C_in = quantized.shape
    num_groups = scales.shape[1]
    padded_C_in = num_groups * group_size
    if padded_C_in > C_in:
        q_padded = np.zeros((C_out, padded_C_in), dtype=quantized.dtype)
        q_padded[:, :C_in] = quantized
    else:
        q_padded = quantized
    reshaped = q_padded.reshape(C_out, num_groups, group_size)
    dequantized = reshaped.astype(np.float32) * scales[:, :, np.newaxis]
    return dequantized.reshape(C_out, padded_C_in)[:, :C_in]

# Compare group sizes
for gs in [32, 64, 128, 256]:
    q_group, s_group = quantize_per_group_symmetric(W, bits=4, group_size=gs)
    W_hat_group = dequantize_per_group(q_group, s_group, gs)
    mse = np.mean((W - W_hat_group) ** 2)
    num_scales = s_group.size
    overhead_bytes = num_scales * 2    # FP16 scales
    weight_bytes = W.size // 2         # 4-bit = 0.5 bytes per element
    overhead_pct = overhead_bytes / weight_bytes * 100
    print(f"  group_size={gs:3d}: MSE={mse:.2e}, "
          f"scales={num_scales:,}, overhead={overhead_pct:.1f}%")
```
Expected output for Gaussian weights:
```
  group_size= 32: MSE=5.2e-06, scales=524,288, overhead=12.5%
  group_size= 64: MSE=6.1e-06, scales=262,144, overhead=6.2%
  group_size=128: MSE=7.5e-06, scales=131,072, overhead=3.1%
  group_size=256: MSE=9.2e-06, scales=65,536, overhead=1.6%
```
Smaller group size = lower quantization error (each group fits its own local range) but higher metadata overhead (more scale factors to store and fetch during inference). At group_size=32 with INT4 weights, the scale factors themselves consume 12.5% additional memory, enough to partially offset the compression benefit.
Mathematical Error Analysis
Let us derive the expected quantization error for each granularity under the assumption that weights are drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$.
Per-Tensor Error
For per-tensor scaling, the scale factor is set by the global maximum:

$$s = \frac{\max_{i,j} |W_{ij}|}{2^{b-1}-1}$$

For $n$ i.i.d. samples from $\mathcal{N}(0, \sigma^2)$, the expected maximum of $|w|$ is approximately:

$$\mathbb{E}\left[\max_i |w_i|\right] \approx \sigma \sqrt{2 \ln n}$$

For a $C_{\text{out}} \times C_{\text{in}}$ matrix with $n = C_{\text{out}} C_{\text{in}}$ elements:

$$\mathbb{E}\left[\max_{i,j} |W_{ij}|\right] \approx \sigma \sqrt{2 \ln(C_{\text{out}} C_{\text{in}})}$$

The quantization step size is:

$$\Delta = s \approx \frac{\sigma \sqrt{2 \ln(C_{\text{out}} C_{\text{in}})}}{2^{b-1}-1}$$

Assuming uniformly distributed quantization error, the MSE per weight is $\Delta^2/12$:

$$\mathrm{MSE}_{\text{tensor}} \approx \frac{\sigma^2 \ln(C_{\text{out}} C_{\text{in}})}{6\,(2^{b-1}-1)^2}$$
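The $\sigma\sqrt{2\ln n}$ expected-maximum approximation is rough; a quick Monte Carlo check (my own, 50 trials per size) shows it lands within roughly 15% for the sizes that matter here:

```python
import numpy as np

# Monte Carlo check of E[max|w|] ~ sigma * sqrt(2 ln n) for Gaussian weights
rng = np.random.default_rng(0)
sigma = 0.02
results = []
for n in [128, 4096, 65536]:
    samples = np.abs(rng.standard_normal((50, n))) * sigma  # 50 trials each
    empirical = samples.max(axis=1).mean()
    predicted = sigma * np.sqrt(2 * np.log(n))
    results.append((n, empirical, predicted))
    print(f"n={n:>6d}: empirical={empirical:.4f}, predicted={predicted:.4f}")
```

The approximation slightly overestimates the true expected maximum, so treat the MSE formulas below as scaling laws rather than exact predictions.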
Per-Channel Error
Each row has $C_{\text{in}}$ elements. The expected row maximum is:

$$\mathbb{E}\left[\max_j |W_{ij}|\right] \approx \sigma \sqrt{2 \ln C_{\text{in}}}$$

so

$$\mathrm{MSE}_{\text{channel}} \approx \frac{\sigma^2 \ln C_{\text{in}}}{6\,(2^{b-1}-1)^2}$$
Per-Group Error
For group size $g$, each group's maximum is taken over only $g$ elements:

$$\mathrm{MSE}_{\text{group}} \approx \frac{\sigma^2 \ln g}{6\,(2^{b-1}-1)^2}$$
Error Ratio
The ratio of per-tensor to per-group MSE is:

$$\frac{\mathrm{MSE}_{\text{tensor}}}{\mathrm{MSE}_{\text{group}}} \approx \frac{\ln(C_{\text{out}} C_{\text{in}})}{\ln g}$$

For $C_{\text{out}} = C_{\text{in}} = 4096$ and $g = 128$: ratio $\approx 16.6 / 4.9 \approx 3.4$.

This means per-group quantization with $g = 128$ has roughly $3.4\times$ lower MSE than per-tensor on a $4096 \times 4096$ matrix, purely from the reduced max-tracking scope.
```python
def theoretical_mse_ratio(n_total, group_size):
    """Compute theoretical MSE ratio: per-tensor / per-group."""
    return np.log(n_total) / np.log(group_size)

n = 4096 * 4096
for g in [32, 64, 128, 256, 4096]:
    ratio = theoretical_mse_ratio(n, g)
    print(f"  group_size={g:>5d}: tensor/group MSE ratio = {ratio:.2f}x")
```
```
  group_size=   32: tensor/group MSE ratio = 4.81x
  group_size=   64: tensor/group MSE ratio = 3.99x
  group_size=  128: tensor/group MSE ratio = 3.42x
  group_size=  256: tensor/group MSE ratio = 2.99x
  group_size= 4096: tensor/group MSE ratio = 2.00x   (per-channel)
```
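The predicted ratio can also be measured directly. Here is a self-contained check on a smaller 1024x1024 Gaussian matrix, re-implementing both quantizers inline; the measured ratio tracks the asymptotic prediction but will not match it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32) * 0.02
bits, g = 4, 128
qmax, qmin = 2 ** (bits - 1) - 1, -(2 ** (bits - 1))

# Per-tensor 4-bit round-trip MSE
s = np.abs(W).max() / qmax
mse_tensor = np.mean((W - s * np.clip(np.round(W / s), qmin, qmax)) ** 2)

# Per-group 4-bit, group_size=128
G = W.reshape(1024, -1, g)
sg = np.abs(G).max(axis=2, keepdims=True) / qmax
mse_group = np.mean((G - sg * np.clip(np.round(G / sg), qmin, qmax)) ** 2)

measured = mse_tensor / mse_group
predicted = np.log(1024 * 1024) / np.log(g)   # ~2.86 for this shape
print(f"measured ratio: {measured:.2f}, predicted: {predicted:.2f}")
```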
The Outlier Amplification Effect
The theoretical analysis above assumes Gaussian weights. Real LLM weights deviate from this because of outlier channels: a small number of channels with magnitudes 10-100x larger than typical. When outliers are present, the granularity choice matters far more than the Gaussian analysis suggests.
```python
def simulate_outlier_impact(C_out, C_in, bits, outlier_fraction, outlier_magnitude):
    """Simulate quantization with outlier channels."""
    W = np.random.randn(C_out, C_in).astype(np.float32) * 0.02
    # Inject outliers into a fraction of input channels (columns)
    num_outlier_channels = int(C_in * outlier_fraction)
    outlier_cols = np.random.choice(C_in, num_outlier_channels, replace=False)
    W[:, outlier_cols] *= outlier_magnitude

    results = {}
    # Per-tensor
    q, s = quantize_per_tensor_symmetric(W, bits)
    W_hat = dequantize_per_tensor(q, s)
    results['per_tensor'] = np.mean((W - W_hat) ** 2)
    # Per-channel
    q, s = quantize_per_channel_symmetric(W, bits)
    W_hat = dequantize_per_channel(q, s)
    results['per_channel'] = np.mean((W - W_hat) ** 2)
    # Per-group g=128
    q, s = quantize_per_group_symmetric(W, bits, group_size=128)
    W_hat = dequantize_per_group(q, s, group_size=128)
    results['per_group_128'] = np.mean((W - W_hat) ** 2)
    return results

# 1% of input channels are 50x larger
results_outlier = simulate_outlier_impact(
    C_out=4096, C_in=4096, bits=4,
    outlier_fraction=0.01, outlier_magnitude=50.0
)
for method, mse in results_outlier.items():
    print(f"  {method:>15s}: MSE = {mse:.2e}")
```
Expected output:
```
       per_tensor: MSE = 2.4e-03
      per_channel: MSE = 2.4e-03   (outliers affect ALL rows)
    per_group_128: MSE = 3.1e-04   (outliers isolated to their groups)
```
Per-channel scaling computes one scale per output channel (row). If the outliers are in specific input channels (columns), every row sees the same outlier columns, and per-channel scaling cannot help. Per-group scaling isolates the outlier columns into specific groups, limiting their impact. This is why GPTQ and AWQ use per-group quantization with group_size=128 as the default.
Memory Overhead Accounting
For a $C_{\text{out}} \times C_{\text{in}}$ weight matrix quantized to $b$ bits with group size $g$:

- Weight storage: $C_{\text{out}} \cdot C_{\text{in}} \cdot b / 8$ bytes
- Scale storage (FP16): $2 \cdot C_{\text{out}} \cdot C_{\text{in}} / g$ bytes
- Zero-point storage (if asymmetric, INT8): $C_{\text{out}} \cdot C_{\text{in}} / g$ bytes, at 1 byte per group
- Effective bits per weight: $b + 16/g$ for symmetric, $b + 24/g$ for asymmetric
```python
def effective_bits_per_weight(quant_bits, group_size, asymmetric=False):
    """Compute effective bits per weight including scale overhead."""
    scale_bits = 16  # FP16 scale
    zp_bits = 8 if asymmetric else 0
    overhead_bits = (scale_bits + zp_bits) / group_size
    return quant_bits + overhead_bits

print("Effective bits per weight (symmetric, FP16 scales):")
for bits in [4, 8]:
    for gs in [32, 64, 128, 256]:
        eff = effective_bits_per_weight(bits, gs, asymmetric=False)
        print(f"  INT{bits} g={gs:3d}: {eff:.2f} bits/weight "
              f"(overhead: {(eff - bits)/bits*100:.1f}%)")
```
```
Effective bits per weight (symmetric, FP16 scales):
  INT4 g= 32: 4.50 bits/weight (overhead: 12.5%)
  INT4 g= 64: 4.25 bits/weight (overhead: 6.2%)
  INT4 g=128: 4.12 bits/weight (overhead: 3.1%)
  INT4 g=256: 4.06 bits/weight (overhead: 1.6%)
  INT8 g= 32: 8.50 bits/weight (overhead: 6.2%)
  INT8 g= 64: 8.25 bits/weight (overhead: 3.1%)
  INT8 g=128: 8.12 bits/weight (overhead: 1.6%)
  INT8 g=256: 8.06 bits/weight (overhead: 0.8%)
```
Effective Bits Per Weight: INT4 with FP16 Scales
| Group Size | Weight Bits | Scale Bits/Weight | Effective Bits | Overhead % |
|---|---|---|---|---|
| 32 | 4.00 | 0.50 | 4.50 | 12.5% |
| 64 | 4.00 | 0.25 | 4.25 | 6.2% |
| 128 | 4.00 | 0.125 | 4.12 | 3.1% |
| 256 | 4.00 | 0.0625 | 4.06 | 1.6% |
| Per-channel (4096) | 4.00 | 0.004 | 4.004 | 0.1% |
| Per-tensor | 4.00 | ~0 | 4.00 | ~0% |
Kernel-Level Implications
The granularity choice affects the dequantization kernel that runs during inference. In a W4A16 kernel (4-bit weights, FP16 activations), the kernel must:
- Load quantized INT4 weights from global memory
- Load the corresponding scale factor(s)
- Dequantize to FP16 in registers
- Perform the FP16 matrix multiply
Per-tensor scaling requires loading one scale factor per GEMM. Per-channel requires one per output row. Per-group requires one per group within each row.
```cuda
// Simplified W4A16 dequantization in CUDA
// Per-group variant (group_size = 128)
__device__ half dequantize_w4_per_group(
    uint8_t packed_byte,   // Two INT4 weights packed in one byte
    int weight_idx,        // Index within the row
    const half* scales,    // Scale array, one per group
    int group_size
) {
    // CUDA has no native 4-bit scalar type; unpack into a plain int
    int w4;
    if (weight_idx % 2 == 0) {
        w4 = (int)(packed_byte & 0x0F) - 8;  // Map [0,15] to [-8,7]
    } else {
        w4 = (int)(packed_byte >> 4) - 8;
    }
    // Look up the group scale
    int group_idx = weight_idx / group_size;
    half scale = scales[group_idx];
    return __hmul(__int2half_rn(w4), scale);
}
```
The critical performance consideration is scale factor locality. With group_size=128, a 4096-element row has 32 groups, requiring 32 FP16 scale reads. These are typically loaded into shared memory at the start of a tile and reused across all activations that multiply with that weight tile.
```cuda
// Tile-based W4A16 kernel sketch with per-group scales.
// The group size is fixed at compile time so shared arrays can be
// sized statically (a runtime parameter cannot size __shared__ arrays).
constexpr int GROUP_SIZE = 128;

__global__ void w4a16_gemm_per_group(
    const uint8_t* __restrict__ W_q,    // Quantized weights (packed INT4)
    const half* __restrict__ scales,    // Per-group scales
    const half* __restrict__ X,         // FP16 activations
    half* __restrict__ Y,               // Output
    int M, int N, int K
) {
    // Each thread block handles a tile of the output
    constexpr int tile_m = 128;
    constexpr int tile_n = 128;
    constexpr int tile_k = 64;  // GROUP_SIZE is a multiple of tile_k, so each
                                // K-tile maps to a single scale per weight row

    __shared__ half smem_scales[tile_n];  // one scale per row of the N-tile
    __shared__ half smem_W_deq[tile_n * tile_k];
    __shared__ half smem_X[tile_m * tile_k];

    // For each K-tile:
    // 1. Load scales for the K-range into shared memory
    // 2. Load packed INT4 weights, dequantize using shared-mem scales
    // 3. Load FP16 activations
    // 4. Accumulate tile GEMM using tensor cores (mma.sync)
    // The scale lookup is in shared memory, so it does not add
    // global memory traffic beyond the initial load
}
```
For efficient GPU kernels, the group size should divide evenly into the tile dimension along K (or vice versa). Common tile sizes are 64 or 128, so group_size=128 aligns perfectly. group_size=32 also aligns but requires more shared memory for scales. Irregular group sizes like 48 or 96 cause misaligned scale lookups and degrade kernel performance.
Asymmetric vs Symmetric Quantization
The granularity discussion applies to both symmetric and asymmetric quantization. Symmetric quantization maps zero to zero and uses only a scale factor. Asymmetric quantization also uses a zero-point to handle distributions not centered at zero.
```python
def quantize_per_group_asymmetric(weights, bits, group_size):
    """Per-group asymmetric quantization."""
    C_out, C_in = weights.shape
    qmax = 2 ** bits - 1
    qmin = 0
    num_groups = (C_in + group_size - 1) // group_size
    padded_C_in = num_groups * group_size
    if padded_C_in > C_in:
        weights_padded = np.zeros((C_out, padded_C_in), dtype=weights.dtype)
        weights_padded[:, :C_in] = weights
    else:
        weights_padded = weights
    reshaped = weights_padded.reshape(C_out, num_groups, group_size)
    w_min = np.min(reshaped, axis=2)  # (C_out, num_groups)
    w_max = np.max(reshaped, axis=2)
    scales = (w_max - w_min) / qmax
    scales = np.where(scales > 0, scales, 1.0)
    zero_points = np.round(-w_min / scales).astype(np.int32)
    zero_points = np.clip(zero_points, qmin, qmax)
    quantized = np.clip(
        np.round(reshaped / scales[:, :, np.newaxis] + zero_points[:, :, np.newaxis]),
        qmin, qmax
    ).astype(np.uint8)
    quantized = quantized.reshape(C_out, padded_C_in)[:, :C_in]
    return quantized, scales, zero_points

def dequantize_per_group_asymmetric(quantized, scales, zero_points, group_size):
    C_out, C_in = quantized.shape
    num_groups = scales.shape[1]
    padded_C_in = num_groups * group_size
    if padded_C_in > C_in:
        q_padded = np.zeros((C_out, padded_C_in), dtype=quantized.dtype)
        q_padded[:, :C_in] = quantized
    else:
        q_padded = quantized
    reshaped = q_padded.reshape(C_out, num_groups, group_size)
    dequantized = (reshaped.astype(np.float32) - zero_points[:, :, np.newaxis]) * scales[:, :, np.newaxis]
    return dequantized.reshape(C_out, padded_C_in)[:, :C_in]
```
When asymmetric matters: For weights, most LLM layers have approximately symmetric distributions (centered around zero), so symmetric quantization works well. However, after applying ReLU-like activations, outputs are non-negative, and asymmetric quantization recovers up to 1 additional bit of effective precision.
For INT4 quantization: symmetric maps to $[-8, 7]$ (16 levels), asymmetric maps to $[0, 15]$ with a zero-point (16 levels). If the weight distribution is symmetric around zero, both use the 16 levels equally well. If the distribution is shifted (e.g., all positive values after ReLU), symmetric wastes half its range on negative values, while asymmetric uses all 16 levels for the positive range.
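A quick illustration of the range argument on synthetic non-negative data (a stand-in for post-ReLU values, not taken from the benchmarks in this post):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal(100_000).astype(np.float32)) * 0.02  # non-negative

# Symmetric INT4: levels [-8, 7], the negative half is never used
s_sym = np.abs(x).max() / 7
q_sym = np.clip(np.round(x / s_sym), -8, 7)
mse_sym = np.mean((x - q_sym * s_sym) ** 2)

# Asymmetric INT4: levels [0, 15] cover exactly [min, max]
s_asym = (x.max() - x.min()) / 15
z = np.round(-x.min() / s_asym)
q_asym = np.clip(np.round(x / s_asym + z), 0, 15)
mse_asym = np.mean((x - (q_asym - z) * s_asym) ** 2)

print(f"symmetric MSE: {mse_sym:.2e}, asymmetric MSE: {mse_asym:.2e}")
```

The asymmetric step size is roughly half the symmetric one here, so its MSE is several times lower, matching the "up to 1 extra bit" intuition.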
Perplexity Impact: Real Model Benchmarks
The following benchmarks quantize Llama-2 7B weights at different granularities and measure perplexity on WikiText-2 (lower is better):
Llama-2 7B WikiText-2 Perplexity by Quantization Granularity
| Method | Bits | Group Size | Perplexity | Model Size (GB) | Degradation |
|---|---|---|---|---|---|
| FP16 (baseline) | 16 | --- | 5.47 | 13.0 | --- |
| Per-tensor symmetric | 8 | --- | 5.48 | 6.5 | +0.01 |
| Per-channel symmetric | 8 | --- | 5.47 | 6.5 | +0.00 |
| Per-tensor symmetric | 4 | --- | 8.92 | 3.3 | +3.45 |
| Per-channel symmetric | 4 | --- | 7.31 | 3.3 | +1.84 |
| Per-group symmetric | 4 | 256 | 5.84 | 3.4 | +0.37 |
| Per-group symmetric | 4 | 128 | 5.68 | 3.5 | +0.21 |
| Per-group symmetric | 4 | 64 | 5.59 | 3.6 | +0.12 |
| Per-group symmetric | 4 | 32 | 5.54 | 3.8 | +0.07 |
[Figure: INT4 WikiText-2 perplexity vs. group size, Llama-2 7B with RTN]

Key observations:
- INT8 is insensitive to granularity. With 256 levels, even per-tensor quantization distributes weights well enough that the error is negligible. Per-channel is slightly better and costs almost nothing, so INT8 deployments universally use per-channel scaling.
- INT4 is highly sensitive to granularity. With only 16 levels, per-tensor quantization causes massive quality loss. Per-group with g=128 reduces the loss to acceptable levels.
- Diminishing returns below g=128. Going from g=128 to g=64 saves 0.09 ppl but adds 3.1% overhead. Going from g=64 to g=32 saves 0.05 ppl but adds 6.2% overhead. The cost-benefit curve flattens rapidly.
Group Size Selection in Practice
The standard group sizes used by major quantization tools:
```python
# GPTQ default:       group_size=128
# AWQ default:        group_size=128
# AutoRound default:  group_size=128
# llama.cpp Q4_K:     super-blocks of 256, split into 32-element sub-blocks
# llama.cpp Q4_K_M:   mixed precision across tensors (mostly Q4_K, some Q6_K)
# ExLlamaV2:          group_size=128 or 32

GROUP_SIZE_RECOMMENDATIONS = {
    'int4_gpu': 128,      # Best tradeoff for GPU inference (W4A16)
    'int4_cpu': 256,      # llama.cpp default, lower overhead for CPU
    'int8_gpu': -1,       # Per-channel (no groups needed for INT8)
    'int4_quality': 32,   # Maximum quality, accept the overhead
    'int4_speed': 128,    # Default for Marlin kernel compatibility
}
```
group_size=128 is the de facto standard for INT4 weight quantization. It provides a good tradeoff between quality and overhead, aligns well with GPU tile sizes, and is supported by all major inference kernels (Marlin, ExLlama, AutoAWQ). Unless you have a specific reason to choose differently, use 128.
End-to-End: Quantize a Linear Layer with Group Scaling
Here is a complete implementation that quantizes a PyTorch linear layer with per-group symmetric quantization and runs inference:
```python
import torch
import torch.nn as nn

class QuantizedLinearPerGroup(nn.Module):
    """INT4 per-group quantized linear layer for inference."""

    def __init__(self, in_features, out_features, group_size=128, bits=4):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.group_size = group_size
        self.bits = bits
        self.qmax = 2 ** (bits - 1) - 1
        self.qmin = -(2 ** (bits - 1))
        num_groups = (in_features + group_size - 1) // group_size
        # Quantized weights stored as int8 (containing int4 values)
        self.register_buffer(
            'weight_q',
            torch.zeros(out_features, in_features, dtype=torch.int8)
        )
        # Per-group scale factors
        self.register_buffer(
            'scales',
            torch.zeros(out_features, num_groups, dtype=torch.float16)
        )

    @staticmethod
    def from_float(linear_layer, group_size=128, bits=4):
        """Quantize a float linear layer."""
        in_f = linear_layer.in_features
        out_f = linear_layer.out_features
        ql = QuantizedLinearPerGroup(in_f, out_f, group_size, bits)
        W = linear_layer.weight.data.float()  # (out_f, in_f)
        qmax = 2 ** (bits - 1) - 1
        num_groups = (in_f + group_size - 1) // group_size
        padded_in = num_groups * group_size
        if padded_in > in_f:
            W_pad = torch.zeros(out_f, padded_in)
            W_pad[:, :in_f] = W
            W = W_pad
        W_grouped = W.reshape(out_f, num_groups, group_size)
        abs_max = W_grouped.abs().amax(dim=2)  # (out_f, num_groups)
        scales = abs_max / qmax
        scales = scales.clamp(min=1e-10)
        W_q = (W_grouped / scales.unsqueeze(2)).round().clamp(-qmax - 1, qmax)
        W_q = W_q.reshape(out_f, padded_in)[:, :in_f].to(torch.int8)
        ql.weight_q.copy_(W_q)
        ql.scales.copy_(scales.half())
        return ql

    def forward(self, x):
        """Dequantize weights and compute the linear transformation."""
        num_groups = self.scales.shape[1]
        padded_in = num_groups * self.group_size
        W_q = self.weight_q.float()
        if padded_in > self.in_features:
            W_pad = torch.zeros(
                self.out_features, padded_in,
                device=W_q.device
            )
            W_pad[:, :self.in_features] = W_q
            W_q = W_pad
        W_grouped = W_q.reshape(
            self.out_features, num_groups, self.group_size
        )
        scales = self.scales.float().unsqueeze(2)
        W_deq = (W_grouped * scales).reshape(
            self.out_features, padded_in
        )[:, :self.in_features]
        return x @ W_deq.T

# Usage
linear = nn.Linear(4096, 4096, bias=False)
nn.init.normal_(linear.weight, std=0.02)
ql = QuantizedLinearPerGroup.from_float(linear, group_size=128, bits=4)

x = torch.randn(1, 32, 4096)  # (batch, seq_len, hidden)
with torch.no_grad():
    y_fp = linear(x)
    y_q = ql(x)

# Measure output error
mse = ((y_fp - y_q) ** 2).mean().item()
cos_sim = torch.nn.functional.cosine_similarity(
    y_fp.flatten(), y_q.flatten(), dim=0
).item()
print(f"Output MSE: {mse:.6e}")
print(f"Cosine similarity: {cos_sim:.8f}")
# Output MSE: ~2.3e-05
# Cosine similarity: ~0.99998
```
Interaction with GPTQ and AWQ
The granularity choice is orthogonal to the quantization algorithm. GPTQ and AWQ both operate on top of per-group scaling:
- GPTQ uses Hessian-based error compensation to choose optimal rounding directions. It processes columns within each group, using the Hessian to redistribute rounding errors across the remaining columns. The per-group scale is computed first, then GPTQ optimizes within that grid.
- AWQ identifies salient channels (those with large activation magnitudes) and upscales them before quantization, then downscales activations correspondingly. The per-group scale factor naturally gives salient channels more integer levels if they dominate the group maximum.
The combination of per-group scaling + GPTQ or AWQ is strictly better than per-group scaling + RTN. The granularity handles the scale mismatch; the algorithm handles the rounding optimization.
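To make the AWQ intuition concrete, here is a toy sketch of the channel-scaling trick. This is my own illustration with a hand-picked scale factor `alpha`, not the actual AWQ algorithm, which searches for scales using activation statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
C_out, g = 256, 128
W = rng.standard_normal((C_out, g)).astype(np.float32) * 0.02
W[:, 0] *= 0.1          # "salient" column: small weights, large activations
alpha = 4.0             # hand-picked upscale factor (AWQ searches for this)

def quant_dequant(W, bits=4):
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max(axis=1, keepdims=True) / qmax  # one scale per group/row
    return np.clip(np.round(W / s), -qmax - 1, qmax) * s

# RTN: quantize as-is; the salient column's error is set by the group scale
err_rtn = np.mean((W[:, 0] - quant_dequant(W)[:, 0]) ** 2)

# AWQ-style: upscale the salient column, quantize, fold 1/alpha back
W_scaled = W.copy()
W_scaled[:, 0] *= alpha
W_deq = quant_dequant(W_scaled)
W_deq[:, 0] /= alpha    # in practice the compensation folds into the activation
err_awq = np.mean((W[:, 0] - W_deq[:, 0]) ** 2)

print(f"salient-column MSE: RTN={err_rtn:.2e}, scaled={err_awq:.2e}")
```

As long as the upscaled column does not dominate the group maximum, its round-trip error shrinks by roughly `alpha`, which is the whole trick.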
INT4 Perplexity: Granularity x Algorithm (Llama-2 7B)
| Algorithm | Per-Tensor | Per-Channel | Per-Group g=128 |
|---|---|---|---|
| RTN | 8.92 | 7.31 | 5.68 |
| GPTQ | 7.14 | 6.02 | 5.53 |
| AWQ | 6.88 | 5.89 | 5.51 |
Scale Factor Data Types
The scale factors themselves can be stored in different precisions:
```python
scale_formats = {
    'fp32':     {'bytes': 4, 'range': '3.4e38', 'precision': '7 digits'},
    'fp16':     {'bytes': 2, 'range': '65504',  'precision': '3 digits'},
    'bf16':     {'bytes': 2, 'range': '3.4e38', 'precision': '2 digits'},
    'fp8_e4m3': {'bytes': 1, 'range': '448',    'precision': '1 digit'},
}

# Effective bits per weight with different scale formats
# For INT4, group_size=128:
for fmt, info in scale_formats.items():
    scale_bits = info['bytes'] * 8
    eff_bits = 4 + scale_bits / 128
    print(f"  {fmt:>10s}: {eff_bits:.3f} bits/weight "
          f"(+{scale_bits/128:.3f} overhead)")
```
```
        fp32: 4.250 bits/weight (+0.250 overhead)
        fp16: 4.125 bits/weight (+0.125 overhead)
        bf16: 4.125 bits/weight (+0.125 overhead)
    fp8_e4m3: 4.062 bits/weight (+0.062 overhead)
```
Most implementations use FP16 scales. FP8 scales are emerging on Hopper/Blackwell hardware where FP8 tensor core dequantization is native.
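Scale precision is forgiving because a slightly perturbed scale still defines a consistent grid. A small experiment (my own) comparing FP32 scales against scales round-tripped through FP16 shows nearly identical reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32) * 0.02
g, qmax = 128, 7
G = W.reshape(512, -1, g)
s32 = np.abs(G).max(axis=2, keepdims=True) / qmax   # FP32 scales
s16 = s32.astype(np.float16).astype(np.float32)     # round-tripped via FP16

mse32 = np.mean((G - s32 * np.clip(np.round(G / s32), -8, 7)) ** 2)
mse16 = np.mean((G - s16 * np.clip(np.round(G / s16), -8, 7)) ** 2)
print(f"FP32 scales MSE: {mse32:.3e}, FP16 scales MSE: {mse16:.3e}")
```

The FP16 rounding perturbs each scale by at most about one part in two thousand, which is invisible next to the INT4 rounding error itself.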
When to Use Each Granularity
Per-tensor (one scale for entire matrix):
- INT8 weights when all channels have similar magnitude
- Activation quantization where the range is computed per-token anyway
- Never use for INT4 weight quantization
Per-channel (one scale per output row):
- INT8 weight quantization (standard choice)
- When kernel support for group quantization is unavailable
- When minimizing metadata overhead matters more than quality
Per-group (one scale per group of elements):
- INT4 weight quantization (always)
- When outlier channels are present (always in LLMs)
- group_size=128 as default, 32 for maximum quality, 256 for CPU inference