At 4 bits per weight, GPTQ and AWQ deliver near-lossless quantization for most models above 7B parameters. At 3 bits, quality starts to degrade but remains usable. At 2 bits per weight, naive uniform quantization completely destroys the model. A 70B model in 2-bit uniform quantization produces incoherent text. Yet a 70B model at 2 bits per weight would fit in under 20 GB of memory, enabling single-GPU deployment on a consumer RTX 4090.
AQLM (Additive Quantization of Language Models) achieves usable 2-bit quantization through a fundamentally different approach: instead of quantizing each weight independently to one of 4 levels (which is what 2-bit uniform quantization does), AQLM groups weights into vectors and maps each vector to the nearest entry in a learned codebook. This is vector quantization, and it exploits the correlation structure between weights to achieve much better compression than element-wise quantization.
The Vector Quantization Formulation
Why Element-Wise Quantization Fails at 2 Bits
With 2 bits per weight, you have exactly 4 representable values. For symmetric uniform quantization:

$$\hat{w} = s \cdot \left(\mathrm{round}\!\left(\tfrac{w}{s} - \tfrac{1}{2}\right) + \tfrac{1}{2}\right), \qquad \hat{w} \in \{-1.5s,\ -0.5s,\ 0.5s,\ 1.5s\}$$

where $s$ is the scale factor. This means every weight in the model is approximated by one of 4 values, and the absolute error for any in-range weight is bounded by:

$$|w - \hat{w}| \le \tfrac{s}{2}$$

For weights near zero, the relative error explodes: a weight of magnitude $0.005s$ is mapped to either $0.5s$ or $-0.5s$ --- a 100x distortion.
# Demonstration: 2-bit uniform quantization of a weight matrix
import torch
def quantize_uniform_2bit(W):
    """Quantize each weight independently to 2 bits (4 levels)."""
    scale = W.abs().max() / 1.5  # For levels {-1.5s, -0.5s, 0.5s, 1.5s}
    W_q = (torch.round(W / scale - 0.5) + 0.5).clamp(-1.5, 1.5)
    W_dq = W_q * scale
    mse = ((W - W_dq) ** 2).mean()  # Mean squared reconstruction error
    return W_dq, mse
# For a typical LLM weight matrix (4096 x 4096):
W = torch.randn(4096, 4096) * 0.02  # Normal-ish LLM weights
_, mse_2bit = quantize_uniform_2bit(W)
# mse_2bit ~ 6e-4: larger than the weight variance (4e-4) itself, because the
# max-based scale pushes the nearest level (~0.5s) far from the bulk of weights.
# Signal-to-quantization-noise ratio is negative: the error carries more
# energy than the signal (terrible).
AQLM: Additive Multi-Codebook Vector Quantization
AQLM groups consecutive weights into vectors of size $d$ (typically $d = 8$) and represents each vector as a sum of entries from multiple codebooks:

$$\hat{w} = \sum_{m=1}^{M} C_m[i_m]$$

where $M$ is the number of codebooks, $C_m$ is the $m$-th codebook (a $K \times d$ matrix of $K$ vectors, each of dimension $d$), and $i_m \in \{1, \dots, K\}$ is the index into the $m$-th codebook.
# AQLM representation
# Weight matrix W: [out_features, in_features]
# Group consecutive in_features into vectors of size d
class AQLMLayer:
    def __init__(self, out_features, in_features, num_codebooks=2,
                 codebook_size=256, group_size=8):
        self.out_features = out_features
        self.in_features = in_features
        self.num_codebooks = num_codebooks  # M
        self.codebook_size = codebook_size  # K (typically 256 = 8 bits per index)
        self.group_size = group_size        # d
        self.num_groups = in_features // group_size  # Number of vectors per row
        # Codebooks: M codebooks, each with K vectors of dimension d
        # Shape: [M, K, d]
        self.codebooks = torch.zeros(num_codebooks, codebook_size, group_size)
        # Indices: for each output row and each group, M indices into codebooks
        # Shape: [out_features, num_groups, M]
        self.indices = torch.zeros(out_features, self.num_groups, num_codebooks,
                                   dtype=torch.int16)
        # Per-output-channel scales (optional)
        self.scales = torch.ones(out_features)

    def decode_weights(self):
        """Reconstruct the weight matrix from codebooks and indices."""
        W = torch.zeros(self.out_features, self.in_features)
        for row in range(self.out_features):
            for g in range(self.num_groups):
                vec = torch.zeros(self.group_size)
                for m in range(self.num_codebooks):
                    # int() because int16 tensors cannot be used as indices
                    idx = int(self.indices[row, g, m])
                    vec += self.codebooks[m, idx]  # Additive combination
                W[row, g*self.group_size:(g+1)*self.group_size] = vec
            W[row] *= self.scales[row]
        return W
# Bits per weight calculation:
# Each group of d weights is encoded as M indices, each log2(K) bits
# bits_per_weight = M * log2(K) / d + codebook_overhead
#
# Example: M=2 codebooks, K=256 (8-bit indices), d=8 (group size)
# bits_per_weight = 2 * 8 / 8 = 2.0 bits (plus negligible codebook overhead)
#
# Example: M=1 codebook, K=65536 (16-bit indices), d=8
# bits_per_weight = 1 * 16 / 8 = 2.0 bits
# Same effective bits but single codebook is less expressive than additive
The additive structure is critical. A flat codebook with $K^2$ entries can represent $K^2$ distinct vectors. Two additive codebooks with $K$ entries each can represent up to $K^2$ distinct sums, using the same number of index bits ($2\log_2 K$). But the additive structure is far more parameter-efficient: $2Kd$ codebook parameters vs $K^2 d$.
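The arithmetic behind this trade-off is easy to check. A small sketch (plain Python, the helper name is illustrative) comparing an additive M=2 configuration against a flat single codebook with the same index bits:

```python
from math import log2

def aqlm_config_stats(M, K, d):
    """Index bits per weight, codebook parameter count, and number of
    distinct representable vectors for M codebooks of K d-dim entries."""
    bits_per_weight = M * log2(K) / d  # indices only; codebooks are amortized
    codebook_params = M * K * d
    distinct_vectors = K ** M          # distinct representable sums
    return bits_per_weight, codebook_params, distinct_vectors

# Two additive codebooks vs one flat codebook with the same index bits:
print(aqlm_config_stats(M=2, K=256, d=8))    # (2.0, 4096, 65536)
print(aqlm_config_stats(M=1, K=65536, d=8))  # (2.0, 524288, 65536)
```

Same 2.0 bits per weight and the same 65,536 representable sums, but the additive form stores 128x fewer codebook parameters.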
Why Vector Quantization Works Better
Vector quantization exploits the joint distribution of weights within each group. Element-wise quantization treats each weight independently, discarding correlation information.
# Intuition: consider 2 weights (w1, w2)
# Element-wise 2-bit: 4 x 4 = 16 possible (w1_q, w2_q) pairs
# Vector 4-bit on (w1, w2): 16 possible (w1_q, w2_q) pairs
# Same number of states, but:
# - Element-wise: states are on a regular grid
# - Vector: states can be placed anywhere in 2D space
#
# If weights are correlated (as they are in trained neural nets),
# the codebook entries cluster in the populated region of weight space,
# reducing average distortion.
# Rate-distortion theory quantifies the advantage. For a unit-variance
# Gaussian source at rate R bits per sample:
#   Optimal vector quantization (d -> infinity) reaches the Shannon bound:
#     D_VQ = 2^(-2R)
#   Optimal scalar (Lloyd-Max) quantization at high rate:
#     D_SQ ~ (sqrt(3) * pi / 2) * 2^(-2R)
#
# The ratio D_SQ / D_VQ = sqrt(3) * pi / 2 ≈ 2.72, i.e. about 4.35 dB.
# Vector quantization can be ~2.7x better in MSE even for an i.i.d. Gaussian.
# For real weight distributions (non-Gaussian, correlated), the gain is larger.
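A toy experiment makes the correlation argument tangible. The sketch below (an assumed setup: a correlated 2-D Gaussian standing in for weight pairs, scikit-learn's KMeans as the codebook learner) compares element-wise 2-bit quantization against a 16-entry learned vector codebook with the same number of states:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Correlated pair of "weights": w2 tracks w1 closely
w1 = rng.normal(0, 1.0, 20000)
w2 = 0.9 * w1 + rng.normal(0, np.sqrt(1 - 0.9 ** 2), 20000)
W = np.stack([w1, w2], axis=1)  # [N, 2]

# Scalar 2-bit per coordinate: 4 levels per axis -> a fixed 4x4 grid of 16 states
scale = np.abs(W).max() / 1.5
W_scalar = (np.round(W / scale - 0.5) + 0.5).clip(-1.5, 1.5) * scale
mse_scalar = ((W - W_scalar) ** 2).mean()

# Vector 4-bit per pair: 16 freely placed codebook entries learned by k-means
km = KMeans(n_clusters=16, n_init=3, random_state=0).fit(W)
W_vq = km.cluster_centers_[km.labels_]
mse_vq = ((W - W_vq) ** 2).mean()

print(mse_scalar / mse_vq)  # > 1: VQ wins at the same number of states
```

The grid wastes most of its 16 states on corners the correlated data never visits; k-means places all 16 along the populated diagonal.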
The AQLM Optimization Algorithm
Objective Function
AQLM minimizes the layer-wise reconstruction error weighted by the Hessian of the layer inputs:

$$\min_{C,\, I} \; \big\| W X - \hat{W}(C, I)\, X \big\|_F^2$$

where $C$ denotes the codebook parameters and $I$ denotes the index assignments. This is equivalent to:

$$\min_{C,\, I} \; \operatorname{tr}\!\big( (W - \hat{W})\, H\, (W - \hat{W})^\top \big), \qquad H = X X^\top$$

The Hessian $H$ weights the error by the input correlations, ensuring that errors on high-activation dimensions are penalized more.
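The equivalence between the activation-space error and the Hessian-weighted weight-space error is easy to verify numerically; a minimal sketch with random data:

```python
import torch

torch.manual_seed(0)
W = torch.randn(16, 32)                  # layer weights [out, in]
W_hat = W + 0.01 * torch.randn(16, 32)   # some quantized approximation
X = torch.randn(32, 64)                  # calibration inputs [in, n_samples]

H = X @ X.T   # proxy Hessian [in, in]
E = W - W_hat

direct = ((W @ X - W_hat @ X) ** 2).sum()   # ||WX - W_hat X||_F^2
via_hessian = torch.trace(E @ H @ E.T)      # tr(E H E^T)
print(torch.allclose(direct, via_hessian, rtol=1e-4))  # True
```

This is why calibration only needs $H = XX^\top$ accumulated from calibration activations, not the activations themselves.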
Alternating Optimization
AQLM alternates between two steps:
# Step 1: Fix codebooks C, optimize indices I (assignment step)
# For each weight vector, find the best combination of codebook entries
def optimize_indices(W_group, codebooks, H_group):
    """
    W_group: [out_features, d] - one group of weight vectors
    codebooks: [M, K, d] - M codebooks, K entries each, dimension d
    H_group: [d, d] - Hessian block for this group

    Find indices i_1, ..., i_M for each row that minimize:
        (W - sum_m C_m[i_m])^T H (W - sum_m C_m[i_m])
    """
    M, K, d = codebooks.shape
    out_features = W_group.shape[0]
    best_indices = torch.zeros(out_features, M, dtype=torch.long)
    # Beam search: maintain top-B candidates
    beam_width = 16  # Number of candidates to keep
    for row in range(out_features):
        w = W_group[row]  # [d]
        # Initialize beam with best single-codebook assignments (codebook 0)
        residuals = w.unsqueeze(0) - codebooks[0]  # [K, d]
        costs = (residuals @ H_group * residuals).sum(dim=1)  # [K]
        topk_costs, topk_indices = costs.topk(beam_width, largest=False)
        beam = []
        for b in range(beam_width):
            beam.append({
                'indices': [topk_indices[b].item()],
                'residual': w - codebooks[0, topk_indices[b]],
                'cost': topk_costs[b].item(),
            })
        # Extend beam for codebooks 1, ..., M-1
        for m in range(1, M):
            candidates = []
            for entry in beam:
                res = entry['residual']
                # Try all K entries in codebook m
                new_residuals = res.unsqueeze(0) - codebooks[m]  # [K, d]
                new_costs = (new_residuals @ H_group * new_residuals).sum(dim=1)
                topk_c, topk_i = new_costs.topk(
                    min(beam_width, K), largest=False
                )
                for j in range(min(beam_width, K)):
                    candidates.append({
                        'indices': entry['indices'] + [topk_i[j].item()],
                        'residual': res - codebooks[m, topk_i[j]],
                        'cost': topk_c[j].item(),
                    })
            # Keep the top beam_width candidates
            candidates.sort(key=lambda x: x['cost'])
            beam = candidates[:beam_width]
        best_indices[row] = torch.tensor(beam[0]['indices'])
    return best_indices
# Step 2: Fix indices I, optimize codebooks C (codebook update step)
# This is a least-squares problem for each codebook entry
def optimize_codebooks(W_groups, indices, H_groups, codebooks_prev):
    """
    Given fixed index assignments, find optimal codebook vectors
    by solving a weighted least-squares problem.
    For codebook m, entry k, the optimal vector c_m[k] minimizes:
        sum_{rows assigned to k} (w_row - sum_{m'!=m} c_m'[i_m'] - c_m[k])^T H (...)
    codebooks_prev holds the current codebook estimates; they are needed
    to subtract the other codebooks' contributions from each residual.
    """
    M, K, d = codebooks_prev.shape
    codebooks = codebooks_prev.clone()
    for m in range(M):
        for k in range(K):
            # Find all weight vectors that use entry k in codebook m and
            # compute their residuals after subtracting the other codebooks'
            # contributions. enumerate_assignments is a helper yielding the
            # (row, group) pairs whose m-th index equals k.
            assigned_residuals = []
            assigned_H = []
            for row_idx, group_idx in enumerate_assignments(indices, m, k):
                w = W_groups[group_idx][row_idx]
                # Residual: w minus contributions from other codebooks
                residual = w.clone()
                for m_other in range(M):
                    if m_other != m:
                        idx_other = int(indices[row_idx, group_idx, m_other])
                        residual -= codebooks[m_other, idx_other]
                assigned_residuals.append(residual)
                assigned_H.append(H_groups[group_idx])
            if len(assigned_residuals) > 0:
                # Weighted average: c_m[k] = (sum H_i)^{-1} @ (sum H_i @ r_i)
                H_sum = torch.stack(assigned_H).sum(dim=0)
                Hr_sum = sum(H @ r for H, r in zip(assigned_H, assigned_residuals))
                codebooks[m, k] = torch.linalg.solve(
                    H_sum + 1e-6 * torch.eye(d), Hr_sum
                )
    return codebooks
The beam search in the index optimization step is the computational bottleneck. With $M$ codebooks, $K$ entries each, and beam width $B$, each weight vector requires evaluating roughly $K + (M-1) \cdot B \cdot K$ candidate combinations. For a 7B model with roughly $10^9$ weight groups, this is on the order of $10^{12}$ cost evaluations per pass. AQLM calibration takes 4-20 hours on 8 A100 GPUs depending on model size and number of optimization iterations.
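That cost estimate can be sketched as back-of-the-envelope arithmetic (hypothetical helper, round numbers):

```python
def beam_search_cost(M, K, B, num_groups):
    """Approximate candidate evaluations for AQLM index search:
    K for the first codebook, then B * K for each of the remaining M - 1."""
    per_vector = K + (M - 1) * B * K
    return per_vector * num_groups

# 7B parameters in groups of d=8 weights -> ~0.9e9 groups
total = beam_search_cost(M=2, K=256, B=16, num_groups=7_000_000_000 // 8)
print(f"{total:.1e}")  # on the order of 1e12 evaluations per optimization pass
```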
Codebook Initialization and Training Schedule
K-Means Initialization
The initial codebook quality significantly affects the final result. AQLM uses K-means clustering on the weight vectors:
from sklearn.cluster import KMeans

def initialize_codebooks_kmeans(W, num_codebooks, codebook_size, group_size):
    """
    Initialize codebooks using K-means on weight vector groups.
    For M > 1 codebooks, use residual initialization:
      1. Run K-means on weight vectors -> codebook 0
      2. Compute residuals (w - nearest codebook 0 entry)
      3. Run K-means on residuals -> codebook 1
      4. Repeat for codebook 2, etc.
    """
    out_features, in_features = W.shape
    num_groups = in_features // group_size
    # Reshape to weight vectors
    W_vectors = W.reshape(out_features * num_groups, group_size)  # [N_vectors, d]
    codebooks = torch.zeros(num_codebooks, codebook_size, group_size)
    residuals = W_vectors.clone()
    for m in range(num_codebooks):
        # K-means on current residuals
        kmeans = KMeans(n_clusters=codebook_size, n_init=3, max_iter=100)
        kmeans.fit(residuals.cpu().numpy())
        codebooks[m] = torch.from_numpy(kmeans.cluster_centers_).float()
        # Subtract each vector's assigned centroid to form the next residuals
        assignments = torch.from_numpy(kmeans.labels_).long()
        residuals -= codebooks[m, assignments]
    return codebooks
# Training schedule:
# Initialize codebooks via residual K-means
# For t = 1, ..., T (T ~ 10-25 iterations):
# a. Optimize indices (beam search) - ~80% of compute
# b. Optimize codebooks (least squares) - ~15% of compute
# c. Fine-tune with straight-through estimator (optional) - ~5%
# Export quantized model with codebooks + indices
Fine-Tuning with Straight-Through Estimator
AQLM can optionally fine-tune the codebooks end-to-end using gradient descent with the straight-through estimator (STE):
def aqlm_finetune_step(model_aqlm, batch, optimizer):
    """
    Fine-tune codebooks with straight-through gradient estimation.
    The forward pass uses codebook lookup (discrete).
    The backward pass passes gradients through as if the lookup were identity.
    """
    # Forward: decode weights from codebooks, run model
    for layer in model_aqlm.layers:
        for linear in layer.linear_modules:
            # Decode weight matrix from codebooks + indices
            W_decoded = decode_from_codebooks(
                linear.codebooks,  # [M, K, d] - these are the learnable params
                linear.indices,    # [out, num_groups, M] - fixed after index opt
                linear.scales      # [out] - also learnable
            )
            linear.weight_for_forward = W_decoded
    loss = model_aqlm(batch)
    loss.backward()
    # Gradients flow through to codebook entries via STE
    # Only codebook vectors and scales are updated, not indices
    optimizer.step()
    optimizer.zero_grad()
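One detail worth noting: once the indices are frozen, the decode step is an ordinary differentiable gather, so gradients reach the selected codebook rows directly through autograd. A minimal sketch (toy shapes, not the AQLM implementation):

```python
import torch

torch.manual_seed(0)
# Two codebooks (M=2) of K=4 entries each, d=3; indices fixed after assignment
codebooks = torch.randn(2, 4, 3, requires_grad=True)
indices = torch.tensor([[0, 2],
                        [1, 3]])  # [num_groups=2, M=2]

# Decode: additive gather -- an ordinary differentiable operation
vecs = codebooks[0, indices[:, 0]] + codebooks[1, indices[:, 1]]  # [2, 3]
loss = (vecs ** 2).sum()
loss.backward()

# Gradients land only on the four selected codebook rows
touched = codebooks.grad.abs().sum(dim=-1) > 0
print(touched)  # entries (0,0), (0,1), (1,2), (1,3) are True
```

The straight-through estimator only becomes necessary if the discrete assignments themselves are re-quantized during fine-tuning; updating codebook vectors and scales with fixed indices needs no gradient approximation at all.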
Inference Kernel Design
The Codebook Lookup Kernel
At inference time, the AQLM kernel must decode weight vectors from codebook indices and compute the matrix multiply. The critical challenge is that codebook lookup replaces the simple memory read of a standard quantized kernel with an indexed gather operation.
// AQLM inference kernel: codebook-based dequantization + GEMM
// For each output element C[i,j]:
// C[i,j] = sum over groups g of:
// dot(X[j, g*d : (g+1)*d], sum_m codebook[m][ indices[i,g,m] ])
// Fused kernel: codebook lookup + dot product in shared memory
__global__ void aqlm_gemv_kernel(
    const half* __restrict__ X,            // [1, in_features] (single token)
    const uint16_t* __restrict__ indices,  // [out_features, num_groups, M]
    const half* __restrict__ codebooks,    // [M, K, d]
    const half* __restrict__ scales,       // [out_features]
    half* __restrict__ output,             // [1, out_features]
    int out_features, int num_groups, int M, int K, int d
) {
    // Each thread block handles one output row
    int row = blockIdx.x;
    if (row >= out_features) return;

    // Cooperatively load the codebooks into shared memory (fits for small M*K*d)
    extern __shared__ half smem[];
    half* codebook_smem = smem;  // [M * K * d]
    int total_cb_elems = M * K * d;
    for (int i = threadIdx.x; i < total_cb_elems; i += blockDim.x) {
        codebook_smem[i] = codebooks[i];
    }
    __syncthreads();

    // Compute dot product: each thread covers a strided subset of groups
    float local_sum = 0.0f;
    for (int g = threadIdx.x; g < num_groups; g += blockDim.x) {
        // Decode the weight vector for this group
        half w_vec[8];  // Assumes d <= 8
        for (int dd = 0; dd < d; dd++) w_vec[dd] = __float2half(0.0f);
        for (int m = 0; m < M; m++) {
            uint16_t idx = indices[row * num_groups * M + g * M + m];
            for (int dd = 0; dd < d; dd++) {
                w_vec[dd] = __hadd(w_vec[dd],
                                   codebook_smem[m * K * d + idx * d + dd]);
            }
        }
        // Dot product with the input slice
        for (int dd = 0; dd < d; dd++) {
            float x_val = __half2float(X[g * d + dd]);
            float w_val = __half2float(w_vec[dd]);
            local_sum += x_val * w_val;
        }
    }

    // Warp reduction
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        local_sum += __shfl_down_sync(0xffffffff, local_sum, offset);
    }

    // Block reduction
    __shared__ float warp_sums[32];
    int warp_id = threadIdx.x / warpSize;
    int lane_id = threadIdx.x % warpSize;
    if (lane_id == 0) warp_sums[warp_id] = local_sum;
    __syncthreads();
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int w = 0; w < (blockDim.x + warpSize - 1) / warpSize; w++) {
            total += warp_sums[w];
        }
        output[row] = __float2half(total * __half2float(scales[row]));
    }
}
Performance Characteristics
The codebook lookup introduces a fundamentally different memory access pattern compared to standard dequantization:
Standard INT4 dequantization (GPTQ/AWQ):
- Read: 4 bits per weight, contiguous
- Compute: shift, mask, multiply by scale
- Memory pattern: sequential, coalesced
- Throughput: high (dequant is a few cheap shift/mask/multiply ops; memory-bound)
AQLM codebook lookup:
- Read: 8-16 bit index per group of 8 weights
- Gather: codebook[index] (random access into codebook)
- Add: sum M codebook entries
- Memory pattern: random (codebook gather), non-coalesced
- Throughput: lower (the codebook gather latency is the bottleneck)
If codebook fits in L1/L2 cache:
- Latency per lookup: ~30 cycles (L1 hit) or ~200 cycles (L2 hit)
- M=2 codebooks, K=256, d=8, FP16: 256 * 8 * 2 bytes = 4 KB per codebook
- Total: 2 * 4 KB = 8 KB -> fits comfortably in 128 KB L1 per SM
If codebook does NOT fit in cache:
- Latency per lookup: ~500-800 cycles (HBM access)
- Performance degrades significantly
[Figure: AQLM 2-bit vs GPTQ 4-bit decode throughput, tokens/sec (Llama-2-7B, H100)]

AQLM at 2 bits is slower than GPTQ at 4 bits despite using less memory, because the codebook gather operation has lower throughput than the simple bit-shift dequantization used by GPTQ. The memory bandwidth savings from 2-bit vs 4-bit weights are partially offset by the codebook access overhead. AQLM wins on memory capacity (fitting larger models on fewer GPUs) but not on raw throughput per GPU.
Quality Comparison at Extreme Compression
2-Bit Quantization Results
# Perplexity comparison at 2 bits per weight
# All methods use 128 calibration samples from WikiText-2
# Evaluated on WikiText-2 test set
results = {
"Llama-2-7B": {
"FP16": 5.47,
"GPTQ 2-bit": 12.8, # Barely functional
"RTN 2-bit": 142.0, # Complete failure
"AQLM 2-bit": 7.89, # Usable
"QuIP# 2-bit": 7.45, # Best at 2-bit
},
"Llama-2-13B": {
"FP16": 4.88,
"GPTQ 2-bit": 9.21,
"RTN 2-bit": 87.3,
"AQLM 2-bit": 6.34,
"QuIP# 2-bit": 6.02,
},
"Llama-2-70B": {
"FP16": 3.32,
"GPTQ 2-bit": 5.87,
"RTN 2-bit": 28.4,
"AQLM 2-bit": 4.21,
"QuIP# 2-bit": 3.98,
},
}
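The per-model degradations can be averaged directly from these numbers; a small sketch (restating the relevant subset of the dict for self-containment):

```python
# Average perplexity degradation vs FP16 across the three Llama-2 sizes
results = {
    "Llama-2-7B":  {"FP16": 5.47, "GPTQ 2-bit": 12.8, "AQLM 2-bit": 7.89, "QuIP# 2-bit": 7.45},
    "Llama-2-13B": {"FP16": 4.88, "GPTQ 2-bit": 9.21, "AQLM 2-bit": 6.34, "QuIP# 2-bit": 6.02},
    "Llama-2-70B": {"FP16": 3.32, "GPTQ 2-bit": 5.87, "AQLM 2-bit": 4.21, "QuIP# 2-bit": 3.98},
}

def avg_degradation(method):
    deltas = [m[method] - m["FP16"] for m in results.values()]
    return sum(deltas) / len(deltas)

for method in ("GPTQ 2-bit", "AQLM 2-bit", "QuIP# 2-bit"):
    print(method, round(avg_degradation(method), 2))
# GPTQ 2-bit ~ +4.74, AQLM 2-bit ~ +1.59, QuIP# 2-bit ~ +1.26
```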
2-Bit Quantization Quality (WikiText-2 Perplexity, Lower is Better)
| Method | Llama-2-7B | Llama-2-13B | Llama-2-70B | Avg Degradation |
|---|---|---|---|---|
| FP16 (baseline) | 5.47 | 4.88 | 3.32 | 0 |
| RTN 2-bit | 142.0 | 87.3 | 28.4 | Destroyed |
| GPTQ 2-bit | 12.8 | 9.21 | 5.87 | +4.7 |
| AQLM 2-bit (M=2, K=256, d=8) | 7.89 | 6.34 | 4.21 | +1.6 |
| QuIP# 2-bit | 7.45 | 6.02 | 3.98 | +1.3 |
| GPTQ 4-bit (reference) | 5.62 | 4.97 | 3.41 | +0.11 |
Quality vs Bits-Per-Weight Frontier
AQLM and QuIP# define the Pareto frontier for extreme compression. Below 3 bits, vector quantization methods significantly outperform scalar methods:
Bits per weight vs quality (Llama-2-70B, PPL):
4.0 bits: GPTQ=3.41, AWQ=3.39, AQLM=3.38 (all similar, near-lossless)
3.0 bits: GPTQ=3.78, AQLM=3.62, QuIP#=3.55 (VQ starts winning)
2.5 bits: GPTQ=4.45, AQLM=3.89, QuIP#=3.74 (VQ clearly better)
2.0 bits: GPTQ=5.87, AQLM=4.21, QuIP#=3.98 (VQ dramatically better)
1.5 bits: GPTQ=N/A, AQLM=5.62, QuIP#=5.21 (only VQ methods work)
The crossover point where VQ methods become clearly superior is around 3 bits.
Below 3 bits, scalar quantization cannot represent the weight distribution with
enough fidelity, while VQ methods exploit inter-weight correlations.
Comparison with QuIP#
QuIP# (Quantization with Incoherence Processing, with lattice codebooks) is AQLM’s main competitor at extreme compression rates. The key differences:
# AQLM vs QuIP#: architectural differences
# AQLM:
# - Additive multi-codebook (learned codebooks)
# - Beam search assignment
# - Codebooks are optimized per-model via alternating minimization
# - Inference: gather from learned codebook tables
# QuIP#:
# - Lattice-based codebook (E8 lattice, mathematically optimal packing)
# - Incoherence processing: random Hadamard rotation of weights before quantization
# - Lattice codebooks are fixed (no per-model learning)
# - Inference: decode from lattice code (no table lookup needed)
# Key trade-off:
# QuIP# slightly better quality (lattice codes are near-optimal)
# AQLM slightly faster inference (simpler lookup, no Hadamard transforms)
# QuIP# much faster calibration (no iterative codebook optimization)
# AQLM more flexible (can trade off M, K, d for different bit rates)
def quip_sharp_pipeline(W, num_bits=2):
    """QuIP# simplified pipeline (random_hadamard_matrix and
    e8_lattice_quantize stand in for the real implementations)."""
    # Step 1: Incoherence processing -- a random orthogonal (Hadamard)
    # rotation spreads outlier channels across all dimensions
    Q = random_hadamard_matrix(W.shape[1])
    W_rotated = W @ Q
    # Step 2: Quantize with the E8 lattice codebook
    # (E8 packs R^8 optimally; its kissing number is 240)
    W_quantized = e8_lattice_quantize(W_rotated, num_bits)
    # Step 3: The rotation must be undone at inference time:
    # Y = (W @ Q) @ (Q^T @ X) = W_quantized @ (Q^T @ X)
    return W_quantized, Q.T
# QuIP# inference adds Q^T @ X at each layer (Hadamard transform)
# This is O(d * log(d)) per token, negligible for large hidden dims
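The O(d log d) claim comes from the butterfly structure of the fast Walsh-Hadamard transform. A minimal sketch (not the QuIP# kernel; assumes the dimension is a power of two, and normalizes so the transform is its own inverse):

```python
import torch

def fwht(x):
    """Fast Walsh-Hadamard transform along the last dim, O(d log d).
    Normalized by 1/sqrt(d) so that fwht(fwht(x)) == x."""
    d = x.shape[-1]
    assert d & (d - 1) == 0, "dimension must be a power of two"
    y = x.clone()
    h = 1
    while h < d:
        # Split each block of 2h elements into halves a, b;
        # replace them with (a + b, a - b) -- one butterfly stage
        y = y.reshape(-1, d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return y.reshape(x.shape) / d ** 0.5

x = torch.randn(4, 8)
print(torch.allclose(fwht(fwht(x)), x, atol=1e-5))  # True: orthogonal rotation
```

Each of the log2(d) stages touches all d elements once, which is why the per-token rotation cost is negligible next to the GEMV itself for large hidden dims.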
AQLM vs QuIP# Trade-offs
| Property | AQLM | QuIP# |
|---|---|---|
| 2-bit PPL (Llama-2-70B) | 4.21 | 3.98 |
| Calibration time (70B, 8x A100) | 16 hours | 2 hours |
| Inference overhead | Codebook lookup | Hadamard transform |
| Decode throughput (tokens/sec) | ~5900 | ~5600 |
| Bit-rate flexibility | Continuous (vary M, K, d) | Discrete (lattice-dependent) |
| HuggingFace integration | transformers native | Separate library |
Production Deployment Considerations
When to Use 2-Bit Quantization
# Decision criteria for extreme compression
use_2bit = (
    # The model does not fit at 4-bit on available hardware
    (model_size_4bit_gb > available_vram_gb)
    and
    # The model DOES fit at 2-bit
    (model_size_2bit_gb <= available_vram_gb)
    and
    # You cannot add more GPUs (cost or infrastructure constraint)
    (cannot_add_gpus)
    and
    # The quality degradation is acceptable for your use case
    (ppl_degradation_2bit < quality_threshold)
)
# Common scenario: running 70B model on a single 24GB GPU
# FP16: 140 GB -> needs 2x H100 or 8x RTX 4090
# 4-bit: 35 GB -> needs 2x RTX 4090
# 2-bit: 18 GB -> fits on 1x RTX 4090 with room for KV cache
# But consider: is 70B at 2-bit better than 13B at 4-bit?
# Llama-2-70B AQLM 2-bit PPL: 4.21
# Llama-2-13B GPTQ 4-bit PPL: 4.97
# The 70B at 2-bit wins! Larger models are more robust to compression.
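The memory arithmetic in that scenario reduces to one line; a small helper (decimal GB, weights only, hypothetical name) reproduces the figures above:

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in decimal GB; excludes KV cache, activations,
    and the (small) codebook/scale overhead."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 4, 2):
    print(bits, round(weight_memory_gb(70e9, bits), 1))
# 16 -> 140.0 GB, 4 -> 35.0 GB, 2 -> 17.5 GB
```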
HuggingFace Integration
# Loading AQLM models in HuggingFace transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pre-quantized AQLM models are available on HuggingFace Hub
model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf"
)
# Model configuration in the AQLM config:
# {
# "nbits_per_codebook": 16, # log2(K) per codebook
# "num_codebooks": 1, # M=1
# "out_group_size": 1, # groups along output dim
# "in_group_size": 8 # d=8
# }
# Effective bits: 1 * 16 / 8 = 2.0 bits per weight
Summary
AQLM enables usable 2-bit quantization through additive multi-codebook vector quantization, achieving roughly 1.5 perplexity points better than GPTQ at the same bit rate on large models. The key insight is that groups of weights have correlated structure that vector quantization can exploit, while scalar quantization treats each weight independently and wastes bits. The cost is a more complex inference kernel (codebook lookup vs simple dequantization) and a much longer calibration process (hours vs minutes). For deployments where the model does not fit at 4-bit and adding GPUs is not an option, AQLM provides the best quality at extreme compression rates. For most production deployments where 4-bit quantization suffices, the simpler GPTQ or AWQ methods remain preferable.