Part 23 of 30 in the Quantization Masterclass series.

At 4 bits per weight, GPTQ and AWQ deliver near-lossless quantization for most models above 7B parameters. At 3 bits, quality starts to degrade but remains usable. At 2 bits per weight, naive uniform quantization completely destroys the model. A 70B model in 2-bit uniform quantization produces incoherent text. Yet a 70B model at 2 bits per weight would fit in under 20 GB of memory, enabling single-GPU deployment on a consumer RTX 4090.

AQLM (Additive Quantization of Language Models) achieves usable 2-bit quantization through a fundamentally different approach: instead of quantizing each weight independently to one of 4 levels (which is what 2-bit uniform quantization does), AQLM groups weights into vectors and maps each vector to the nearest entry in a learned codebook. This is vector quantization, and it exploits the correlation structure between weights to achieve much better compression than element-wise quantization.

The Vector Quantization Formulation

Why Element-Wise Quantization Fails at 2 Bits

With 2 bits per weight, you have exactly 4 representable values. For symmetric uniform quantization:

\{-3s, -s, s, 3s\}

where s is the scale factor. This means every weight in the model is approximated by one of 4 values. The maximum relative error for any weight is bounded by:

\frac{|w - \hat{w}|}{|w|} \leq \frac{s}{|w|}

For weights near zero, the relative error explodes. For a weight of magnitude 0.001 with scale s = 0.1, the nearest representable value is ±0.1 (the level ±s), a 100x distortion.

# Demonstration: 2-bit uniform quantization of a weight matrix
import torch

def quantize_uniform_2bit(W):
    """Quantize each weight independently to 2 bits (4 symmetric levels)."""
    scale = W.abs().max() / 1.5  # For levels {-1.5s, -0.5s, 0.5s, 1.5s}
    # Shift by half a step so rounding lands on the half-integer levels
    W_q = torch.round(W / scale - 0.5).clamp(-2, 1)  # Map to {-2, -1, 0, 1}
    W_dq = (W_q + 0.5) * scale                        # {-1.5s, -0.5s, 0.5s, 1.5s}
    # MSE
    mse = ((W - W_dq) ** 2).mean()
    return W_dq, mse

# For a typical LLM weight matrix (4096 x 4096):
W = torch.randn(4096, 4096) * 0.02  # Normal-ish LLM weights
_, mse_2bit = quantize_uniform_2bit(W)
# mse_2bit is a large fraction of the weight variance (4e-4): even an
# MSE-optimal 2-bit scalar quantizer only reaches ~9-11 dB SQNR (terrible)

AQLM: Additive Multi-Codebook Vector Quantization

AQLM groups consecutive weights into vectors of size d (typically 8) and represents each vector as a sum of entries from multiple codebooks:

\hat{\mathbf{w}} = \sum_{m=1}^{M} \mathbf{c}_{m}[i_m]

where M is the number of codebooks, \mathbf{c}_m is the m-th codebook (a matrix of K vectors, each of dimension d), and i_m \in \{0, 1, \ldots, K-1\} is the index into the m-th codebook.

# AQLM representation
# Weight matrix W: [out_features, in_features]
# Group consecutive in_features into vectors of size d

class AQLMLayer:
    def __init__(self, out_features, in_features, num_codebooks=2,
                 codebook_size=256, group_size=8):
        self.out_features = out_features
        self.in_features = in_features
        self.num_codebooks = num_codebooks  # M
        self.codebook_size = codebook_size   # K (typically 256 = 8 bits per index)
        self.group_size = group_size          # d

        self.num_groups = in_features // group_size  # Number of vectors per row

        # Codebooks: M codebooks, each with K vectors of dimension d
        # Shape: [M, K, d]
        self.codebooks = torch.zeros(num_codebooks, codebook_size, group_size)

        # Indices: for each output row and each group, M indices into codebooks
        # Shape: [out_features, num_groups, M]
        self.indices = torch.zeros(out_features, self.num_groups, num_codebooks,
                                    dtype=torch.int16)

        # Per-output-channel scales (optional)
        self.scales = torch.ones(out_features)

    def decode_weights(self):
        """Reconstruct the weight matrix from codebooks and indices."""
        W = torch.zeros(self.out_features, self.in_features)
        for row in range(self.out_features):
            for g in range(self.num_groups):
                vec = torch.zeros(self.group_size)
                for m in range(self.num_codebooks):
                    # Cast to int: PyTorch indexing rejects int16 index tensors
                    idx = int(self.indices[row, g, m])
                    vec += self.codebooks[m, idx]  # Additive combination
                W[row, g*self.group_size:(g+1)*self.group_size] = vec
            W[row] *= self.scales[row]
        return W

# Bits per weight calculation:
# Each group of d weights is encoded as M indices, each log2(K) bits
# bits_per_weight = M * log2(K) / d + codebook_overhead
#
# Example: M=2 codebooks, K=256 (8-bit indices), d=8 (group size)
# bits_per_weight = 2 * 8 / 8 = 2.0 bits (plus negligible codebook overhead)
#
# Example: M=1 codebook, K=65536 (16-bit indices), d=8
# bits_per_weight = 1 * 16 / 8 = 2.0 bits
# Same effective bits; the single 65536-entry codebook is more expressive
# (its entries are unconstrained) but needs ~128x more codebook parameters
ℹ️ Note

The additive structure is critical. A single codebook with K = 2^16 entries can represent 2^16 = 65536 distinct vectors. Two additive codebooks with K = 2^8 each can represent up to 2^8 × 2^8 = 2^16 = 65536 distinct sums, using the same number of index bits. But the additive structure is far more parameter-efficient: 2 × 256 × 8 = 4096 codebook parameters vs 65536 × 8 = 524288 codebook parameters.
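The bit-rate and storage arithmetic above can be wrapped in a small helper (the function name is mine, not part of any AQLM API):

```python
import math

def aqlm_config_stats(M, K, d, dtype_bytes=2):
    """Bits per weight and codebook storage for an additive VQ config."""
    bits_per_weight = M * math.log2(K) / d   # index bits amortized per weight
    codebook_params = M * K * d              # learned codebook entries
    codebook_bytes = codebook_params * dtype_bytes  # FP16 storage
    return bits_per_weight, codebook_params, codebook_bytes

# Two configs from the text, both 2.0 bits per weight:
print(aqlm_config_stats(M=2, K=256, d=8))    # (2.0, 4096, 8192)
print(aqlm_config_stats(M=1, K=65536, d=8))  # (2.0, 524288, 1048576)
```

Both configurations spend the same index bits per weight; they differ only in how many codebook parameters must be stored and fit.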

Why Vector Quantization Works Better

Vector quantization exploits the joint distribution of weights within each group. Element-wise quantization treats each weight independently, discarding correlation information.

# Intuition: consider 2 weights (w1, w2)
# Element-wise 2-bit: 4 x 4 = 16 possible (w1_q, w2_q) pairs
# Vector 4-bit on (w1, w2): 16 possible (w1_q, w2_q) pairs
# Same number of states, but:
#   - Element-wise: states are on a regular grid
#   - Vector: states can be placed anywhere in 2D space
#
# If weights are correlated (as they are in trained neural nets),
# the codebook entries cluster in the populated region of weight space,
# reducing average distortion.

# Rate-distortion theory quantifies the advantage. For a Gaussian source
# with variance sigma^2 at rate R bits per symbol:
#   Optimal vector quantization approaches the Shannon bound as d -> infinity:
#     D_VQ -> sigma^2 * 2^(-2R)
#   Fixed-rate optimal (Lloyd-Max) scalar quantization at high rate:
#     D_SQ ~ (sqrt(3) * pi / 2) * sigma^2 * 2^(-2R)
#
# The gap D_SQ / D_VQ = sqrt(3) * pi / 2 ~ 2.72, i.e. about 4.35 dB:
# vector quantization can be ~2.7x better in MSE for an i.i.d. Gaussian source.
# For real weight distributions (non-Gaussian, correlated), the gain is larger.
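This intuition can be checked empirically at toy scale (my own sketch, not AQLM code): quantize correlated Gaussian pairs at 2 bits per weight, once element-wise with 4 uniform levels per component, and once with a 16-entry pair codebook fit by K-means (same total bits).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Correlated 2D Gaussian "weights": strong correlation between neighbors
n = 20000
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
W = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # [n, 2]

# Scalar 2-bit: 4 uniform levels per component, scale from the std
s = W.std()
levels = np.array([-1.5, -0.5, 0.5, 1.5]) * s
idx = np.abs(W[..., None] - levels).argmin(axis=-1)   # nearest level
mse_sq = ((W - levels[idx]) ** 2).mean()

# Vector 4-bit per pair (same 2 bits/weight): 16 centroids via K-means
km = KMeans(n_clusters=16, n_init=3, random_state=0).fit(W)
mse_vq = ((W - km.cluster_centers_[km.labels_]) ** 2).mean()

print(mse_sq, mse_vq)  # VQ has clearly lower MSE on correlated data
```

With correlation 0.9 the codebook entries concentrate along the populated diagonal of weight space, while the scalar grid wastes states in the empty corners.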

The AQLM Optimization Algorithm

Objective Function

AQLM minimizes the layer-wise reconstruction error weighted by the Hessian:

\min_{\mathbf{C}, \mathbf{I}} \| \mathbf{W} \mathbf{X} - \hat{\mathbf{W}}(\mathbf{C}, \mathbf{I}) \mathbf{X} \|_F^2

where \mathbf{C} denotes the codebook parameters and \mathbf{I} denotes the index assignments. This is equivalent to:

\min_{\mathbf{C}, \mathbf{I}} \text{tr}\left((\mathbf{W} - \hat{\mathbf{W}})^T (\mathbf{W} - \hat{\mathbf{W}})\, \mathbf{X}\mathbf{X}^T\right)

The Hessian \mathbf{H} = \mathbf{X}\mathbf{X}^T weights the error by the input correlations, ensuring that errors on high-activation dimensions are penalized more.
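The equivalence between the Frobenius-norm objective and the trace form is easy to verify numerically:

```python
import torch

torch.manual_seed(0)
W = torch.randn(16, 32)               # weights [out_features, in_features]
W_hat = W + 0.01 * torch.randn_like(W)  # a perturbed "quantized" weight
X = torch.randn(32, 64)               # calibration inputs [in_features, tokens]

H = X @ X.T                           # Hessian proxy [in, in]
E = W - W_hat                         # reconstruction error

frob = ((W @ X - W_hat @ X) ** 2).sum()   # ||WX - W_hat X||_F^2
trace = torch.trace(E.T @ E @ H)          # tr(E^T E X X^T)

print(frob.item(), trace.item())  # identical up to float rounding
```

The identity follows from ||EX||_F^2 = tr(E X X^T E^T) and the cyclic property of the trace.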

Alternating Optimization

AQLM alternates between two steps:

# Step 1: Fix codebooks C, optimize indices I (assignment step)
# For each weight vector, find the best combination of codebook entries

def optimize_indices(W_group, codebooks, H_group):
    """
    W_group: [out_features, d] - one group of weight vectors
    codebooks: [M, K, d] - M codebooks, K entries each, dimension d
    H_group: [d, d] - Hessian block for this group

    Find indices i_1, ..., i_M for each row that minimize:
    (W - sum_m C_m[i_m])^T H (W - sum_m C_m[i_m])
    """
    M, K, d = codebooks.shape
    out_features = W_group.shape[0]
    best_indices = torch.zeros(out_features, M, dtype=torch.long)

    # Beam search: maintain top-B candidates
    beam_width = 16  # Number of candidates to keep

    for row in range(out_features):
        w = W_group[row]  # [d]

        # Initialize beam with best single-codebook assignments
        # For codebook 0:
        residuals = w.unsqueeze(0) - codebooks[0]  # [K, d]
        costs = (residuals @ H_group * residuals).sum(dim=1)  # [K]
        topk_costs, topk_indices = costs.topk(beam_width, largest=False)

        beam = []
        for b in range(beam_width):
            beam.append({
                'indices': [topk_indices[b].item()],
                'residual': w - codebooks[0, topk_indices[b]],
                'cost': topk_costs[b].item()
            })

        # Extend beam for codebooks 1, ..., M-1
        for m in range(1, M):
            candidates = []
            for entry in beam:
                res = entry['residual']
                # Try all K entries in codebook m
                new_residuals = res.unsqueeze(0) - codebooks[m]  # [K, d]
                new_costs = (new_residuals @ H_group * new_residuals).sum(dim=1)
                topk_c, topk_i = new_costs.topk(
                    min(beam_width, K), largest=False
                )

                for j in range(min(beam_width, K)):
                    candidates.append({
                        'indices': entry['indices'] + [topk_i[j].item()],
                        'residual': res - codebooks[m, topk_i[j]],
                        'cost': topk_c[j].item()
                    })

            # Keep top beam_width candidates
            candidates.sort(key=lambda x: x['cost'])
            beam = candidates[:beam_width]

        best_indices[row] = torch.tensor(beam[0]['indices'])

    return best_indices

# Step 2: Fix indices I, optimize codebooks C (codebook update step)
# This is a least-squares problem for each codebook entry

def enumerate_assignments(indices, m, k):
    """Yield (row_idx, group_idx) pairs whose group uses entry k of codebook m."""
    rows, groups = (indices[:, :, m] == k).nonzero(as_tuple=True)
    yield from zip(rows.tolist(), groups.tolist())

def optimize_codebooks(W_groups, indices, H_groups, M, K, d):
    """
    Given fixed index assignments, find optimal codebook vectors
    by solving a weighted least-squares problem.

    For codebook m, entry k, the optimal vector c_m[k] minimizes:
    sum_{rows assigned to k} (w_row - sum_{m'!=m} c_m'[i_m'] - c_m[k])^T H (...)
    """
    codebooks = torch.zeros(M, K, d)

    for m in range(M):
        for k in range(K):
            # Find all weight vectors that use entry k in codebook m
            # Compute their residuals (after subtracting other codebook contributions)
            # Solve weighted least squares: H @ c = H @ mean_residual
            assigned_residuals = []
            assigned_H = []

            for row_idx, group_idx in enumerate_assignments(indices, m, k):
                w = W_groups[group_idx][row_idx]
                # Residual: w minus contributions from other codebooks
                residual = w.clone()
                for m_other in range(M):
                    if m_other != m:
                        idx_other = indices[row_idx, group_idx, m_other]
                        residual -= codebooks[m_other, idx_other]
                assigned_residuals.append(residual)
                assigned_H.append(H_groups[group_idx])

            if len(assigned_residuals) > 0:
                # Weighted average: c_m[k] = (sum H_i)^{-1} @ (sum H_i @ r_i)
                H_sum = torch.stack(assigned_H).sum(dim=0)
                Hr_sum = sum(H @ r for H, r in zip(assigned_H, assigned_residuals))
                codebooks[m, k] = torch.linalg.solve(
                    H_sum + 1e-6 * torch.eye(d), Hr_sum
                )

    return codebooks
Performance

The beam search in the index optimization step is the computational bottleneck. With M = 2 codebooks, K = 256 entries, and beam width B = 16, each weight vector requires evaluating K + B × K = 256 + 16 × 256 = 4352 candidate combinations. For a 7B model with roughly 10^9 weight groups, this is ~4 × 10^12 evaluations. AQLM calibration takes 4-20 hours on 8 A100 GPUs depending on model size and number of optimization iterations.
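The candidate counts quoted above follow directly from the beam-search structure: the first codebook scores all K entries, and each subsequent codebook scores K extensions for each of the B beam entries (helper name is mine, for illustration):

```python
def beam_search_evals(M, K, B):
    """Codebook-entry evaluations per weight vector during index search."""
    return K + (M - 1) * B * K

per_vector = beam_search_evals(M=2, K=256, B=16)
print(per_vector)  # 4352

# ~1e9 weight groups for a 7B model at d=8
total = per_vector * 1e9
print(f"{total:.1e}")  # ~4.4e12 candidate evaluations
```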

Codebook Initialization and Training Schedule

K-Means Initialization

The initial codebook quality significantly affects the final result. AQLM uses K-means clustering on the weight vectors:

def initialize_codebooks_kmeans(W, num_codebooks, codebook_size, group_size):
    """
    Initialize codebooks using K-means on weight vector groups.

    For M > 1 codebooks, use residual initialization:
    1. Run K-means on weight vectors -> codebook 0
    2. Compute residuals (w - nearest codebook 0 entry)
    3. Run K-means on residuals -> codebook 1
    4. Repeat for codebook 2, etc.
    """
    out_features, in_features = W.shape
    num_groups = in_features // group_size

    # Reshape to weight vectors
    W_vectors = W.reshape(out_features * num_groups, group_size)  # [N_vectors, d]

    codebooks = torch.zeros(num_codebooks, codebook_size, group_size)
    residuals = W_vectors.clone()

    for m in range(num_codebooks):
        # K-means on current residuals
        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=codebook_size, n_init=3, max_iter=100)
        kmeans.fit(residuals.cpu().numpy())

        codebooks[m] = torch.from_numpy(kmeans.cluster_centers_).float()

        # Update residuals (vectorized; labels_ must be cast to long for indexing)
        assignments = torch.from_numpy(kmeans.labels_).long()
        residuals -= codebooks[m, assignments]

    return codebooks

# Training schedule:
# Initialize codebooks via residual K-means
# For t = 1, ..., T (T ~ 10-25 iterations):
#    a. Optimize indices (beam search) - ~80% of compute
#    b. Optimize codebooks (least squares) - ~15% of compute
#    c. Fine-tune with straight-through estimator (optional) - ~5%
# Export quantized model with codebooks + indices
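The residual-initialization idea can be sanity-checked on synthetic vectors: each additional codebook should remove a chunk of the remaining residual energy. A self-contained toy sketch (numpy + scikit-learn, sizes shrunk for speed; not the real calibration pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(0, 0.02, size=(4096, 8))  # toy weight vectors, d=8

residuals = vectors.copy()
energies = [float((residuals ** 2).mean())]

for m in range(2):  # M=2 codebooks, K=64 for a fast demo
    km = KMeans(n_clusters=64, n_init=2, random_state=0).fit(residuals)
    # Subtract each vector's nearest centroid; what remains feeds codebook m+1
    residuals = residuals - km.cluster_centers_[km.labels_]
    energies.append(float((residuals ** 2).mean()))

print(energies)  # residual energy shrinks with each codebook
```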

Fine-Tuning with Straight-Through Estimator

AQLM can optionally fine-tune the codebooks end-to-end using gradient descent with the straight-through estimator (STE):

def aqlm_finetune_step(model_aqlm, batch, optimizer):
    """
    Fine-tune codebooks with straight-through gradient estimation.
    The forward pass uses codebook lookup (discrete).
    The backward pass passes gradients through as if the lookup were identity.
    """
    # Forward: decode weights from codebooks, run model
    for layer in model_aqlm.layers:
        for linear in layer.linear_modules:
            # Decode weight matrix from codebooks + indices
            W_decoded = decode_from_codebooks(
                linear.codebooks,  # [M, K, d] - these are the learnable params
                linear.indices,     # [out, num_groups, M] - fixed after index opt
                linear.scales       # [out] - also learnable
            )
            linear.weight_for_forward = W_decoded

    loss = model_aqlm(batch)
    loss.backward()

    # Gradients flow through to codebook entries via STE
    # Only codebook vectors and scales are updated, not indices
    optimizer.step()
    optimizer.zero_grad()

Inference Kernel Design

The Codebook Lookup Kernel

At inference time, the AQLM kernel must decode weight vectors from codebook indices and compute the matrix multiply. The critical challenge is that codebook lookup replaces the simple memory read of a standard quantized kernel with an indexed gather operation.

// AQLM inference kernel: codebook-based dequantization + GEMM
// For each output element C[i,j]:
// C[i,j] = sum over groups g of:
//   dot(X[j, g*d : (g+1)*d],  sum_m codebook[m][ indices[i,g,m] ])

// Fused kernel: codebook lookup + dot product in shared memory

__global__ void aqlm_gemv_kernel(
    const half* __restrict__ X,          // [1, in_features] (single token)
    const uint16_t* __restrict__ indices, // [out_features, num_groups, M]
    const half* __restrict__ codebooks,   // [M, K, d]
    const half* __restrict__ scales,       // [out_features]
    half* __restrict__ output,             // [1, out_features]
    int out_features, int num_groups, int M, int K, int d
) {
    // Each thread block handles one output row
    int row = blockIdx.x;
    if (row >= out_features) return;

    // Load codebook into shared memory (fits for small K*d)
    extern __shared__ half smem[];
    half* codebook_smem = smem;  // [M * K * d]

    // Cooperative load of codebooks
    int total_cb_elems = M * K * d;
    for (int i = threadIdx.x; i < total_cb_elems; i += blockDim.x) {
        codebook_smem[i] = codebooks[i];
    }
    __syncthreads();

    // Compute dot product: sum over groups
    float local_sum = 0.0f;

    for (int g = threadIdx.x; g < num_groups; g += blockDim.x) {
        // Decode weight vector for this group
        half w_vec[8];  // Assuming d=8
        for (int dd = 0; dd < d; dd++) w_vec[dd] = __float2half(0.0f);

        for (int m = 0; m < M; m++) {
            uint16_t idx = indices[row * num_groups * M + g * M + m];
            for (int dd = 0; dd < d; dd++) {
                w_vec[dd] = __hadd(w_vec[dd],
                    codebook_smem[m * K * d + idx * d + dd]);
            }
        }

        // Dot product with input
        for (int dd = 0; dd < d; dd++) {
            float x_val = __half2float(X[g * d + dd]);
            float w_val = __half2float(w_vec[dd]);
            local_sum += x_val * w_val;
        }
    }

    // Warp reduction
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        local_sum += __shfl_down_sync(0xffffffff, local_sum, offset);
    }

    // Block reduction
    __shared__ float warp_sums[32];
    int warp_id = threadIdx.x / warpSize;
    int lane_id = threadIdx.x % warpSize;
    if (lane_id == 0) warp_sums[warp_id] = local_sum;
    __syncthreads();

    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int w = 0; w < (blockDim.x + warpSize - 1) / warpSize; w++) {
            total += warp_sums[w];
        }
        output[row] = __float2half(total * __half2float(scales[row]));
    }
}

Performance Characteristics

The codebook lookup introduces a fundamentally different memory access pattern compared to standard dequantization:

Standard INT4 dequantization (GPTQ/AWQ):
  - Read: 4 bits per weight, contiguous
  - Compute: shift, mask, multiply by scale
  - Memory pattern: sequential, coalesced
  - Arithmetic intensity: high (simple dequant)

AQLM codebook lookup:
  - Read: 8-16 bit index per group of 8 weights
  - Gather: codebook[index] (random access into codebook)
  - Add: sum M codebook entries
  - Memory pattern: random (codebook gather), non-coalesced
  - Arithmetic intensity: lower (random access bottleneck)

If codebook fits in L1/L2 cache:
  - Latency per lookup: ~30 cycles (L1 hit) or ~200 cycles (L2 hit)
  - M=2 codebooks, K=256, d=8, FP16: 256 * 8 * 2 = 4 KB per codebook
  - Total: 2 * 4 KB = 8 KB -> fits comfortably in 128 KB L1 per SM

If codebook does NOT fit in cache:
  - Latency per lookup: ~500-800 cycles (HBM access)
  - Performance degrades significantly
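The cache-fit condition can be checked per configuration (the 128 KB L1 figure is typical of recent NVIDIA SMs; treat it as an assumption for your target GPU, and the helper name as mine):

```python
def codebook_footprint_bytes(M, K, d, dtype_bytes=2):
    """Total FP16 codebook storage the kernel must keep hot."""
    return M * K * d * dtype_bytes

L1_BYTES = 128 * 1024  # assumed per-SM L1/shared capacity

for M, K, d in [(2, 256, 8), (1, 4096, 8), (1, 65536, 8)]:
    size = codebook_footprint_bytes(M, K, d)
    verdict = "fits in L1" if size <= L1_BYTES else "spills to L2/HBM"
    print(f"M={M}, K={K}, d={d}: {size} bytes -> {verdict}")
```

Note that the single-codebook 1x16 configuration (K = 65536) already exceeds L1, which is part of why kernel design differs across AQLM configurations.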

AQLM 2-bit vs GPTQ 4-bit Decode Throughput (Llama-2-7B, H100)

Method                      Weight memory         Decode throughput
FP16 (baseline)             13.5 GB               4,200 tokens/sec
GPTQ 4-bit                  3.5 GB                6,800 tokens/sec
AWQ 4-bit                   3.5 GB                7,100 tokens/sec
AQLM 2-bit (M=2, K=256)     2.0 GB + codebooks    5,900 tokens/sec
AQLM 3-bit (M=1, K=4096)    2.6 GB                6,200 tokens/sec
⚠️ Warning

AQLM at 2 bits is slower than GPTQ at 4 bits despite using less memory, because the codebook gather operation has lower throughput than the simple bit-shift dequantization used by GPTQ. The memory bandwidth savings from 2-bit vs 4-bit weights are partially offset by the codebook access overhead. AQLM wins on memory capacity (fitting larger models on fewer GPUs) but not on raw throughput per GPU.

Quality Comparison at Extreme Compression

2-Bit Quantization Results

# Perplexity comparison at 2 bits per weight
# All methods use 128 calibration samples from WikiText-2
# Evaluated on WikiText-2 test set

results = {
    "Llama-2-7B": {
        "FP16":         5.47,
        "GPTQ 2-bit":   12.8,   # Barely functional
        "RTN 2-bit":    142.0,  # Complete failure
        "AQLM 2-bit":   7.89,   # Usable
        "QuIP# 2-bit":  7.45,   # Best at 2-bit
    },
    "Llama-2-13B": {
        "FP16":         4.88,
        "GPTQ 2-bit":   9.21,
        "RTN 2-bit":    87.3,
        "AQLM 2-bit":   6.34,
        "QuIP# 2-bit":  6.02,
    },
    "Llama-2-70B": {
        "FP16":         3.32,
        "GPTQ 2-bit":   5.87,
        "RTN 2-bit":    28.4,
        "AQLM 2-bit":   4.21,
        "QuIP# 2-bit":  3.98,
    },
}
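The average degradation across the three model sizes follows directly from these perplexities (restated compactly here so the snippet stands alone):

```python
fp16 = {"7B": 5.47, "13B": 4.88, "70B": 3.32}
ppl_2bit = {
    "RTN":   {"7B": 142.0, "13B": 87.3, "70B": 28.4},
    "GPTQ":  {"7B": 12.8,  "13B": 9.21, "70B": 5.87},
    "AQLM":  {"7B": 7.89,  "13B": 6.34, "70B": 4.21},
    "QuIP#": {"7B": 7.45,  "13B": 6.02, "70B": 3.98},
}

avg_degradation = {
    method: sum(v[size] - fp16[size] for size in fp16) / len(fp16)
    for method, v in ppl_2bit.items()
}
for method, deg in avg_degradation.items():
    print(f"{method} 2-bit: +{deg:.2f} PPL on average")
```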

2-Bit Quantization Quality (WikiText-2 Perplexity, Lower is Better)

Method                          Llama-2-7B   Llama-2-13B   Llama-2-70B   Avg Degradation
FP16 (baseline)                 5.47         4.88          3.32          0
RTN 2-bit                       142.0        87.3          28.4          Destroyed
GPTQ 2-bit                      12.8         9.21          5.87          +4.7
AQLM 2-bit (M=2, K=256, d=8)    7.89         6.34          4.21          +1.6
QuIP# 2-bit                     7.45         6.02          3.98          +1.3
GPTQ 4-bit (reference)          5.62         4.97          3.41          +0.12

Quality vs Bits-Per-Weight Frontier

AQLM and QuIP# define the Pareto frontier for extreme compression. Below 3 bits, vector quantization methods significantly outperform scalar methods:

Bits per weight vs quality (Llama-2-70B, PPL):

4.0 bits: GPTQ=3.41, AWQ=3.39, AQLM=3.38  (all similar, near-lossless)
3.0 bits: GPTQ=3.78, AQLM=3.62, QuIP#=3.55 (VQ starts winning)
2.5 bits: GPTQ=4.45, AQLM=3.89, QuIP#=3.74 (VQ clearly better)
2.0 bits: GPTQ=5.87, AQLM=4.21, QuIP#=3.98 (VQ dramatically better)
1.5 bits: GPTQ=N/A,  AQLM=5.62, QuIP#=5.21 (only VQ methods work)

The crossover point where VQ methods become clearly superior is around 3 bits.
Below 3 bits, scalar quantization cannot represent the weight distribution with
enough fidelity, while VQ methods exploit inter-weight correlations.

Comparison with QuIP#

QuIP# (Quantization with Incoherence Processing, with lattice codebooks) is AQLM’s main competitor at extreme compression rates. The key differences:

# AQLM vs QuIP#: architectural differences

# AQLM:
# - Additive multi-codebook (learned codebooks)
# - Beam search assignment
# - Codebooks are optimized per-model via alternating minimization
# - Inference: gather from learned codebook tables

# QuIP#:
# - Lattice-based codebook (E8 lattice, mathematically optimal packing)
# - Incoherence processing: random Hadamard rotation of weights before quantization
# - Lattice codebooks are fixed (no per-model learning)
# - Inference: decode from lattice code (no table lookup needed)

# Key trade-off:
# QuIP# slightly better quality (lattice codes are near-optimal)
# AQLM slightly faster inference (simpler lookup, no Hadamard transforms)
# QuIP# much faster calibration (no iterative codebook optimization)
# AQLM more flexible (can trade off M, K, d for different bit rates)

def quip_sharp_pipeline(W, H_inv, num_bits=2):
    """QuIP# simplified pipeline."""
    # Step 1: Incoherence processing
    # Random orthogonal rotation to spread outlier channels
    Q = random_hadamard_matrix(W.shape[1])
    W_rotated = W @ Q

    # Step 2: Quantize with E8 lattice codebook
    # E8 lattice has 240 nearest neighbors, packs R^8 optimally
    W_quantized = e8_lattice_quantize(W_rotated, num_bits)

    # Step 3: The rotation Q must be applied at inference time
    # Y = (W @ Q) @ (Q^T @ X) = W_quantized @ (Q^T @ X)
    return W_quantized, Q.T

# QuIP# inference adds Q^T @ X at each layer (Hadamard transform)
# This is O(d * log(d)) per token, negligible for large hidden dims
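The incoherence-processing step can be illustrated in a few lines (a deterministic Hadamard stand-in for QuIP#'s randomized rotation; the outlier setup is synthetic):

```python
import numpy as np
from scipy.linalg import hadamard

d = 256
Q = hadamard(d) / np.sqrt(d)  # orthogonal: Q @ Q.T = I

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(64, d))
W[:, 7] *= 50.0               # plant an outlier channel
X = rng.normal(size=(d, 16))

# Rotation preserves the layer output exactly: (WQ)(Q^T X) = WX
Y_ref = W @ X
Y_rot = (W @ Q) @ (Q.T @ X)

def peak_ratio(A):
    """Max magnitude relative to std: large when outliers dominate."""
    return float(np.abs(A).max() / A.std())

# ...but the rotation spreads the outlier mass across all channels
print(peak_ratio(W), peak_ratio(W @ Q))  # rotated matrix is far less peaky
```

A flatter weight distribution is exactly what makes the subsequent lattice quantization accurate: no single channel forces an oversized scale.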
AQLM vs QuIP# Trade-offs

Property                          AQLM                        QuIP#
2-bit PPL (Llama-2-70B)           4.21                        3.98
Calibration time (70B, 8x A100)   16 hours                    2 hours
Inference overhead                Codebook lookup             Hadamard transform
Decode throughput (tokens/sec)    ~5900                       ~5600
Bit-rate flexibility              Continuous (vary M, K, d)   Discrete (lattice-dependent)
HuggingFace integration           transformers native         Separate library

Production Deployment Considerations

When to Use 2-Bit Quantization

# Decision criteria for extreme compression

use_2bit = (
    # The model does not fit at 4-bit on available hardware
    (model_size_4bit_gb > available_vram_gb)
    and
    # The model DOES fit at 2-bit
    (model_size_2bit_gb <= available_vram_gb)
    and
    # You cannot add more GPUs (cost or infrastructure constraint)
    (cannot_add_gpus)
    and
    # The quality degradation is acceptable for your use case
    (ppl_degradation_2bit < quality_threshold)
)

# Common scenario: running 70B model on a single 24GB GPU
# FP16: 140 GB -> needs 2x H100 or 8x RTX 4090
# 4-bit: 35 GB -> needs 2x RTX 4090
# 2-bit: 18 GB -> fits on 1x RTX 4090 with room for KV cache

# But consider: is 70B at 2-bit better than 13B at 4-bit?
# Llama-2-70B AQLM 2-bit PPL: 4.21
# Llama-2-13B GPTQ 4-bit PPL: 4.97
# The 70B at 2-bit wins! Larger models are more robust to compression.

HuggingFace Integration

# Loading AQLM models in HuggingFace transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pre-quantized AQLM models are available on HuggingFace Hub
model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf"
)

# Model configuration in the AQLM config:
# {
#   "nbits_per_codebook": 16,   # log2(K) per codebook
#   "num_codebooks": 1,          # M=1
#   "out_group_size": 1,         # groups along output dim
#   "in_group_size": 8           # d=8
# }
# Effective bits: 1 * 16 / 8 = 2.0 bits per weight

Summary

AQLM enables usable 2-bit quantization through additive multi-codebook vector quantization, achieving roughly 1.5 perplexity points better than GPTQ at the same bit rate on large models. The key insight is that groups of weights have correlated structure that vector quantization can exploit, while scalar quantization treats each weight independently and wastes bits. The cost is a more complex inference kernel (codebook lookup vs simple dequantization) and a much longer calibration process (hours vs minutes). For deployments where the model does not fit at 4-bit and adding GPUs is not an option, AQLM provides the best quality at extreme compression rates. For most production deployments where 4-bit quantization suffices, the simpler GPTQ or AWQ methods remain preferable.