INT4 quantization is mature and production-ready. But the fundamental question remains: how low can you go? At 4 bits, a 70B model fits on a single 80 GB GPU. At 2 bits, it would fit on a 24 GB RTX 4090. At 1.58 bits (ternary), it would require just 14 GB. At 1 bit (binary), just 9 GB. Each halving of precision opens new deployment targets — but at what quality cost? This post surveys the frontier of sub-4-bit quantization: the methods (QuIP#, AQLM, BitNet, HQQ), the theoretical limits, the quality-compression trade-offs, and the hardware that would make extreme quantization practical.
The Precision Spectrum
Model Size by Precision (70B Parameter Model)
| Precision | Bits/Weight | Model Size | Fits On | Quality (PPL delta) | Status (2025) |
|---|---|---|---|---|---|
| FP16 | 16 | 140 GB | 2x H100 (TP) | Baseline | Production |
| FP8 | 8 | 70 GB | 1x H100 | +0.1-0.3% | Production |
| INT4 (GPTQ/AWQ) | 4.2 | 37 GB | 1x A100-40GB | +3-8% | Production |
| 3-bit (AQLM) | 3.1 | 27 GB | 1x RTX 4090 (24GB) | +8-15% | Research |
| 2-bit (QuIP#) | 2.1 | 18 GB | 1x RTX 3090 (24GB) | +15-40% | Research |
| Ternary (BitNet 1.58b) | 1.58 | 14 GB | 1x RTX 3060 (12GB) | +5-25%* | Research |
| Binary (1-bit) | 1.0 | 9 GB | 1x RTX 3060 (12GB) | +40-100% | Experimental |

*BitNet's quality delta comes from models trained ternary from scratch, not from post-training quantization of an existing FP16 model, so it is not directly comparable to the rows above.
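The table's sizes are just parameter count times bits per weight; a quick sketch (weights only, ignoring KV cache and runtime overhead) reproduces them:

```python
def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage only; excludes KV cache and activation memory."""
    return num_params * bits_per_weight / 8 / 1e9

# Reproduce the table above for a 70B model
for bits in [16, 8, 4.2, 3.1, 2.1, 1.58, 1.0]:
    print(f"{bits:>5} bits -> {model_size_gb(70e9, bits):6.1f} GB")
```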
[Figure: model size compression vs FP16, in GB (70B model)]

3-Bit Quantization: AQLM and QuIP#
AQLM (Additive Quantization for Language Models)
AQLM uses vector quantization: instead of quantizing individual weights, it groups weights into vectors and maps each vector to its nearest entry in a learned codebook.
# AQLM quantization concept
# Instead of: weight -> scale * int4_value
# AQLM uses: weight_vector -> codebook[index]
import torch

def kmeans(x, k, max_iter=100):
    """Minimal Lloyd's-algorithm k-means, sufficient for this sketch."""
    centers = x[torch.randperm(len(x))[:k]].clone()
    for _ in range(max_iter):
        assignments = torch.cdist(x, centers).argmin(dim=1)
        for c in range(k):
            members = x[assignments == c]
            if len(members) > 0:
                centers[c] = members.mean(dim=0)
    return centers, assignments

class AQLMQuantizer:
    def __init__(self, codebook_size=256, vector_dim=8):
        """
        codebook_size: 256 entries = 8-bit index per vector
        vector_dim: 8 weights per vector
        Effective bits per weight: 8 / 8 = 1.0 bits
        With 2 codebooks: 16 / 8 = 2.0 bits
        With 3 codebooks: 24 / 8 = 3.0 bits
        """
        self.codebook_size = codebook_size
        self.vector_dim = vector_dim
        self.codebook = None  # Learned during calibration

    def learn_codebook(self, weight_matrix):
        """Learn codebook entries using k-means on weight vectors."""
        K, N = weight_matrix.shape
        # Reshape into vectors
        num_vectors = (K * N) // self.vector_dim
        vectors = weight_matrix.reshape(num_vectors, self.vector_dim)
        # K-means clustering
        codebook, assignments = kmeans(vectors, self.codebook_size, max_iter=100)
        self.codebook = codebook  # [codebook_size, vector_dim]
        return assignments        # [num_vectors] indices into codebook

    def dequantize(self, indices, shape):
        """Reconstruct weight matrix from codebook indices."""
        vectors = self.codebook[indices]  # [num_vectors, vector_dim]
        return vectors.reshape(shape)

    def effective_bits(self, num_codebooks=2):
        """Calculate effective bits per weight."""
        import math
        index_bits = math.log2(self.codebook_size)  # 8 bits for 256 entries
        return (num_codebooks * index_bits) / self.vector_dim
AQLM with 2 codebooks achieves approximately 2 bits per weight. With 3 codebooks, it achieves 3 bits. The quality is significantly better than naive round-to-nearest at the same bit width because the codebook entries are optimized to minimize reconstruction error.
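The bits-per-weight accounting, as a standalone helper (scale and codebook storage overhead is ignored here, which is why real deployments land slightly above these figures):

```python
import math

def aqlm_effective_bits(codebook_size: int, vector_dim: int, num_codebooks: int) -> float:
    """Index bits per vector divided by weights per vector."""
    return num_codebooks * math.log2(codebook_size) / vector_dim

print(aqlm_effective_bits(256, 8, 1))  # 1.0
print(aqlm_effective_bits(256, 8, 2))  # 2.0
print(aqlm_effective_bits(256, 8, 3))  # 3.0
```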
QuIP# (Quantization with Incoherence Processing)
QuIP# applies a random orthogonal rotation to the weight matrix before quantization. This “incoherence processing” spreads outlier energy across all weights, making the weight distribution more uniform and easier to quantize at low bit widths.
import math
import torch

def hadamard_transform(x, dim):
    """Orthonormal fast Walsh-Hadamard transform along `dim` (power-of-2 size)."""
    n = x.shape[dim]
    def _fwht(t):
        if t.shape[dim] == 1:
            return t
        a, b = t.chunk(2, dim=dim)
        return torch.cat([_fwht(a + b), _fwht(a - b)], dim=dim)
    return _fwht(x) / math.sqrt(n)

def quip_incoherence_transform(weight):
    """Apply incoherence processing via random Hadamard rotation.
    The key insight: if W has outlier columns, H @ W @ H^T
    (where H is a random orthogonal matrix) spreads those outliers
    across all columns, reducing the max-to-mean ratio.
    """
    K, N = weight.shape
    # Random sign flips turn the fixed Walsh-Hadamard transform
    # (O(N log N)) into a randomized orthogonal rotation
    signs_k = (torch.randint(0, 2, (K,)) * 2 - 1).float()
    signs_n = (torch.randint(0, 2, (N,)) * 2 - 1).float()
    # Apply: W_rotated = H_k @ diag(signs_k) @ W @ diag(signs_n) @ H_n^T
    w_rotated = weight * signs_k.unsqueeze(1)
    w_rotated = hadamard_transform(w_rotated, dim=0)
    w_rotated = w_rotated * signs_n.unsqueeze(0)
    w_rotated = hadamard_transform(w_rotated, dim=1)
    return w_rotated, signs_k, signs_n

def quip_quantize(weight, bits=2, lattice='E8'):
    """Quantize using QuIP# with a lattice codebook (sketch; the lattice
    and rounding routines below are assumed helpers, not defined here)."""
    # Step 1: Incoherence transform
    w_rotated, signs_k, signs_n = quip_incoherence_transform(weight)
    # Step 2: Quantize to lattice points
    # E8 lattice: 8-dimensional lattice in which each point has 240
    # nearest neighbors; each 8-dimensional vector maps to the nearest
    # E8 lattice point. Encoding: ~1-2 bits per dimension
    if lattice == 'E8':
        quantized = e8_lattice_quantize(w_rotated, bits)  # assumed helper
    else:
        quantized = round_to_nearest(w_rotated, bits)     # assumed helper
    return quantized, signs_k, signs_n
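The outlier-spreading effect is easy to verify numerically. This NumPy sketch (with an explicit iterative Walsh-Hadamard transform, mirroring the sign-flip-then-rotate step above) plants one outlier in a Gaussian vector and shows the max-to-mean ratio collapsing after the randomized rotation:

```python
import numpy as np

def fwht(x):
    """Iterative fast Walsh-Hadamard transform, orthonormal scaling."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

rng = np.random.default_rng(0)
n = 256
w = rng.normal(0.0, 0.01, n)
w[7] = 1.0                             # plant a single large outlier
signs = rng.integers(0, 2, n) * 2 - 1  # random +-1 sign flips
w_rot = fwht(w * signs)

before = np.abs(w).max() / np.abs(w).mean()
after = np.abs(w_rot).max() / np.abs(w_rot).mean()
print(f"max/mean before: {before:.1f}, after: {after:.1f}")
```

The rotation is orthonormal, so the total energy is unchanged; only its distribution across coordinates flattens, which is exactly what a low-bit quantizer needs.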
A lattice codebook in d dimensions has exponentially more entries than d independent scalar codebooks. The E8 lattice is the densest lattice packing in 8 dimensions (each point has 240 nearest neighbors), and it yields codewords at ~1 bit/dimension with better rate-distortion trade-offs than per-element quantization. This is the mathematical reason QuIP# outperforms naive 2-bit quantization by a wide margin.
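For intuition, finding the nearest E8 point is cheap because E8 decomposes as D8 ∪ (D8 + ½·1), where D8 is the set of integer vectors with even coordinate sum. The decoding trick below is a standard lattice construction and is illustrative, not QuIP#'s actual encoder:

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8 = {integer vectors with even coordinate sum}."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Fix parity: flip the coordinate where rounding cost the most,
        # moving in the cheaper direction
        i = np.argmax(np.abs(x - f))
        f[i] += 1.0 if x[i] >= f[i] else -1.0
    return f

def nearest_e8(x):
    """Nearest E8 lattice point: the better of D8 and the shifted copy D8 + 1/2."""
    half = np.full(8, 0.5)
    a = nearest_d8(x)
    b = nearest_d8(x - half) + half
    return a if np.sum((x - a)**2) <= np.sum((x - b)**2) else b

print(nearest_e8(np.zeros(8)))       # the origin
print(nearest_e8(np.full(8, 0.45)))  # snaps to the half-integer point
```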
3-Bit and 2-Bit Quality Comparison (Llama 2 70B)
| Method | Effective Bits | WikiText PPL | Delta vs FP16 | MMLU Acc | Quantization Time |
|---|---|---|---|---|---|
| FP16 (baseline) | 16.0 | 3.32 | 0% | 68.9% | N/A |
| GPTQ INT4 g128 | 4.2 | 3.58 | +7.8% | 63.2% | 4 hours |
| AQLM 3-bit (3 codebooks) | 3.0 | 3.72 | +12.0% | 60.8% | 12 hours |
| AQLM 2-bit (2 codebooks) | 2.0 | 4.45 | +34.0% | 54.2% | 12 hours |
| QuIP# 3-bit | 3.0 | 3.65 | +9.9% | 61.5% | 8 hours |
| QuIP# 2-bit | 2.0 | 4.15 | +25.0% | 56.8% | 8 hours |
| Naive RTN 2-bit | 2.0 | 12.4 | +273% | 28.1% | 10 min |
Ternary Networks: BitNet 1.58b
BitNet 1.58b restricts weights to three values: {-1, 0, +1}. This is 1.58 bits per weight (log2 3 ≈ 1.585). The critical difference from post-training quantization: BitNet trains from scratch with ternary constraints.
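The 1.58 figure is just log2(3), and practical packed storage gets close to it, since five base-3 digits fit in one byte:

```python
import math

print(math.log2(3))   # 1.584962500721156 bits of information per ternary weight
assert 3**5 <= 2**8   # 243 <= 256: five ternary weights pack into one byte
print(8 / 5)          # 1.6 bits per weight with base-3 byte packing
```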
BitNet Architecture
class BitLinear(torch.nn.Module):
    """BitNet 1.58b linear layer with ternary weights."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # Latent full-precision weights; ternarized to {-1, 0, +1} on the fly
        self.weight = torch.nn.Parameter(
            torch.randint(-1, 2, (out_features, in_features)).float()
        )
        self.scale = torch.nn.Parameter(torch.ones(1))

    def ternarize_weight(self):
        """Quantize weights to {-1, 0, +1} during the forward pass."""
        # Absmean quantization
        gamma = self.weight.abs().mean()
        w_ternary = torch.sign(self.weight) * (self.weight.abs() > gamma * 0.5).float()
        return w_ternary, gamma

    def forward(self, x):
        # Quantize activations to INT8 (during training, the rounding would
        # use a straight-through estimator so gradients can flow)
        x_scale = 127.0 / x.abs().max().clamp(min=1e-5)
        x_quant = (x * x_scale).round().clamp(-128, 127)
        # Ternarize weights
        w_ternary, w_scale = self.ternarize_weight()
        # Matrix multiply: only additions and subtractions!
        # y = x_quant @ w_ternary.T
        # This is NOT a multiply -- it's: add where w=+1, subtract where w=-1, skip where w=0
        y = torch.nn.functional.linear(x_quant, w_ternary)
        # Rescale
        y = y * (w_scale / x_scale) * self.scale
        return y
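The absmean rule can be checked in isolation (a NumPy sketch mirroring `ternarize_weight` above):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
gamma = np.abs(w).mean()                               # absmean scale
w_ternary = np.sign(w) * (np.abs(w) > 0.5 * gamma)

print(sorted(np.unique(w_ternary)))                    # values from {-1, 0, +1}
print(round(float((w_ternary == 0).mean()), 2))        # roughly a third of weights zero out
```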
Compute Implications
The multiplication in a ternary matrix multiply degenerates to addition and subtraction:

y_i = Σ_{k : W_ik = +1} x_k − Σ_{k : W_ik = −1} x_k
This eliminates the need for multiply-accumulate (MAC) units entirely. A ternary GEMM requires only integer additions, which are cheaper and faster than FP16 multiplications.
// Ternary GEMM kernel: no multiplies, only add/subtract
__global__ void ternary_gemv(
    const int8_t* __restrict__ x,          // [K] input activations
    const int8_t* __restrict__ w_ternary,  // [N, K] ternary weights {-1, 0, +1}
    int32_t* __restrict__ y,               // [N] output
    int K, int N
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;
    int32_t acc = 0;
    for (int k = 0; k < K; k++) {
        int8_t w = w_ternary[row * K + k];
        if (w == 1) {
            acc += x[k];  // Addition only
        } else if (w == -1) {
            acc -= x[k];  // Subtraction only
        }
        // w == 0: skip (sparsity)
    }
    y[row] = acc;
}

// Optimized: pack ternary as 2-bit and use bit manipulation
// 00 = 0, 01 = +1, 11 = -1
__global__ void ternary_gemv_packed(
    const int8_t* __restrict__ x,
    const uint32_t* __restrict__ w_packed,  // 16 ternary values per uint32
    int32_t* __restrict__ y,
    int K, int N
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;
    int32_t acc = 0;
    for (int k_packed = 0; k_packed < K / 16; k_packed++) {
        uint32_t packed = w_packed[row * (K/16) + k_packed];
        #pragma unroll
        for (int i = 0; i < 16; i++) {
            uint8_t bits = (packed >> (i * 2)) & 0x3;
            int k = k_packed * 16 + i;
            if (bits == 0x1) acc += x[k];       // +1
            else if (bits == 0x3) acc -= x[k];  // -1
            // bits == 0x0: zero, skip
        }
    }
    y[row] = acc;
}
[Figure: theoretical compute cost per token, in TOPS required (Llama 70B equivalent)]

Binary Neural Networks
Binary networks restrict weights to {-1, +1} (1 bit per weight). The matrix multiply becomes an XNOR operation followed by a population count (popcount):

x · w = 2 · popcount(xnor(x_bits, w_bits)) − K

where xnor sets a bit wherever the two sign bits match and K is the vector dimension.
// Binary GEMV: XNOR + popcount
// Pack 32 binary weights into a single uint32
__global__ void binary_gemv(
    const uint32_t* __restrict__ x_packed,  // [K/32] binary activations
    const uint32_t* __restrict__ w_packed,  // [N, K/32] binary weights
    int32_t* __restrict__ y,
    int K, int N
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;
    int32_t acc = 0;
    for (int k32 = 0; k32 < K / 32; k32++) {
        uint32_t xb = x_packed[k32];
        uint32_t wb = w_packed[row * (K/32) + k32];
        uint32_t xnor = ~(xb ^ wb);  // XNOR: 1 where bits match
        acc += __popc(xnor);         // Count matching bits
    }
    // Convert popcount to signed dot product:
    // dot = 2 * popcount - K
    y[row] = 2 * acc - K;
}
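The popcount identity the kernel relies on can be verified in pure Python, encoding +1 as bit 1 and −1 as bit 0:

```python
import random

random.seed(0)
K = 32
xb = random.getrandbits(K)
wb = random.getrandbits(K)

# Decode the bit patterns back into +-1 vectors
x = [1 if (xb >> i) & 1 else -1 for i in range(K)]
w = [1 if (wb >> i) & 1 else -1 for i in range(K)]

xnor = ~(xb ^ wb) & ((1 << K) - 1)  # 1 where the sign bits agree
dot = 2 * bin(xnor).count("1") - K  # matches minus mismatches
assert dot == sum(a * b for a, b in zip(x, w))
print(dot)
```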
Binary networks are theoretically 32x more memory-efficient than FP32 and require no multiplications. But quality degradation for LLMs is severe — no post-training binary quantization of a pretrained LLM produces usable output. Binary LLMs must be trained from scratch, and even then, quality lags significantly behind FP16 models of the same parameter count.
As of 2025, no binary (1-bit) LLM achieves competitive quality on standard benchmarks. The information capacity of 1 bit per weight is fundamentally limited by the rate-distortion bound. A 70B binary model stores 70 billion bits = 8.75 GB of information, equivalent to a 4.4B FP16 model. The parameter count is misleading — it is the total information content that determines capability.
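The arithmetic behind that equivalence:

```python
num_params = 70e9
binary_bits = num_params * 1.0           # 1 bit per weight
binary_info_gb = binary_bits / 8 / 1e9   # total information, in GB
fp16_equiv_params = binary_bits / 16     # 16 bits per FP16 parameter
print(f"{binary_info_gb} GB, equivalent to a {fp16_equiv_params / 1e9:.2f}B FP16 model")
```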
Information-Theoretic Limits
The fundamental question: what is the minimum number of bits needed to represent a neural network without quality loss?
The rate-distortion bound from information theory states that quantizing a Gaussian source to mean squared error D requires at least

R(D) = (1/2) · log2(σ_w² / D) bits per weight,

where σ_w² is the variance of the weight distribution and D is the allowed distortion (mean squared error). For a typical LLM weight distribution with σ_w ≈ 0.01:
import math

def rate_distortion_bound(sigma_w, target_ppl_increase_pct):
    """Estimate minimum bits per weight for a given quality target.
    This is a rough estimate -- the actual minimum depends on the
    specific model's sensitivity to weight perturbation.
    """
    # Map PPL increase to allowed MSE (empirical relationship)
    # ~1% PPL increase corresponds to ~1e-6 MSE for large models
    allowed_mse = (target_ppl_increase_pct / 100) * 1e-6
    if allowed_mse <= 0 or allowed_mse >= sigma_w**2:
        return float('inf') if allowed_mse <= 0 else 0
    rate = 0.5 * math.log2(sigma_w**2 / allowed_mse)
    return rate

# For various quality targets
sigma_w = 0.01  # Typical weight std dev
for ppl_delta in [1, 5, 10, 20, 50]:
    min_bits = rate_distortion_bound(sigma_w, ppl_delta)
    print(f"PPL increase {ppl_delta}%: minimum ~{min_bits:.1f} bits/weight")

# PPL increase 1%: minimum ~6.6 bits/weight
# PPL increase 5%: minimum ~5.5 bits/weight
# PPL increase 10%: minimum ~5.0 bits/weight
# PPL increase 20%: minimum ~4.5 bits/weight
# PPL increase 50%: minimum ~3.8 bits/weight
Theoretical vs Achieved Compression (70B Model)
| Quality Target (PPL delta) | Rate-Distortion Bound | Best Achieved (2025) | Method | Gap to Bound |
|---|---|---|---|---|
| +1% | ~6.6 bits | 8.0 bits (FP8) | FP8 per-channel | 1.4 bits |
| +5% | ~5.5 bits | 4.2 bits (AWQ INT4) | AWQ g128 | Below bound* |
| +10% | ~5.0 bits | 3.0 bits (QuIP#) | QuIP# 3-bit | Below bound* |
| +25% | ~4.3 bits | 2.0 bits (QuIP#) | QuIP# 2-bit | Below bound* |
| +50% | ~3.8 bits | 1.58 bits (BitNet) | BitNet from scratch | Below bound* |

*"Below bound" does not violate information theory: the estimate above assumes i.i.d. Gaussian weights and a crude PPL-to-MSE mapping. Real methods exploit weight structure (and BitNet retrains entirely), so the simple bound is loose.
Hardware for Sub-4-Bit Inference
Current GPUs are not optimized for sub-4-bit datatypes. Tensor Cores support FP16, BF16, FP8, and (on Blackwell) FP4, but not 3-bit, 2-bit, or ternary. Efficient sub-4-bit inference requires specialized hardware.
Ternary Processing Units
A ternary multiply unit replaces the multiplier with a multiplexer:
FP16 Multiply: 16-bit * 16-bit → 32-bit (hundreds of gates)
INT4 Multiply: 4-bit * 4-bit → 8-bit (tens of gates)
Ternary "Multiply": select(+x, 0, -x) (2 gates: mux + negate)
Binary "Multiply": XNOR (1 gate)
The area and energy savings are dramatic:
Compute Unit Comparison by Precision
| Operation | Gate Count | Energy (pJ) | Area (um^2) | Relative to FP16 |
|---|---|---|---|---|
| FP16 FMA | ~4000 | ~3.7 | ~2500 | 1.0x (baseline) |
| INT8 MAC | ~600 | ~0.9 | ~400 | 0.15x |
| INT4 MAC | ~200 | ~0.3 | ~150 | 0.05x |
| Ternary (mux) | ~10 | ~0.02 | ~8 | 0.003x |
| Binary (XNOR) | ~2 | ~0.005 | ~2 | 0.0005x |
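Taking the per-op energies in the table at face value, a rough compute-only energy budget per decoded token follows (one op per weight; memory traffic, which usually dominates at batch 1, is deliberately ignored):

```python
NUM_PARAMS = 70e9  # roughly one MAC (or add/sub) per weight per decoded token

def mj_per_token(pj_per_op: float, num_params: float = NUM_PARAMS) -> float:
    """Compute-only energy per token, in millijoules."""
    return num_params * pj_per_op * 1e-12 * 1e3

for name, pj in [("FP16 FMA", 3.7), ("INT8 MAC", 0.9), ("INT4 MAC", 0.3),
                 ("Ternary mux", 0.02), ("Binary XNOR", 0.005)]:
    print(f"{name:12s} ~{mj_per_token(pj):7.2f} mJ/token")
```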
Software Emulation on GPUs
Without native hardware support, sub-4-bit inference on GPUs requires software emulation:
def sub4bit_gemv_throughput(bits, gpu_mem_bw_gb_s=3350, gpu_tops=990):
    """Estimate throughput for sub-4-bit GEMV on H100.
    Sub-4-bit GEMV is memory-bandwidth bound (batch=1).
    Throughput scales with the compression ratio.
    """
    # FP16 baseline: 2 bytes per weight
    fp16_bytes_per_weight = 2.0
    sub4_bytes_per_weight = bits / 8.0
    # Memory-BW-bound throughput
    # For a 70B model, a single decode step reads all weights
    model_params = 70e9
    fp16_model_bytes = model_params * fp16_bytes_per_weight
    sub4_model_bytes = model_params * sub4_bytes_per_weight
    fp16_time_ms = (fp16_model_bytes / (gpu_mem_bw_gb_s * 1e9)) * 1000
    sub4_time_ms = (sub4_model_bytes / (gpu_mem_bw_gb_s * 1e9)) * 1000
    # Add dequantization overhead (~20% for sub-4-bit)
    sub4_time_ms *= 1.2
    fp16_tok_s = 1000 / fp16_time_ms
    sub4_tok_s = 1000 / sub4_time_ms
    return {
        'fp16_ms': fp16_time_ms,
        'sub4_ms': sub4_time_ms,
        'fp16_tok_s': fp16_tok_s,
        'sub4_tok_s': sub4_tok_s,
        'speedup': fp16_time_ms / sub4_time_ms,
    }

for bits in [4, 3, 2, 1.58, 1]:
    r = sub4bit_gemv_throughput(bits)
    print(f"{bits:.2f} bits: {r['sub4_tok_s']:.0f} tok/s "
          f"({r['speedup']:.1f}x vs FP16)")
[Figure: theoretical decode speed by bit width, in tokens/sec (70B model, H100, batch=1)]

HQQ: Half-Quadratic Quantization
HQQ is a recent method that achieves competitive sub-4-bit quality without requiring calibration data:
def hqq_quantize(weight, bits=2, group_size=64, num_iters=20):
    """Half-Quadratic Quantization: no calibration data needed.
    Optimizes: min ||W - dequant(Q)||^2
    using half-quadratic splitting with proximal gradient.
    """
    K, N = weight.shape
    num_groups = K // group_size
    num_levels = 2**bits - 1
    for g in range(num_groups):
        group = weight[g*group_size:(g+1)*group_size, :].clone()
        # Initialize scale and zero-point (dequant is scale * (q - zero))
        gmin = group.min(dim=0).values
        gmax = group.max(dim=0).values
        scale = ((gmax - gmin) / num_levels).clamp(min=1e-8)
        zero = -gmin / scale
        # Iterative refinement (half-quadratic splitting)
        for iteration in range(num_iters):
            # Step 1: Quantize with current scale/zero
            q = torch.round(group / scale + zero).clamp(0, num_levels)
            # Step 2: Optimize scale and zero to minimize MSE
            # Least-squares fit of W ~= scale * (q - zero), per output channel
            q_float = q.float()
            numerator = (group * q_float).sum(dim=0) - group.sum(dim=0) * q_float.mean(dim=0)
            denominator = (q_float**2).sum(dim=0) - q_float.sum(dim=0) * q_float.mean(dim=0)
            scale = numerator / denominator.clamp(min=1e-10)
            # From W ~= scale*q - scale*zero: zero = (scale*sum(q) - sum(W)) / (n*scale)
            zero = (scale * q_float.sum(dim=0) - group.sum(dim=0)) / (group_size * scale)
        # Store the final dequantized reconstruction
        weight[g*group_size:(g+1)*group_size, :] = scale * (q - zero)
    return weight
Sub-4-Bit Methods Quality Comparison (Llama 2 13B, WikiText-2 PPL)
| Method | 2-bit PPL | 3-bit PPL | 4-bit PPL | Needs Calibration | Speed |
|---|---|---|---|---|---|
| RTN (round to nearest) | 44.2 | 8.12 | 5.68 | No | Fast (minutes) |
| GPTQ | 18.5 | 5.82 | 5.28 | Yes (128 samples) | Slow (hours) |
| QuIP# | 6.84 | 5.45 | 5.15 | Yes | Very slow |
| AQLM | 7.12 | 5.52 | 5.18 | Yes | Very slow |
| HQQ | 8.45 | 5.65 | 5.22 | No | Fast (minutes) |
| FP16 baseline | - | - | 5.09 | N/A | N/A |
When Sub-4-Bit Is Viable
Sub-4-bit quantization is appropriate in specific scenarios:
# Decision framework for sub-4-bit quantization
def should_use_sub4bit(
    model_size_b,          # raw parameter count, e.g. 70e9
    target_gpu_vram_gb,
    quality_tolerance_pct,
    use_case
):
    """Determine if sub-4-bit quantization is appropriate."""
    fp16_size_gb = model_size_b * 2 / 1e9
    int4_size_gb = model_size_b * 0.5 / 1e9 * 1.1  # +10% overhead
    if int4_size_gb <= target_gpu_vram_gb * 0.7:
        return "INT4 fits comfortably. Use INT4 (GPTQ/AWQ)."
    if quality_tolerance_pct < 10:
        return "Quality tolerance too tight for sub-4-bit. Use INT4 with TP."
    # Bits that fit in 70% of VRAM: GB * 8 Gbit/GB / (params in billions)
    bits_needed = target_gpu_vram_gb * 0.7 * 8 / (model_size_b / 1e9)
    if bits_needed < 1.5:
        return "Model too large even at 1.58 bits. Need multiple GPUs."
    if use_case == "chat":
        return f"Try {bits_needed:.1f}-bit (QuIP# or HQQ). Chat tolerates more error."
    elif use_case == "code":
        return "Code generation is sensitive. Consider INT4 with TP instead."
    elif use_case == "summarization":
        return f"Try {bits_needed:.1f}-bit. Summarization is moderately tolerant."
    return f"Evaluate {bits_needed:.1f}-bit with your specific benchmark."
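The VRAM arithmetic at the heart of the framework, isolated (the 0.7 usable-VRAM fraction is the same assumption as above):

```python
def bits_that_fit(num_params: float, vram_gb: float, budget_frac: float = 0.7) -> float:
    """Max bits/weight such that quantized weights fit in budget_frac of VRAM."""
    return vram_gb * budget_frac * 8e9 / num_params

print(round(bits_that_fit(70e9, 24), 2))  # RTX 4090: 1.92, ternary territory
print(round(bits_that_fit(70e9, 80), 2))  # H100: 6.4, INT4 fits easily
```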
A 70B model at 2 bits stores 140 gigabits (17.5 GB) of raw information, about what a 9B FP16 model stores under the accounting above. In many benchmarks, a 13B FP16 model actually outperforms the 70B 2-bit model because its parameters are better preserved. Before using sub-4-bit quantization on a large model, benchmark against a smaller model at higher precision.
Summary
Sub-4-bit quantization is an active research frontier. At 3 bits, QuIP# and AQLM maintain usable quality with 20-30% model size reduction versus INT4. At 2 bits, quality degradation becomes significant for demanding tasks but may be acceptable for casual chat applications. Ternary networks (1.58 bits) require training from scratch — BitNet 1.58b demonstrates that ternary training is viable, but no ternary model matches FP16 quality at the same effective information content. Binary (1-bit) networks remain impractical for LLMs. The fundamental constraint is information-theoretic: below ~4 bits, you are discarding meaningful information about the weight distribution, and no algorithm can fully recover what was lost. Hardware support for sub-4-bit datatypes would dramatically improve inference speed, but commercial chips are unlikely before 2027.