Part of Series: Quantization Masterclass (29 of 30)

Future of Quantization: Sub-4-Bit, Ternary, and Binary Neural Networks

INT4 quantization is mature and production-ready. But the fundamental question remains: how low can you go? At 4 bits, a 70B model fits on a single 80 GB GPU. At 2 bits, it would fit on a 24 GB RTX 4090. At 1.58 bits (ternary), it would require just 14 GB. At 1 bit (binary), just 9 GB. Each halving of precision opens new deployment targets — but at what quality cost? This post surveys the frontier of sub-4-bit quantization: the methods (QuIP#, AQLM, BitNet, HQQ), the theoretical limits, the quality-compression trade-offs, and the hardware that would make extreme quantization practical.

The Precision Spectrum

📊

Model Size by Precision (70B Parameter Model)

| Precision | Bits/Weight | Model Size | Fits On | Quality (PPL delta) | Status (2025) |
|---|---|---|---|---|---|
| FP16 | 16 | 140 GB | 2x H100 (TP) | Baseline | Production |
| FP8 | 8 | 70 GB | 1x H100 | +0.1-0.3% | Production |
| INT4 (GPTQ/AWQ) | 4.2 | 37 GB | 1x A100-40GB | +3-8% | Production |
| 3-bit (AQLM) | 3.1 | 27 GB | 1x RTX 4090 (24GB) | +8-15% | Research |
| 2-bit (QuIP#) | 2.1 | 18 GB | 1x RTX 3090 (24GB) | +15-40% | Research |
| Ternary (BitNet 1.58b) | 1.58 | 14 GB | 1x RTX 3060 (12GB) | +5-25%* | Research |
| Binary (1-bit) | 1.0 | 9 GB | 1x RTX 3060 (12GB) | +40-100% | Experimental |
Note: *BitNet 1.58b quality depends on training from scratch with ternary weights, not post-training quantization. Post-training 1.58-bit quantization degrades quality severely.

Model Size Compression vs FP16 (70B Model)

FP16: 140 GB · INT8: 70 GB · INT4: 37 GB · 3-bit: 27 GB · 2-bit: 18 GB · 1.58-bit (ternary): 14 GB · 1-bit (binary): 9 GB

3-Bit and 2-Bit Quantization: AQLM and QuIP#

AQLM (Additive Quantization for Language Models)

AQLM uses vector quantization: instead of quantizing individual weights, it groups weights into vectors and maps each vector to its nearest entry in a learned codebook.

# AQLM quantization concept
# Instead of: weight -> scale * int4_value
# AQLM uses: weight_vector -> codebook[index]

import math
import torch

def kmeans(vectors, k, max_iter=100):
    """Minimal k-means used to learn the codebook (illustrative)."""
    # Initialize centroids from randomly chosen vectors
    centroids = vectors[torch.randperm(vectors.shape[0])[:k]].clone()
    for _ in range(max_iter):
        # Assign each vector to its nearest centroid
        assignments = torch.cdist(vectors, centroids).argmin(dim=1)
        # Update centroids as cluster means
        for c in range(k):
            mask = assignments == c
            if mask.any():
                centroids[c] = vectors[mask].mean(dim=0)
    return centroids, assignments

class AQLMQuantizer:
    def __init__(self, codebook_size=256, vector_dim=8):
        """
        codebook_size: 256 entries = 8-bit index per vector
        vector_dim: 8 weights per vector
        Effective bits per weight: 8 / 8 = 1.0 bits
        With 2 codebooks: 16 / 8 = 2.0 bits
        With 3 codebooks: 24 / 8 = 3.0 bits
        """
        self.codebook_size = codebook_size
        self.vector_dim = vector_dim
        self.codebook = None  # Learned during calibration

    def learn_codebook(self, weight_matrix):
        """Learn codebook entries using k-means on weight vectors."""
        K, N = weight_matrix.shape
        # Reshape into vectors (K * N must be divisible by vector_dim)
        num_vectors = (K * N) // self.vector_dim
        vectors = weight_matrix.reshape(num_vectors, self.vector_dim)

        # K-means clustering
        codebook, assignments = kmeans(
            vectors, self.codebook_size, max_iter=100
        )
        self.codebook = codebook  # [codebook_size, vector_dim]
        return assignments  # [num_vectors] indices into codebook

    def dequantize(self, indices, shape):
        """Reconstruct weight matrix from codebook indices."""
        vectors = self.codebook[indices]  # [num_vectors, vector_dim]
        return vectors.reshape(shape)

    def effective_bits(self, num_codebooks=2):
        """Calculate effective bits per weight."""
        index_bits = math.log2(self.codebook_size)  # 8 bits for 256 entries
        return (num_codebooks * index_bits) / self.vector_dim

AQLM with 2 codebooks achieves approximately 2 bits per weight. With 3 codebooks, it achieves 3 bits. The quality is significantly better than naive round-to-nearest at the same bit width because the codebook entries are optimized to minimize reconstruction error.
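The codebook advantage over scalar rounding is easy to check numerically. The toy sketch below (illustrative only; it uses plain k-means, not AQLM's calibrated objective or additive codebooks) quantizes the same Gaussian weights two ways at the same 2 bits per weight and compares reconstruction MSE:

```python
import torch

torch.manual_seed(0)
sigma = 0.01
w = torch.randn(8192, 4) * sigma  # 8192 vectors of 4 weights each

# Naive 2-bit RTN: absmax scaling, levels {-2, -1, 0, 1} * scale
scale = w.abs().max() / 2
w_rtn = (w / scale).round().clamp(-2, 1) * scale
mse_rtn = ((w - w_rtn) ** 2).mean()

# Vector quantization at the same rate: a 256-entry codebook over
# 4-dim vectors costs 8 index bits / 4 weights = 2 bits per weight
k = 256
codebook = w[torch.randperm(w.shape[0])[:k]].clone()
for _ in range(25):  # plain k-means on the weight vectors
    assign = torch.cdist(w, codebook).argmin(dim=1)
    for c in range(k):
        mask = assign == c
        if mask.any():
            codebook[c] = w[mask].mean(dim=0)
assign = torch.cdist(w, codebook).argmin(dim=1)
w_vq = codebook[assign]
mse_vq = ((w - w_vq) ** 2).mean()

print(f"RTN 2-bit MSE: {mse_rtn:.3e}")
print(f"VQ  2-bit MSE: {mse_vq:.3e}  (lower at the same bit rate)")
```

Vector quantization wins because the codebook can place its 256 entries wherever the weight mass actually is, while scalar RTN is locked to a uniform grid.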

QuIP# (Quantization with Incoherence Processing)

QuIP# applies a random orthogonal rotation to the weight matrix before quantization. This “incoherence processing” spreads outlier energy across all weights, making the weight distribution more uniform and easier to quantize at low bit widths.

import torch

def hadamard_transform(x, dim):
    """Normalized fast Walsh-Hadamard transform along `dim`.

    The size along `dim` must be a power of two. O(N log N) butterflies.
    """
    x = x.transpose(dim, -1).contiguous()
    shape = x.shape
    n = shape[-1]
    x = x.reshape(-1, n)
    h = 1
    while h < n:
        x = x.view(-1, n // (2 * h), 2, h)
        a, b = x[:, :, 0, :], x[:, :, 1, :]
        x = torch.cat((a + b, a - b), dim=-1).view(-1, n)
        h *= 2
    return (x / n ** 0.5).reshape(shape).transpose(dim, -1)

def round_to_nearest(w, bits):
    """Symmetric absmax round-to-nearest baseline."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def quip_incoherence_transform(weight):
    """Apply incoherence processing via random Hadamard rotation.

    The key insight: if W has outlier columns, H_k @ W @ H_n^T
    (where H_k, H_n are random orthogonal matrices) spreads those
    outliers across all columns, reducing the max-to-mean ratio.
    Both dimensions must be powers of two (real implementations pad).
    """
    K, N = weight.shape

    # Random sign flips turn the fixed Walsh-Hadamard transform
    # into a randomized orthogonal rotation
    signs_k = (torch.randint(0, 2, (K,)) * 2 - 1).float()
    signs_n = (torch.randint(0, 2, (N,)) * 2 - 1).float()

    # Apply: W_rotated = H_k @ diag(signs_k) @ W @ diag(signs_n) @ H_n^T
    w_rotated = weight * signs_k.unsqueeze(1)
    w_rotated = hadamard_transform(w_rotated, dim=0)
    w_rotated = w_rotated * signs_n.unsqueeze(0)
    w_rotated = hadamard_transform(w_rotated, dim=1)

    return w_rotated, signs_k, signs_n

def quip_quantize(weight, bits=2, lattice='E8'):
    """Quantize using QuIP# with a lattice codebook (sketch)."""
    # Step 1: Incoherence transform
    w_rotated, signs_k, signs_n = quip_incoherence_transform(weight)

    # Step 2: Quantize to lattice points
    if lattice == 'E8':
        # E8 lattice: 8-dimensional lattice with kissing number 240.
        # Each 8-dimensional vector maps to the nearest E8 lattice point,
        # encoded at ~1-2 bits per dimension. The full encoder is
        # beyond this sketch.
        raise NotImplementedError("E8 lattice encoding omitted from this sketch")
    # Fallback: standard round-to-nearest on the rotated weights
    quantized = round_to_nearest(w_rotated, bits)

    return quantized, signs_k, signs_n
ℹ️ Why Lattices Beat Per-Element Quantization

A lattice codebook in d dimensions has exponentially more entries than d independent scalar codebooks. The E8 lattice in 8 dimensions provides 240 nearest-neighbor codewords at ~1 bit/dimension, achieving better rate-distortion trade-offs than per-element quantization. This is the mathematical reason QuIP# outperforms naive 2-bit quantization by a wide margin.

📊

3-Bit and 2-Bit Quality Comparison (Llama 2 70B)

| Method | Effective Bits | WikiText PPL | Delta vs FP16 | MMLU Acc | Quantization Time |
|---|---|---|---|---|---|
| FP16 (baseline) | 16.0 | 3.32 | 0% | 68.9% | N/A |
| GPTQ INT4 g128 | 4.2 | 3.58 | +7.8% | 63.2% | 4 hours |
| AQLM 3-bit (3 codebooks) | 3.0 | 3.72 | +12.0% | 60.8% | 12 hours |
| AQLM 2-bit (2 codebooks) | 2.0 | 4.45 | +34.0% | 54.2% | 12 hours |
| QuIP# 3-bit | 3.0 | 3.65 | +9.9% | 61.5% | 8 hours |
| QuIP# 2-bit | 2.0 | 4.15 | +25.0% | 56.8% | 8 hours |
| Naive RTN 2-bit | 2.0 | 12.4 | +273% | 28.1% | 10 min |
Note: QuIP# at 2 bits significantly outperforms naive round-to-nearest (RTN). But quality loss is still substantial for demanding tasks.

Ternary Networks: BitNet 1.58b

BitNet 1.58b restricts weights to three values: {-1, 0, +1}. This is 1.58 bits per weight (\log_2(3) \approx 1.585). The critical difference from post-training quantization: BitNet trains from scratch with ternary constraints.
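The 1.58-bit figure is achievable in storage, too: since 3^5 = 243 ≤ 256, five ternary weights fit in one byte, i.e. 1.6 bits per weight. A minimal base-3 packing sketch (illustrative; `pack_trits`/`unpack_trits` are hypothetical helpers, and real kernels favor layouts optimized for fast unpacking, such as 2-bit encodings):

```python
def pack_trits(trits):
    """Pack ternary values {-1, 0, +1} into bytes, 5 per byte (base-3)."""
    assert len(trits) % 5 == 0
    out = bytearray()
    for i in range(0, len(trits), 5):
        b = 0
        for t in reversed(trits[i:i + 5]):
            b = b * 3 + (t + 1)   # map {-1, 0, +1} -> {0, 1, 2}
        out.append(b)             # 3^5 = 243 <= 256, fits one byte
    return bytes(out)

def unpack_trits(data, n):
    """Inverse of pack_trits: recover the first n ternary values."""
    trits = []
    for b in data:
        for _ in range(5):
            trits.append(b % 3 - 1)
            b //= 3
    return trits[:n]

w = [-1, 0, 1, 1, -1, 0, 0, 1, -1, -1]
packed = pack_trits(w)            # 10 trits -> 2 bytes (1.6 bits/weight)
assert unpack_trits(packed, 10) == w
```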

BitNet Architecture

class BitLinear(torch.nn.Module):
    """BitNet 1.58b linear layer with ternary weights."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Latent float weights, ternarized to {-1, 0, +1} on each forward pass
        self.weight = torch.nn.Parameter(
            torch.randint(-1, 2, (out_features, in_features)).float()
        )
        self.scale = torch.nn.Parameter(torch.ones(1))

    def ternarize_weight(self):
        """Quantize weights to {-1, 0, +1} during forward pass."""
        # Absmean quantization
        gamma = self.weight.abs().mean()
        w_ternary = torch.sign(self.weight) * (self.weight.abs() > gamma * 0.5).float()
        return w_ternary, gamma

    def forward(self, x):
        # Quantize activations to INT8 range (per-tensor absmax), with a
        # straight-through estimator (STE) so gradients flow in training
        x_scale = 127.0 / x.abs().max().clamp(min=1e-5)
        x_quant = (x * x_scale).round().clamp(-128, 127)
        x_quant = x * x_scale + (x_quant - x * x_scale).detach()  # STE

        # Ternarize weights (STE again: gradient passes to self.weight)
        w_ternary, w_scale = self.ternarize_weight()
        w_ternary = self.weight + (w_ternary - self.weight).detach()

        # Matrix multiply: only additions and subtractions!
        # y = x_quant @ w_ternary.T
        # This is NOT a multiply -- it's: sum where w=1, subtract where w=-1, skip where w=0
        y = torch.nn.functional.linear(x_quant, w_ternary)

        # Rescale
        y = y * (w_scale / x_scale) * self.scale
        return y

Compute Implications

The multiplication in ternary matrix multiply degenerates to addition and subtraction:

y_i = \sum_{j: w_{ij}=+1} x_j - \sum_{j: w_{ij}=-1} x_j

This eliminates the need for multiply-accumulate (MAC) units entirely. A ternary GEMM requires only integer additions, which are cheaper and faster than FP16 multiplications.

// Ternary GEMM kernel: no multiplies, only add/subtract
__global__ void ternary_gemv(
    const int8_t* __restrict__ x,       // [K] input activations
    const int8_t* __restrict__ w_ternary, // [N, K] ternary weights {-1, 0, +1}
    int32_t* __restrict__ y,             // [N] output
    int K, int N
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    int32_t acc = 0;
    for (int k = 0; k < K; k++) {
        int8_t w = w_ternary[row * K + k];
        if (w == 1) {
            acc += x[k];        // Addition only
        } else if (w == -1) {
            acc -= x[k];        // Subtraction only
        }
        // w == 0: skip (sparsity)
    }
    y[row] = acc;
}

// Optimized: pack ternary as 2-bit and use bit manipulation
// 00 = 0, 01 = +1, 11 = -1
__global__ void ternary_gemv_packed(
    const int8_t* __restrict__ x,
    const uint32_t* __restrict__ w_packed,  // 16 ternary values per uint32
    int32_t* __restrict__ y,
    int K, int N
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    int32_t acc = 0;
    for (int k_packed = 0; k_packed < K / 16; k_packed++) {
        uint32_t packed = w_packed[row * (K/16) + k_packed];
        #pragma unroll
        for (int i = 0; i < 16; i++) {
            uint8_t bits = (packed >> (i * 2)) & 0x3;
            int k = k_packed * 16 + i;
            if (bits == 0x1) acc += x[k];       // +1
            else if (bits == 0x3) acc -= x[k];  // -1
            // bits == 0x0: zero, skip
        }
    }
    y[row] = acc;
}
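A CPU reference for the same 2-bit encoding (00 = 0, 01 = +1, 11 = -1) is handy for validating the packed kernel. This is an illustrative sketch; `pack_ternary_2bit` and `ternary_gemv_ref` are hypothetical helpers, not a production layout:

```python
def pack_ternary_2bit(weights):
    """Pack ternary values into uint32 words, 16 per word: 00=0, 01=+1, 11=-1."""
    assert len(weights) % 16 == 0
    words = []
    for i in range(0, len(weights), 16):
        word = 0
        for j, w in enumerate(weights[i:i + 16]):
            word |= {0: 0b00, 1: 0b01, -1: 0b11}[w] << (j * 2)
        words.append(word)
    return words

def ternary_gemv_ref(x, w_packed):
    """Dot product of activations x with one packed ternary row: add/sub only."""
    acc = 0
    for kp, word in enumerate(w_packed):
        for i in range(16):
            bits = (word >> (i * 2)) & 0x3
            k = kp * 16 + i
            if bits == 0b01:
                acc += x[k]       # +1 weight
            elif bits == 0b11:
                acc -= x[k]       # -1 weight
            # bits == 0b00: zero, skip
    return acc

w = [1, -1, 0, 1] * 4             # 16 ternary weights
x = list(range(16))               # activations 0..15
ref = sum(wi * xi for wi, xi in zip(w, x))
assert ternary_gemv_ref(x, pack_ternary_2bit(w)) == ref
```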

Theoretical Compute Cost per Token (Llama 70B equivalent)

FP16 (MAC): 140 TOPS · INT4 (dequant + MAC): 45 TOPS · Ternary (add/sub only): 18 TOPS (7.8x less than FP16) · Binary (XNOR + popcount): 9 TOPS (15.6x less than FP16)

Binary Neural Networks

Binary networks restrict weights to {-1, +1} (1 bit per weight). The matrix multiply becomes an XNOR operation followed by a population count (popcount). Encoding +1 as bit 1 and -1 as bit 0:

y_i = 2 \times \mathrm{popcount}\!\left(\mathrm{XNOR}(x_{\text{bin}}, w_{i,\text{bin}})\right) - N

where N is the vector dimension: XNOR marks the positions where the signs match, and each match contributes +1 to the dot product while each mismatch contributes -1.

// Binary GEMV: XNOR + popcount
// Pack 32 binary weights into a single uint32
__global__ void binary_gemv(
    const uint32_t* __restrict__ x_packed,  // [K/32] binary activations
    const uint32_t* __restrict__ w_packed,  // [N, K/32] binary weights
    int32_t* __restrict__ y,
    int K, int N
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    int32_t acc = 0;
    for (int k32 = 0; k32 < K / 32; k32++) {
        uint32_t xb = x_packed[k32];
        uint32_t wb = w_packed[row * (K/32) + k32];
        uint32_t xnor = ~(xb ^ wb);  // XNOR: 1 where bits match
        acc += __popc(xnor);          // Count matching bits
    }
    // Convert popcount to signed dot product:
    // dot = 2 * popcount - K
    y[row] = 2 * acc - K;
}
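The popcount identity can be sanity-checked in a few lines of reference Python (illustrative; `binary_dot_ref` and `to_bits` are hypothetical helpers):

```python
def binary_dot_ref(x_bits, w_bits, K):
    """Dot product of two +/-1 vectors encoded as bitmasks (bit=1 means +1)."""
    xnor = ~(x_bits ^ w_bits) & ((1 << K) - 1)   # 1 where signs match
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - K                       # dot = 2*popcount - K

# Check against the naive +/-1 dot product
K = 32
x = [1, -1] * 16
w = [1, 1, -1, -1] * 8
to_bits = lambda v: sum(1 << i for i, s in enumerate(v) if s == 1)
assert binary_dot_ref(to_bits(x), to_bits(w), K) == sum(a * b for a, b in zip(x, w))
```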

Binary networks are theoretically 32x more memory-efficient than FP32 and require no multiplications. But quality degradation for LLMs is severe — no post-training binary quantization of a pretrained LLM produces usable output. Binary LLMs must be trained from scratch, and even then, quality lags significantly behind FP16 models of the same parameter count.

⚠️ Binary LLMs Are Not Viable (Yet)

As of 2025, no binary (1-bit) LLM achieves competitive quality on standard benchmarks. The information capacity of 1 bit per weight is fundamentally limited by the rate-distortion bound. A 70B binary model stores 70 billion bits = 8.75 GB of information, equivalent to a 4.4B FP16 model. The parameter count is misleading — it is the total information content that determines capability.
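The capacity arithmetic behind that warning is simple to reproduce (a back-of-envelope sketch that treats every stored bit as usable information):

```python
params = 70e9
bits_per_weight = 1.0

total_bits = params * bits_per_weight
size_gb = total_bits / 8 / 1e9        # 70e9 bits = 8.75 GB
equiv_fp16_params = total_bits / 16   # FP16 stores 16 bits per parameter

print(f"{size_gb:.2f} GB stored")
print(f"~{equiv_fp16_params / 1e9:.1f}B FP16-equivalent params")
```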

Information-Theoretic Limits

The fundamental question: what is the minimum number of bits needed to represent a neural network without quality loss?

The rate-distortion bound from information theory states:

R(D) = \frac{1}{2} \log_2 \frac{\sigma_W^2}{D}

where \sigma_W^2 is the variance of the weight distribution and D is the allowed distortion (mean squared error). For a typical LLM weight distribution with \sigma_W \approx 0.01:

import math

def rate_distortion_bound(sigma_w, target_ppl_increase_pct):
    """Estimate minimum bits per weight for given quality target.

    This is a rough estimate -- actual minimum depends on the
    specific model's sensitivity to weight perturbation.
    """
    # Map PPL increase to allowed MSE (empirical relationship)
    # ~1% PPL increase corresponds to ~1e-6 MSE for large models
    allowed_mse = (target_ppl_increase_pct / 100) * 1e-6

    if allowed_mse <= 0 or allowed_mse >= sigma_w**2:
        return float('inf') if allowed_mse <= 0 else 0

    rate = 0.5 * math.log2(sigma_w**2 / allowed_mse)
    return rate

# For various quality targets
sigma_w = 0.01  # Typical weight std dev
for ppl_delta in [1, 5, 10, 20, 50]:
    min_bits = rate_distortion_bound(sigma_w, ppl_delta)
    print(f"PPL increase {ppl_delta}%: minimum ~{min_bits:.1f} bits/weight")
# PPL increase 1%: minimum ~6.6 bits/weight
# PPL increase 5%: minimum ~5.5 bits/weight
# PPL increase 10%: minimum ~5.0 bits/weight
# PPL increase 20%: minimum ~4.5 bits/weight
# PPL increase 50%: minimum ~3.8 bits/weight
📊

Theoretical vs Achieved Compression (70B Model)

| Quality Target (PPL delta) | Rate-Distortion Bound | Best Achieved (2025) | Method | Gap to Bound |
|---|---|---|---|---|
| +1% | ~6.6 bits | 8.0 bits (FP8) | FP8 per-channel | 1.4 bits |
| +5% | ~5.5 bits | 4.2 bits (AWQ INT4) | AWQ g128 | Below bound* |
| +10% | ~5.0 bits | 3.0 bits (QuIP#) | QuIP# 3-bit | Below bound* |
| +25% | ~4.3 bits | 2.0 bits (QuIP#) | QuIP# 2-bit | Below bound* |
| +50% | ~3.8 bits | 1.58 bits (BitNet) | BitNet from scratch | Below bound* |
Note: *Entries below the rate-distortion bound indicate that the bound estimate is conservative (it assumes Gaussian weights and ignores model structure). Real models have redundancy that quantization can exploit.

Hardware for Sub-4-Bit Inference

Current GPUs are not optimized for sub-4-bit datatypes. Tensor Cores support FP16, BF16, FP8, and (on Blackwell) FP4, but not 3-bit, 2-bit, or ternary formats. Efficient sub-4-bit inference requires specialized hardware.

Ternary Processing Units

A ternary multiply unit replaces the multiplier with a multiplexer:

FP16 Multiply:   16-bit * 16-bit → 32-bit  (hundreds of gates)
INT4 Multiply:    4-bit *  4-bit →  8-bit   (tens of gates)
Ternary "Multiply": select(+x, 0, -x)       (2 gates: mux + negate)
Binary "Multiply":  XNOR                     (1 gate)

The area and energy savings are dramatic:

📊

Compute Unit Comparison by Precision

| Operation | Gate Count | Energy (pJ) | Area (um^2) | Relative to FP16 |
|---|---|---|---|---|
| FP16 FMA | ~4000 | ~3.7 | ~2500 | 1.0x (baseline) |
| INT8 MAC | ~600 | ~0.9 | ~400 | 0.15x |
| INT4 MAC | ~200 | ~0.3 | ~150 | 0.05x |
| Ternary (mux) | ~10 | ~0.02 | ~8 | 0.003x |
| Binary (XNOR) | ~2 | ~0.005 | ~2 | 0.0005x |

Note: By these area figures, a chip with ternary units could fit roughly 300x more compute per mm^2 than FP16. But this requires custom silicon -- no such commercial chip exists at scale (2025).

Software Emulation on GPUs

Without native hardware support, sub-4-bit inference on GPUs requires software emulation:

def sub4bit_gemv_throughput(bits, gpu_mem_bw_gb_s=3350, gpu_tops=990):
    """Estimate throughput for sub-4-bit GEMV on H100.

    Sub-4-bit GEMV is memory-bandwidth bound (batch=1).
    Throughput scales with compression ratio.
    """
    # FP16 baseline: 2 bytes per weight
    fp16_bytes_per_weight = 2.0
    sub4_bytes_per_weight = bits / 8.0

    # Memory BW bound throughput
    # For 70B model, single decode step reads all weights
    model_params = 70e9
    fp16_model_bytes = model_params * fp16_bytes_per_weight
    sub4_model_bytes = model_params * sub4_bytes_per_weight

    fp16_time_ms = (fp16_model_bytes / (gpu_mem_bw_gb_s * 1e9)) * 1000
    sub4_time_ms = (sub4_model_bytes / (gpu_mem_bw_gb_s * 1e9)) * 1000

    # Add dequantization overhead (~20% for sub-4-bit)
    sub4_time_ms *= 1.2

    fp16_tok_s = 1000 / fp16_time_ms
    sub4_tok_s = 1000 / sub4_time_ms

    return {
        'fp16_ms': fp16_time_ms,
        'sub4_ms': sub4_time_ms,
        'fp16_tok_s': fp16_tok_s,
        'sub4_tok_s': sub4_tok_s,
        'speedup': fp16_time_ms / sub4_time_ms
    }

for bits in [4, 3, 2, 1.58, 1]:
    r = sub4bit_gemv_throughput(bits)
    print(f"{bits:.2f} bits: {r['sub4_tok_s']:.0f} tok/s "
          f"({r['speedup']:.1f}x vs FP16)")

Theoretical Decode Speed by Bit Width (70B Model, H100, batch=1)

FP16 (16-bit): 24 tokens/sec · INT4 (4-bit): 80 tokens/sec · 3-bit: 100 tokens/sec · 2-bit: 140 tokens/sec · 1.58-bit (ternary): 162 tokens/sec · 1-bit (binary, if quality existed): 230 tokens/sec

HQQ: Half-Quadratic Quantization

HQQ is a recent method that achieves competitive sub-4-bit quality without requiring calibration data:

import torch

def hqq_quantize(weight, bits=2, group_size=64, num_iters=20):
    """Half-Quadratic Quantization: no calibration data needed.

    Optimizes: min ||W - dequant(Q)||^2 by alternating quantization
    with least-squares refits of scale/zero. (The full HQQ method uses
    half-quadratic splitting with a sparsity-promoting prior on the
    error; this is a simplified sketch of the alternating structure.)
    Returns the dequantized (simulated-quantized) weight, in place.
    """
    K, N = weight.shape
    num_groups = K // group_size
    num_levels = 2**bits - 1

    for g in range(num_groups):
        group = weight[g*group_size:(g+1)*group_size, :].clone()

        # Initialize scale and zero-point from the group min/max,
        # so that dequant(q) = scale * (q - zero)
        gmin = group.min(dim=0).values
        gmax = group.max(dim=0).values
        scale = ((gmax - gmin) / num_levels).clamp(min=1e-8)
        zero = -gmin / scale

        # Iterative refinement: alternate quantize / refit
        for iteration in range(num_iters):
            # Step 1: Quantize with current scale/zero
            q = torch.round(group / scale + zero).clamp(0, num_levels)
            q_float = q.float()

            # Step 2: Least-squares refit of scale and zero per column,
            # minimizing ||group - scale * (q - zero)||^2
            numerator = (group * q_float).sum(dim=0) - group.sum(dim=0) * q_float.mean(dim=0)
            denominator = (q_float**2).sum(dim=0) - q_float.sum(dim=0) * q_float.mean(dim=0)
            scale = (numerator / denominator.clamp(min=1e-10)).clamp(min=1e-8)
            # Intercept of the fit is -scale*zero, so:
            zero = q_float.mean(dim=0) - group.mean(dim=0) / scale

        # Final quantize with the refined scale/zero, then store dequantized values
        q = torch.round(group / scale + zero).clamp(0, num_levels)
        weight[g*group_size:(g+1)*group_size, :] = scale * (q - zero)

    return weight
📊

Sub-4-Bit Methods Quality Comparison (Llama 2 13B, WikiText-2 PPL)

| Method | 2-bit PPL | 3-bit PPL | 4-bit PPL | Needs Calibration | Speed |
|---|---|---|---|---|---|
| RTN (round to nearest) | 44.2 | 8.12 | 5.68 | No | Fast (minutes) |
| GPTQ | 18.5 | 5.82 | 5.28 | Yes (128 samples) | Slow (hours) |
| QuIP# | 6.84 | 5.45 | 5.15 | Yes | Very slow |
| AQLM | 7.12 | 5.52 | 5.18 | Yes | Very slow |
| HQQ | 8.45 | 5.65 | 5.22 | No | Fast (minutes) |
| FP16 baseline | - | - | 5.09 | N/A | N/A |
Note: QuIP# achieves the best quality at 2 bits. HQQ is notable for requiring no calibration data.

When Sub-4-Bit Is Viable

Sub-4-bit quantization is appropriate in specific scenarios:

# Decision framework for sub-4-bit quantization
def should_use_sub4bit(
    model_size_b,          # total parameter count, e.g. 70e9
    target_gpu_vram_gb,
    quality_tolerance_pct,
    use_case
):
    """Determine if sub-4-bit quantization is appropriate."""
    int4_size_gb = model_size_b * 0.5 / 1e9 * 1.1  # 4-bit weights +10% overhead

    if int4_size_gb <= target_gpu_vram_gb * 0.7:
        return "INT4 fits comfortably. Use INT4 (GPTQ/AWQ)."

    if quality_tolerance_pct < 10:
        return "Quality tolerance too tight for sub-4-bit. Use INT4 with TP."

    # Bits that fit in 70% of VRAM: GB * 8 Gbit/GB over Gparams
    bits_needed = target_gpu_vram_gb * 0.7 * 8 / (model_size_b / 1e9)
    if bits_needed < 1.5:
        return "Model too large even at 1.58 bits. Need multiple GPUs."

    if use_case == "chat":
        return f"Try {bits_needed:.1f}-bit (QuIP# or HQQ). Chat tolerates more error."
    elif use_case == "code":
        return "Code generation is sensitive. Consider INT4 with TP instead."
    elif use_case == "summarization":
        return f"Try {bits_needed:.1f}-bit. Summarization is moderately tolerant."

    return f"Evaluate {bits_needed:.1f}-bit with your specific benchmark."
The Real Competitor Is a Smaller Model

A 70B model at 2 bits stores roughly the same information as a 13B model at FP16. In many benchmarks, the 13B FP16 model actually outperforms the 70B 2-bit model because it has better-preserved parameters. Before using sub-4-bit quantization on a large model, benchmark against a smaller model at higher precision.

Summary

Sub-4-bit quantization is an active research frontier. At 3 bits, QuIP# and AQLM maintain usable quality with 20-30% model size reduction versus INT4. At 2 bits, quality degradation becomes significant for demanding tasks but may be acceptable for casual chat applications. Ternary networks (1.58 bits) require training from scratch — BitNet 1.58b demonstrates that ternary training is viable, but no ternary model matches FP16 quality at the same effective information content. Binary (1-bit) networks remain impractical for LLMs. The fundamental constraint is information-theoretic: below ~4 bits, you are discarding meaningful information about the weight distribution, and no algorithm can fully recover what was lost. Hardware support for sub-4-bit datatypes would dramatically improve inference speed, but commercial chips are unlikely before 2027.