Part of Series: Inference Optimization Timeline (53 of 60)
Model Compression: Pruning, Distillation, and Why Quantization Won

The Promise That Pruning Never Delivered

For over a decade, model pruning has been one of the most intellectually compelling ideas in deep learning optimization. The premise is elegant: neural networks are massively over-parameterized, so we should be able to remove 90% of the weights and retain most of the accuracy. The research literature is filled with papers demonstrating exactly this claim. Yet if you look at how models are actually deployed in production today, you will find quantization everywhere, pruning almost nowhere, and knowledge distillation occupying a meaningful but specialized niche.

This is not an accident. It is the result of a fundamental mismatch between what pruning produces and what hardware can efficiently execute. Understanding this mismatch — and the exceptions where pruning genuinely works — is essential for any ML engineer making compression decisions.

This article dissects the full landscape of model compression: why pruning underdelivered on its theoretical promise, how quantization became the dominant technique, where knowledge distillation fits in, and the specific conditions under which pruning is still the right choice.

Why Pruning Has Not Become Mainstream

The Core Problem: Hardware Does Not Support Irregular Sparsity

The most common form of pruning is unstructured (fine-grained) magnitude pruning. You rank every weight by its absolute value, zero out the smallest ones, and retrain. Research papers routinely achieve 90%+ sparsity with minimal accuracy loss on benchmarks like ImageNet.

The problem is that zeroing out 90% of the weights in an unstructured pattern does not translate to a 10x speedup on any real hardware. Here is why:

GPUs execute dense matrix multiplications. A GPU’s compute units are designed around GEMM (General Matrix Multiply) operations that operate on dense, contiguous blocks of memory. When you set individual weights to zero in a random scatter pattern, the GPU still has to:

  1. Load the full weight matrix from memory (the zeros still occupy space unless you use a sparse format)
  2. Perform multiplications with those zeros (which produce zero, wasting cycles)
  3. Or use a sparse matrix format like CSR/COO, which adds indexing overhead that often exceeds the savings from skipping zero multiplications

The result is that 90% unstructured sparsity on an NVIDIA GPU typically yields only 1.0-1.5x speedup compared to the dense baseline. Sometimes it is actually slower due to sparse format overhead.

import torch
import time

def benchmark_sparse_vs_dense(M=4096, K=4096, N=4096, sparsity=0.9):
    """
    Demonstrate the gap between theoretical and actual speedup
    from unstructured sparsity on GPU.
    """
    # Dense matrix multiply
    A = torch.randn(M, K, device='cuda', dtype=torch.float16)
    B = torch.randn(K, N, device='cuda', dtype=torch.float16)

    # Create sparse version (90% zeros)
    mask = (torch.rand(K, N, device='cuda') > sparsity).half()
    B_sparse = B * mask

    # Warmup
    for _ in range(20):
        _ = torch.mm(A, B)
        _ = torch.mm(A, B_sparse)

    torch.cuda.synchronize()

    # Benchmark dense
    start = time.perf_counter()
    for _ in range(100):
        _ = torch.mm(A, B)
    torch.cuda.synchronize()
    dense_time = (time.perf_counter() - start) / 100

    # Benchmark "sparse" (still dense format, just zeros in values)
    start = time.perf_counter()
    for _ in range(100):
        _ = torch.mm(A, B_sparse)
    torch.cuda.synchronize()
    sparse_time = (time.perf_counter() - start) / 100

    # The speedup is essentially 1.0x because the GPU does
    # the same number of FLOPs regardless of zero values
    return {
        'dense_ms': dense_time * 1000,
        'sparse_ms': sparse_time * 1000,
        'speedup': dense_time / sparse_time
    }
⚠️ The Sparsity Illusion

A model with 90% sparsity has 10x fewer non-zero parameters, but on standard GPU hardware, inference latency barely changes. The theoretical compression does not become practical speedup without hardware that can skip zero computations.

CPUs Are Not Much Better

On CPUs, the story is slightly more favorable because CPUs tolerate irregular memory access and control flow far better than GPUs, so sparse matrix formats can pay off in certain regimes. Intel MKL and oneDNN have sparse GEMM kernels that can achieve 2-3x speedup at very high sparsity levels (95%+). However, these speedups:

  • Only materialize at extreme sparsity where accuracy degrades significantly
  • Depend on the specific sparsity pattern being amenable to the CSR format
  • Are still far below the theoretical maximum
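The CSR trade-off is easy to reproduce on CPU with SciPy. A small sketch (actual timings vary by machine; the point is that the sparse path is nowhere near 10x faster at 90% sparsity):

```python
import time

import numpy as np
import scipy.sparse as sp

def csr_overhead_demo(n=2048, sparsity=0.9, seed=0):
    """Compare dense vs CSR matrix-vector products on CPU at a given sparsity.

    Returns timings in milliseconds; the CSR path pays indexing overhead
    that often cancels the FLOP savings at moderate sparsity.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n)).astype(np.float32)
    W[rng.random((n, n)) < sparsity] = 0.0            # unstructured zeros
    x = rng.standard_normal(n).astype(np.float32)

    W_csr = sp.csr_matrix(W)                          # compressed sparse row format

    # Correctness check: both paths compute the same product
    assert np.allclose(W @ x, W_csr @ x, atol=1e-3)

    def bench(f, iters=50):
        f()                                           # warmup
        t0 = time.perf_counter()
        for _ in range(iters):
            f()
        return (time.perf_counter() - t0) / iters * 1000

    return {'dense_ms': bench(lambda: W @ x),
            'csr_ms': bench(lambda: W_csr @ x),
            'nnz_fraction': W_csr.nnz / (n * n)}
```

At 90% sparsity `nnz_fraction` is about 0.1, yet `csr_ms` is rarely anywhere near 10x lower than `dense_ms`.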
📊

Unstructured Sparsity: Theory vs Reality

Sparsity Level | Theoretical Speedup | GPU Actual Speedup | CPU Actual Speedup | Typical Accuracy Loss
50% | 2.0x | 1.0x | 1.0-1.2x | 0.1-0.3%
80% | 5.0x | 1.0-1.1x | 1.2-1.5x | 0.3-0.8%
90% | 10.0x | 1.0-1.5x | 1.5-2.0x | 0.5-2.0%
95% | 20.0x | 1.1-1.8x | 2.0-3.0x | 1.0-5.0%
99% | 100.0x | 1.5-2.5x | 3.0-5.0x | 3.0-15.0%

The Software Ecosystem Gap

Even if hardware could execute sparse operations efficiently, the software ecosystem is not built for it. PyTorch’s torch.sparse module is limited and poorly optimized. TensorFlow’s sparse support is similarly incomplete. ONNX Runtime has no meaningful sparse execution path. The entire stack — from model definition to training framework to inference runtime to hardware — assumes dense tensors.

Building a full sparse inference stack requires:

  • A sparse tensor format that the runtime understands
  • Sparse kernels for every operation (not just GEMM)
  • A graph compiler that can reason about sparsity propagation
  • Hardware that can actually benefit from the sparse format

No mainstream deployment stack provides all four of these today.
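To see how shallow the current sparse path is, here is a minimal PyTorch sketch: the CSR format exists and a sparse-dense matmul works, but little else in a model graph does.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)
W[torch.rand(64, 64) > 0.1] = 0.0        # ~90% unstructured zeros
X = torch.randn(64, 8)

W_csr = W.to_sparse_csr()                # the sparse format exists...
Y = W_csr @ X                            # ...and sparse @ dense GEMM works

# ...but that is roughly where it ends: convolutions, normalizations, and
# fused attention kernels have no CSR execution path, so a real model
# cannot stay in a sparse format end to end through inference.
assert torch.allclose(Y, W @ X, atol=1e-5)
```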

The Exception: N:M Structured Sparsity on Ampere

How 2:4 Sparsity Works

NVIDIA’s Ampere architecture (A100, released 2020) introduced hardware support for a very specific form of sparsity: 2:4 structured sparsity. In every group of 4 consecutive elements, exactly 2 must be zero. This is a rigid constraint, but the hardware can exploit it efficiently because the sparsity pattern is perfectly regular.

The Sparse Tensor Core on Ampere works by:

  1. Storing only the 2 non-zero values per group of 4 (50% compression)
  2. Storing a 2-bit index per group indicating which 2 of the 4 positions are non-zero
  3. Performing the matrix multiply using only the non-zero values, with the index to place results correctly

This gives a genuine 2x speedup for FP16 GEMM operations with zero additional software overhead beyond the initial pruning.
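The storage scheme in steps 1-2 can be illustrated in a few lines of NumPy. This is a conceptual sketch of the values-plus-2-bit-metadata layout, not NVIDIA's exact bit packing:

```python
import numpy as np

def compress_2_4(row: np.ndarray):
    """Illustrate Ampere-style 2:4 storage: per group of 4 elements, keep the
    2 largest-magnitude values plus a 2-bit position index for each.
    (Conceptual layout only, not NVIDIA's actual metadata encoding.)"""
    assert row.size % 4 == 0
    groups = row.reshape(-1, 4)
    # positions of the top-2 magnitudes in each group (each fits in 2 bits)
    pos = np.argsort(-np.abs(groups), axis=1)[:, :2]
    pos.sort(axis=1)                                   # preserve original order
    vals = np.take_along_axis(groups, pos, axis=1)     # only 50% of the values
    return vals, pos.astype(np.uint8)

def decompress_2_4(vals, pos, n):
    """Reconstruct the dense row with zeros in the pruned positions."""
    groups = np.zeros((n // 4, 4), dtype=vals.dtype)
    np.put_along_axis(groups, pos.astype(np.int64), vals, axis=1)
    return groups.reshape(n)
```

A 16-element FP16 row thus stores 8 values plus 8 two-bit indices (2 bytes of metadata) instead of 16 values, and the regular pattern lets the hardware feed only non-zeros into the multiply units.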

import torch
from torch import nn

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """
    Apply 2:4 structured sparsity to a weight tensor.
    For every group of 4 consecutive elements (along the last dim),
    keep the 2 with largest magnitude, zero the other 2.
    """
    assert weight.shape[-1] % 4 == 0, "Last dimension must be divisible by 4"

    # Reshape to groups of 4
    original_shape = weight.shape
    w = weight.reshape(-1, 4)

    # Find top-2 by magnitude in each group
    _, top2_indices = torch.topk(torch.abs(w), k=2, dim=1)

    # Create mask
    mask = torch.zeros_like(w, dtype=torch.bool)
    mask.scatter_(1, top2_indices, True)

    # Apply mask (cast to the weight dtype to avoid silent promotion to FP32)
    pruned = w * mask.to(w.dtype)
    return pruned.reshape(original_shape)

def apply_nm_sparsity(weight: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """
    Generalized N:M sparsity: keep N values in every group of M.
    For Ampere Sparse Tensor Cores, use n=2, m=4.
    """
    assert weight.shape[-1] % m == 0
    original_shape = weight.shape
    w = weight.reshape(-1, m)
    _, topn_indices = torch.topk(torch.abs(w), k=n, dim=1)
    mask = torch.zeros_like(w, dtype=torch.bool)
    mask.scatter_(1, topn_indices, True)
    return (w * mask.to(w.dtype)).reshape(original_shape)
📊

2:4 Structured Sparsity on NVIDIA Ampere (A100)

Model | Dense FP16 Latency | 2:4 Sparse FP16 Latency | Speedup | Accuracy Delta
ResNet-50 | 0.82 ms | 0.48 ms | 1.71x | -0.3%
BERT-Base | 4.2 ms | 2.5 ms | 1.68x | -0.4%
GPT-2 Medium | 12.1 ms | 7.0 ms | 1.73x | -0.5%
EfficientNet-B4 | 1.9 ms | 1.15 ms | 1.65x | -0.6%
ViT-Large | 8.4 ms | 5.1 ms | 1.65x | -0.4%

Limitations of 2:4 Sparsity

The 2:4 pattern is a genuine win, but it has important constraints:

  • Only 50% sparsity. You cannot achieve 80% or 90% sparsity with this approach. If you need higher compression, 2:4 is not sufficient.
  • Ampere and later only. Older GPUs (V100, T4) have no hardware support. The vast majority of deployed inference hardware is still pre-Ampere.
  • GEMM operations only. Sparse Tensor Cores accelerate matrix multiplies. Convolutions benefit only through im2col, and element-wise operations do not benefit at all.
  • Retraining required. You cannot simply prune a trained model to 2:4 and get good accuracy. Fine-tuning for 10-20% of the original training schedule is typically needed.
  • The 2x ceiling. Even in the best case, the speedup is 2x. Quantization from FP16 to INT8 also gives roughly 2x speedup and is much simpler to apply.
💡 When 2:4 Sparsity Wins

The sweet spot for 2:4 sparsity is when you combine it with quantization: a 2:4 sparse INT8 model on Ampere gets roughly 4x speedup over dense FP16. This combination is more attractive than either technique alone.

SparseGPT: Pruning Meets LLMs

One-Shot Pruning Without Retraining

In 2023, Frantar and Alistarh introduced SparseGPT, which showed that large language models can be pruned to 50-60% unstructured sparsity in a single shot without any retraining. The key insight is that LLMs are so heavily over-parameterized that careful, layer-by-layer pruning guided by Hessian information can remove weights without catastrophic accuracy loss.

SparseGPT works by:

  1. Processing the model layer by layer
  2. For each layer, computing an approximate inverse Hessian using a small calibration set (128 examples)
  3. Using the Hessian to determine which weights to prune and how to adjust the remaining weights to compensate

The entire process takes minutes, not days.
# Pseudocode for the SparseGPT algorithm
import torch

def sparse_gpt_layer(W, X, sparsity_target):
    """
    Prune a single layer's weight matrix W given input activations X.

    W: weight matrix (d_out x d_in)
    X: calibration input activations (n_samples x d_in)
    sparsity_target: fraction of weights to remove (e.g., 0.5)
    """
    # Compute Hessian approximation: H = X^T X + lambda * I
    H = X.T @ X
    H += 1e-6 * torch.eye(H.shape[0], device=H.device)

    # Cholesky decomposition for efficient inverse
    H_inv = torch.linalg.cholesky(H)
    H_inv = torch.cholesky_inverse(H_inv)

    # Process columns in blocks
    block_size = 128
    n_cols = W.shape[1]

    for col_start in range(0, n_cols, block_size):
        col_end = min(col_start + block_size, n_cols)
        W_block = W[:, col_start:col_end].clone()
        H_block = H_inv[col_start:col_end, col_start:col_end]

        # Determine which weights to prune in this block based on the
        # squared magnitude / inverse-Hessian diagonal ratio
        scores = W_block ** 2 / H_block.diag().unsqueeze(0)
        threshold = torch.quantile(scores.flatten(), sparsity_target)
        mask = scores > threshold          # True = keep this weight

        # Prune (zero the low-score weights) and keep the survivors
        pruned = W_block * mask.float()
        error = (W_block - pruned) @ H_block
        W[:, col_start:col_end] = pruned
        # Propagate the pruning error to subsequent columns
        W[:, col_end:] -= error @ H_inv[col_start:col_end, col_end:]

    return W
📊

SparseGPT Results on LLMs

Model | Sparsity | Dense Perplexity | Sparse Perplexity | Perplexity Increase
OPT-1.3B | 50% | 14.62 | 15.52 | +0.90
OPT-6.7B | 50% | 10.86 | 11.22 | +0.36
OPT-30B | 50% | 9.56 | 9.78 | +0.22
OPT-66B | 50% | 9.34 | 9.48 | +0.14
LLaMA-7B | 50% | 5.68 | 6.12 | +0.44
LLaMA-30B | 50% | 4.10 | 4.24 | +0.14

The Catch: Still No Hardware Speedup

SparseGPT is impressive research, but it runs into the same hardware wall. The 50% unstructured sparsity it produces does not speed up inference on current GPUs. The authors acknowledge this and propose combining SparseGPT with 2:4 structured sparsity (SparseGPT can optimize for the 2:4 pattern), which does yield real speedups on Ampere.

However, for LLMs specifically, the primary constraint is usually memory (fitting the model on GPUs), not compute. And for memory reduction, quantization is more effective: GPTQ quantizes to 4-bit with comparable accuracy loss, achieving 4x memory reduction vs. 2x from 50% sparsity.
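The memory arithmetic behind that comparison is worth making explicit. A rough sketch (the per-weight overhead figures are illustrative assumptions, not measured numbers):

```python
def weight_memory_gb(n_params, bits_per_weight, overhead_bits=0.0):
    """Approximate weight storage in GB; overhead_bits covers per-weight
    metadata such as sparse indices or quantization scales."""
    return n_params * (bits_per_weight + overhead_bits) / 8 / 1e9

n = 7e9                                                 # a 7B-parameter model
fp16   = weight_memory_gb(n, 16)                        # 14.0 GB dense baseline
sparse = weight_memory_gb(n / 2, 16, overhead_bits=2)   # 50% kept, ~2-bit index each
int4   = weight_memory_gb(n, 4, overhead_bits=0.25)     # ~group-wise FP scales
```

Even with generous overhead assumptions, 4-bit quantization (~3.7 GB) beats 50% sparsity (~7.9 GB) by roughly 2x for the same 7B model.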

Knowledge Distillation: The Alternative Path

How Distillation Works

Knowledge distillation takes a completely different approach to compression. Instead of removing parts of a large model, you train a smaller model to mimic the large model’s behavior. The “teacher” (large model) generates soft probability distributions over outputs, and the “student” (small model) learns from both the soft targets and the ground truth labels.

The key equation combines two loss terms:

L = \alpha \cdot L_{CE}(y, \sigma(z_s)) + (1 - \alpha) \cdot T^2 \cdot L_{KL}(\sigma(z_t / T), \sigma(z_s / T))

where L_{CE} is the cross-entropy with hard labels, L_{KL} is the KL divergence between teacher and student soft outputs, T is the temperature parameter that softens the probability distributions, and \alpha balances the two losses.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = nn.CrossEntropyLoss()
        self.kl_loss = nn.KLDivLoss(reduction='batchmean')

    def forward(self, student_logits, teacher_logits, labels):
        # Hard label loss
        hard_loss = self.ce_loss(student_logits, labels)

        # Soft label loss (knowledge distillation)
        T = self.temperature
        soft_student = F.log_softmax(student_logits / T, dim=-1)
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        soft_loss = self.kl_loss(soft_student, soft_teacher) * (T * T)

        return self.alpha * hard_loss + (1 - self.alpha) * soft_loss

def distill_model(teacher, student, train_loader, epochs=10,
                  temperature=4.0, alpha=0.5, lr=1e-3):
    """
    Train a student model using knowledge distillation from a teacher.
    """
    teacher.eval()
    student.train()
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    criterion = DistillationLoss(temperature=temperature, alpha=alpha)

    for epoch in range(epochs):
        total_loss = 0.0
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.cuda(), labels.cuda()

            # Teacher forward (no gradient needed)
            with torch.no_grad():
                teacher_logits = teacher(inputs)

            # Student forward
            student_logits = student(inputs)

            loss = criterion(student_logits, teacher_logits, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

    return student

Distillation in the LLM Era

Knowledge distillation has seen a resurgence with LLMs. Many of the most capable open-weight models are actually distilled:

  • Alpaca was distilled from GPT-3.5 outputs
  • Vicuna was trained on ShareGPT conversations (a form of distillation)
  • Phi-2 (2.7B) was distilled from larger models using carefully curated synthetic data
  • Mistral's models reportedly incorporate distillation in their training recipe

For LLMs, distillation often means generating training data from a large teacher model rather than using the traditional soft-label approach. This is sometimes called “data distillation” and avoids the need to run teacher and student simultaneously during training.
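A minimal sketch of that data-distillation loop, with `teacher_generate` as a hypothetical stand-in for whatever teacher API is actually used:

```python
# Minimal sketch of "data distillation" for LLMs: the teacher is called only
# offline to label prompts; the student then trains on the resulting pairs
# with ordinary supervised loss, never seeing the teacher's logits.

def teacher_generate(prompt: str) -> str:
    """Placeholder for an offline call to a large teacher model."""
    return f"<teacher answer to: {prompt}>"

def build_distillation_set(prompts):
    """Turn raw prompts into (prompt, response) training pairs for the student."""
    return [(p, teacher_generate(p)) for p in prompts]

dataset = build_distillation_set(["Explain KV caching.", "What is GEMM?"])
```

Unlike classic soft-label distillation, teacher and student never need to run simultaneously, which is what makes this practical when the teacher is a hosted frontier model.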

📊

Knowledge Distillation Results

Student Model | Teacher Model | Student Size | Teacher Accuracy | Student Accuracy | Compression
DistilBERT | BERT-Base | 66M | 88.9% (GLUE) | 86.9% (GLUE) | 1.7x smaller, 1.6x faster
TinyBERT | BERT-Base | 14.5M | 88.9% | 86.4% | 7.5x smaller, 9.4x faster
MiniLM | BERT-Base | 22M | 88.9% | 87.5% | 5x smaller, 5x faster
Phi-2 (2.7B) | GPT-4 (data) | 2.7B | 86.1% (MMLU) | 56.7% (MMLU) | ~600x smaller
Gemma-2B | Larger Gemma | 2B | 64.3% (MMLU) | 42.3% (MMLU) | ~13x smaller

Why Distillation Works Differently Than Pruning

Distillation and pruning solve the compression problem in fundamentally different ways:

Pruning keeps the same architecture and removes weights. The resulting model has the same number of layers, the same hidden dimensions, and the same operation graph — just with zeros scattered through the weight matrices. This means pruning cannot reduce the number of operations unless hardware can skip zero computations.

Distillation creates a genuinely smaller architecture: fewer layers, smaller hidden dimensions, fewer attention heads. The resulting model requires fewer FLOPs by construction. A 6-layer student runs faster than a 12-layer teacher on every hardware platform, no special sparse support needed.

This is the key advantage of distillation: the speedup is guaranteed by architecture, not dependent on hardware support for sparsity.
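This can be made concrete with the standard back-of-the-envelope estimates (~12·d² parameters per transformer layer, ~2 FLOPs per parameter per token); the numbers below are approximations, not measurements:

```python
def transformer_params(n_layers, d_model, vocab=32000):
    """Rough encoder/decoder parameter count: ~12*d^2 per layer (attention
    plus MLP) plus embeddings. Ignores biases and layer norms."""
    return n_layers * 12 * d_model**2 + vocab * d_model

def flops_per_token(params):
    """~2 FLOPs per parameter per token (one multiply, one add)."""
    return 2 * params

teacher = transformer_params(n_layers=12, d_model=768)   # BERT-Base-scale, ~110M
student = transformer_params(n_layers=6, d_model=768)    # DistilBERT-scale, ~67M
ratio = flops_per_token(teacher) / flops_per_token(student)
```

The ~1.6x FLOP ratio lines up with DistilBERT's measured speedup: halving the layers cuts compute by construction, with no sparse hardware in the loop.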

Compression Technique Comparison: Speedup vs Accuracy

[Bar chart: speedup achieved by each compression technique; x-axis: speedup (higher is better)]

Why Quantization Won

The Hardware Alignment Story

Quantization reduces the numerical precision of weights and activations, typically from FP32 or FP16 to INT8 or INT4. Unlike pruning, quantization maps perfectly onto existing hardware capabilities:

  1. Every modern processor has integer ALUs. INT8 multiply-accumulate is faster and more power-efficient than FP16 on GPUs, CPUs, and accelerators.
  2. No sparse indexing overhead. All values are present, just in lower precision. The memory layout is the same dense format hardware already optimizes for.
  3. Linear memory reduction. Going from FP16 to INT8 cuts memory in half. Going to INT4 cuts it by 4x. This is reliable and predictable.
  4. Mature software ecosystem. TensorRT, ONNX Runtime, OpenVINO, Core ML, and TFLite all have robust quantization support. The tooling just works.
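The "no sparse indexing overhead" point is visible in even the simplest scheme. A minimal sketch of symmetric per-tensor INT8 quantization in NumPy:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: one FP scale, dense INT8 values.
    Every value stays present -- no indices, no irregular memory layout."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()   # bounded by half a quantization step
```

The quantized tensor is the same dense layout hardware already optimizes for, just 4x smaller than FP32, and the worst-case error is half a quantization step.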
📊

Quantization vs Pruning vs Distillation: Practical Comparison

Technique | Memory Reduction | Actual Speedup (GPU) | Accuracy Loss | Hardware Requirements | Tooling Maturity
INT8 Quantization | 2x | 1.5-2.5x | 0.1-1.0% | Any modern GPU/CPU | Excellent
INT4 Quantization (GPTQ) | 4x | 2-3x | 0.5-2.0% | Any GPU | Good
50% Unstructured Pruning | 1x (no format change) | 1.0-1.1x | 0.1-0.5% | No benefit on standard HW | Poor
90% Unstructured Pruning | ~2x (with CSR) | 1.0-1.5x | 0.5-3.0% | Needs sparse HW/SW | Poor
2:4 Structured Pruning | 1.5x | 1.6-1.8x | 0.3-0.6% | NVIDIA Ampere+ | Moderate
Distillation (2x smaller) | 2x | 2x | 1-3% | Any hardware | Good

The Quantization Toolkit Today

The quantization ecosystem has matured dramatically:

  • GPTQ (2022): Post-training quantization to 4-bit for LLMs, using Hessian-based weight rounding. Widely supported in vLLM, TGI, and llama.cpp.
  • AWQ (2023): Activation-aware weight quantization that protects salient weights. Often slightly better accuracy than GPTQ.
  • GGUF / llama.cpp: Ecosystem for running quantized LLMs on CPUs. Supports 2-bit through 8-bit with various quantization schemes.
  • bitsandbytes NF4: 4-bit NormalFloat quantization for QLoRA training. Enables fine-tuning 65B models on a single 48GB GPU.
  • FP8 (Hopper): NVIDIA H100 supports FP8 natively, giving 2x throughput over FP16 with minimal accuracy loss. This is the new default for LLM training.
# Example: Quantizing a model with GPTQ (using auto-gptq library)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

def quantize_llm_gptq(model_name, calibration_data, bits=4):
    """
    Quantize an LLM to 4-bit using GPTQ.
    calibration_data is a list of tokenized examples.
    This produces a model that runs 2-3x faster and uses 4x less memory.
    """
    quantize_config = BaseQuantizeConfig(
        bits=bits,
        group_size=128,     # Quantize in groups of 128 for better accuracy
        desc_act=True,      # Use activation order for better accuracy
        damp_percent=0.01,  # Dampening for Hessian computation
    )

    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config,
    )

    # Quantize using the calibration data passed in above
    model.quantize(calibration_data)

    # Save the quantized model
    model.save_quantized(f"{model_name}-gptq-{bits}bit")

    return model

Quantization + Sparsity: The Best of Both Worlds

The most promising direction combines quantization with structured sparsity. On NVIDIA Ampere and Hopper GPUs:

\text{Effective Speedup} = \text{Quantization Speedup} \times \text{Sparsity Speedup}

A 2:4 sparse INT8 model gets approximately 2x × 2x = 4x speedup over dense FP16. This combination is the closest thing to “free performance” available today.

📊

Combined Quantization + Sparsity on A100

Configuration | Memory | Compute Throughput | Accuracy (ResNet-50 Top-1)
Dense FP16 | Baseline | 1.0x | 76.1%
Dense INT8 | 0.5x | 2.0x | 75.8%
2:4 Sparse FP16 | 0.75x | 1.7x | 75.8%
2:4 Sparse INT8 | 0.375x | 3.4x | 75.4%
Dense INT4 (GPTQ) | 0.25x | 2.5x | 74.2%

When Pruning Still Matters

Despite the hardware challenges, there are specific scenarios where pruning is the right choice:

Structured pruning — removing entire channels, attention heads, or layers — produces genuinely smaller dense models. This avoids the sparse execution problem entirely because the pruned model is just a smaller dense model.

import torch
from torch import nn

def structured_prune_attention_heads(model, importance_scores, prune_ratio=0.25):
    """
    Remove entire attention heads based on importance scores.
    The result is a smaller dense model that runs faster on all hardware.
    """
    n_heads = model.config.num_attention_heads
    n_prune = int(n_heads * prune_ratio)

    for layer_idx, layer in enumerate(model.transformer.layers):
        # Get head importance for this layer
        head_scores = importance_scores[layer_idx]  # shape: (n_heads,)

        # Find least important heads
        _, prune_indices = torch.topk(head_scores, n_prune, largest=False)

        # Remove head dimensions from Q, K, V, O projections
        head_dim = model.config.hidden_size // n_heads
        keep_mask = torch.ones(n_heads, dtype=torch.bool)
        keep_mask[prune_indices] = False

        # Slice weight matrices to remove pruned heads
        keep_indices = torch.where(keep_mask)[0]
        keep_dims = torch.cat([
            torch.arange(h * head_dim, (h + 1) * head_dim)
            for h in keep_indices
        ])

        # Slice rows out of Q, K, V (these produce the per-head outputs)
        for proj in ['q_proj', 'k_proj', 'v_proj']:
            old = getattr(layer.self_attn, proj)
            new = nn.Linear(old.in_features, len(keep_dims), bias=False)
            new.weight.data = old.weight.data[keep_dims].clone()
            setattr(layer.self_attn, proj, new)

        # Slice the matching columns out of the output projection
        o_proj = layer.self_attn.o_proj
        new_o = nn.Linear(len(keep_dims), o_proj.out_features, bias=False)
        new_o.weight.data = o_proj.weight.data[:, keep_dims].clone()
        layer.self_attn.o_proj = new_o

    return model

This is essentially neural architecture search with pruning as the search mechanism. The result is not a “sparse” model; it is a smaller model. Tools like torch.nn.utils.prune and the Neural Magic SparseML library support this workflow.
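As a concrete example of that workflow, `torch.nn.utils.prune.ln_structured` zeroes whole output channels by norm; slicing the zeroed rows away afterwards yields a genuinely smaller dense layer (a minimal sketch; a real pipeline must also fix up downstream layers that consume this output):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Zero out the 50% of output channels with the smallest L2 norm
layer = nn.Linear(256, 128, bias=False)
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)
prune.remove(layer, "weight")   # bake the mask into the weight tensor

# The zeroed rows can then be physically sliced away: the result is not a
# sparse layer but a smaller dense one that is faster on any hardware
keep = layer.weight.abs().sum(dim=1) != 0
small = nn.Linear(256, int(keep.sum()), bias=False)
small.weight.data = layer.weight.data[keep].clone()
```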

On-Device Deployment With Custom Hardware

Some edge and mobile accelerators have genuine sparse execution support. Apple’s Neural Engine, Qualcomm’s Hexagon DSP, and certain FPGA implementations can benefit from weight sparsity. If your deployment target has sparse hardware support, pruning becomes viable.

Memory-Constrained Scenarios Combined With Quantization

When you need maximum compression and can combine pruning with quantization, the savings stack. A 50% pruned, 4-bit quantized model uses 8x less memory than the FP16 baseline.

Interpretability and Model Understanding

Pruning reveals which parts of a model are important. Even if you do not deploy the pruned model, the pruning analysis tells you about redundancy patterns, critical subnetworks (the Lottery Ticket Hypothesis), and potential for architecture improvements.

📊

When to Use Each Compression Technique

Scenario | Best Technique | Rationale | Expected Outcome
LLM serving on GPU | INT4/INT8 Quantization | Direct memory and compute reduction | 2-4x faster, 2-4x less memory
Edge deployment (NVIDIA) | 2:4 Sparsity + INT8 | Ampere Sparse Tensor Cores | 3-4x speedup
Edge deployment (Apple) | Distillation + Quantization | Smaller architecture + Core ML INT8 | 5-10x smaller and faster
Training cost reduction | Distillation | Train smaller model from scratch | 10-100x less training compute
Maximum compression | Pruning + Quantization | Stacking compression techniques | 8-16x compression
CPU inference | INT8 Quantization | Mature tooling, guaranteed speedup | 2-3x faster
Research / understanding | Unstructured Pruning | Reveals model structure | Insights, not speed

The Lottery Ticket Hypothesis: Beautiful Theory, Limited Practice

Frankle and Carbin’s 2019 Lottery Ticket Hypothesis showed that dense networks contain sparse subnetworks (“winning tickets”) that, when trained from their original initialization, match the full network’s accuracy. This was an intellectually stunning result that generated enormous interest.

However, the practical impact has been limited:

  • Finding tickets is expensive. The original method requires training the full network to convergence, pruning, resetting to initial weights, and retraining. This costs 2-3x the compute of normal training.
  • Scaling issues. The hypothesis holds cleanly for small networks (LeNet, small ResNets) but the picture is murkier for large models. Later work showed that “late resetting” (resetting to weights from early training, not initialization) is needed for larger networks.
  • No deployment advantage. The found subnetwork has the same irregular sparsity problem described above.

The Lottery Ticket Hypothesis is important for our theoretical understanding of neural networks, but it has not produced a practical compression pipeline.

Practical Decision Framework

Here is a concrete decision tree for choosing a compression technique:

💡 Compression Decision Guide
  1. Start with INT8 quantization. It is free performance on all modern hardware with minimal accuracy loss. If this is sufficient, stop here.
  2. If you need more compression for LLMs, try INT4 (GPTQ or AWQ). This gives 4x memory reduction and works well for inference.
  3. If deploying on Ampere+ GPUs and need more speed, add 2:4 structured sparsity on top of quantization.
  4. If you have a fixed latency budget and the model is too slow even after quantization, consider distilling into a smaller architecture.
  5. Use unstructured pruning only if you have custom hardware with sparse support, or for research purposes.

Compression Technique Adoption in Production (2024 Survey)

[Bar chart: % of production deployments using each compression technique]

Looking Forward: Will Pruning Ever Have Its Day?

Several developments could change the pruning landscape:

Sparse hardware is improving. NVIDIA’s Hopper architecture extends structured sparsity support. Cerebras’s wafer-scale engine has native support for arbitrary sparsity. Intel’s Ponte Vecchio has sparse matrix acceleration. As more hardware supports sparsity, the practical value of pruning increases.

Compiler-driven sparsity. Projects like Triton and the MLIR sparse tensor dialect are building software infrastructure for sparse computation. If compilers can automatically generate efficient sparse kernels, the software gap closes.

LLM efficiency pressure. As LLMs grow to hundreds of billions of parameters, every compression technique becomes more valuable. The combination of pruning, quantization, and distillation may become standard.

Mixture of Experts as “pruning.” MoE models like Mixtral activate only a subset of parameters for each token. This is structurally similar to pruning — only a fraction of the model runs for any given input — but it is built into the architecture rather than applied post-training. MoE may be where the pruning intuition finds its most practical expression.

Conclusion

The model compression landscape today is clear: quantization dominates because it aligns with hardware capabilities. Knowledge distillation occupies a valuable niche for creating genuinely smaller architectures. Pruning, despite beautiful theory and strong research results, remains limited by the fundamental mismatch between irregular sparsity and dense hardware execution.

The exception is structured sparsity on specific hardware (NVIDIA Ampere’s 2:4 pattern), which delivers real speedups but only 50% compression. For most practitioners, the right approach is: quantize first, distill if you need a smaller architecture, and consider structured pruning only on compatible hardware as an additional optimization.

The research trajectory suggests that pruning’s day may eventually come as hardware evolves. But for production deployments today, quantization has won, and the evidence is overwhelming.