Most quantization benchmarks are misleading. A common pattern: a paper reports WikiText-2 perplexity for a quantized Llama model, shows a 0.1 PPL increase, and declares the quantization “near-lossless.” But WikiText-2 perplexity measures a narrow distribution of Wikipedia text. It does not capture the model’s ability to follow instructions, generate code, perform multi-step reasoning, or handle multilingual inputs. A 0.1 PPL increase on WikiText-2 can correspond to a 5-10 percentage point drop on GSM8K (math reasoning) or HumanEval (code generation).
Quantization benchmarking requires measuring three axes: quality (does the model still produce correct outputs?), throughput (how many tokens per second per dollar?), and the interaction between them (what is the quality-throughput Pareto frontier?). This post provides a rigorous methodology for all three.
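To make the Pareto-frontier framing concrete, here is a minimal sketch (illustrative numbers, not measurements) that extracts the non-dominated configurations from a set of (quality, throughput) points:

```python
def pareto_frontier(points):
    """Return the configs not dominated on (quality, throughput).

    A config is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, q, t in points:
        dominated = any(
            q2 >= q and t2 >= t and (q2 > q or t2 > t)
            for _, q2, t2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (MMLU accuracy, decode tokens/sec) measurements:
configs = [
    ("FP16", 69.8, 2150),
    ("INT8", 69.1, 3400),
    ("INT4-AWQ", 69.3, 5800),
    ("INT4-GPTQ", 69.1, 5500),  # dominated by INT4-AWQ
]
print(pareto_frontier(configs))  # -> ['FP16', 'INT4-AWQ']
```

Note that INT8 drops off the frontier here: a config can be strictly worse on both axes than another quantized config even while being "better than FP16" on throughput.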
Quality Benchmarking
Perplexity: Necessary but Not Sufficient
Perplexity measures how well the model predicts held-out text. It is the exponential of the average cross-entropy loss:
\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i | x_{<i})\right)
# Correct perplexity measurement for quantized models
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def measure_perplexity(model, tokenizer, dataset, max_length=2048, stride=512):
    """
    Sliding-window perplexity with proper stride handling.

    Common mistakes to avoid:
    1. Not using a sliding window (truncates long documents)
    2. Using stride = max_length (misses cross-boundary predictions)
    3. Not computing per-token loss (aggregating wrong)
    4. Including padding tokens in the loss
    """
    model.eval()
    nlls = []
    n_tokens = 0
    for text in dataset:
        encodings = tokenizer(text, return_tensors="pt")
        input_ids = encodings.input_ids.to(model.device)
        seq_len = input_ids.shape[1]
        prev_end_loc = 0
        for begin_loc in range(0, seq_len, stride):
            end_loc = min(begin_loc + max_length, seq_len)
            trg_len = end_loc - prev_end_loc  # Tokens to compute loss on
            input_chunk = input_ids[:, begin_loc:end_loc]
            # Create target labels
            target_ids = input_chunk.clone()
            # Mask out tokens we already computed loss for
            target_ids[:, :-trg_len] = -100
            with torch.no_grad():
                outputs = model(input_chunk, labels=target_ids)
                # outputs.loss is averaged over non-masked tokens
                neg_log_likelihood = outputs.loss * trg_len
            nlls.append(neg_log_likelihood)
            n_tokens += trg_len
            prev_end_loc = end_loc
            if end_loc == seq_len:
                break
    ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
    return ppl.item()
# Critical: use the SAME evaluation code for both FP16 and quantized models
# Any difference in tokenization, context window, or stride invalidates comparison
Perplexity is sensitive to the evaluation dataset and methodology. Differences in stride length, context window handling, and tokenizer configuration can shift perplexity by 0.1-0.5 points, which is the same magnitude as many quantization effects. Always compare models using identical evaluation code running on the same dataset with the same hyperparameters.
Task-Specific Evaluation
Perplexity misses task-specific degradation. A quantized model may have near-identical perplexity but fail on structured tasks:
# Comprehensive task-specific evaluation suite
evaluation_suite = {
    # Reasoning
    "GSM8K": {
        "metric": "exact_match_accuracy",
        "n_shots": 8,
        "description": "Grade school math word problems",
        "sensitive_to_quantization": True,  # Reasoning chains break
    },
    "ARC-Challenge": {
        "metric": "accuracy",
        "n_shots": 25,
        "description": "Science question answering",
        "sensitive_to_quantization": False,  # Multiple choice is robust
    },
    # Code generation
    "HumanEval": {
        "metric": "pass@1",
        "n_shots": 0,
        "description": "Python function completion",
        "sensitive_to_quantization": True,  # Exact syntax matters
    },
    "MBPP": {
        "metric": "pass@1",
        "n_shots": 3,
        "description": "Mostly Basic Python Problems",
        "sensitive_to_quantization": True,
    },
    # Knowledge
    "MMLU": {
        "metric": "accuracy",
        "n_shots": 5,
        "description": "Massive multitask language understanding",
        "sensitive_to_quantization": False,  # MC robust to small noise
    },
    "TriviaQA": {
        "metric": "exact_match",
        "n_shots": 5,
        "description": "Open-domain QA",
        "sensitive_to_quantization": True,  # Exact recall matters
    },
    # Instruction following
    "IFEval": {
        "metric": "instruction_following_rate",
        "n_shots": 0,
        "description": "Instruction following evaluation",
        "sensitive_to_quantization": True,  # Format compliance breaks
    },
    "MT-Bench": {
        "metric": "gpt4_judge_score",
        "n_shots": 0,
        "description": "Multi-turn conversation quality",
        "sensitive_to_quantization": False,  # LLM judge robust to style shifts
    },
}
def run_evaluation_suite(model, tokenizer, suite):
    """Run all benchmarks and return structured results."""
    results = {}
    for task_name, config in suite.items():
        dataset = load_benchmark(task_name)
        predictions = generate_predictions(model, tokenizer, dataset, config)
        score = compute_metric(predictions, dataset, config["metric"])
        results[task_name] = {
            "score": score,
            "n_samples": len(dataset),
            "metric": config["metric"],
        }
    return results
Task-Specific Quality Impact of INT4 Quantization (Llama-2-70B)
| Benchmark | FP16 | GPTQ INT4 | AWQ INT4 | Delta (worst) |
|---|---|---|---|---|
| WikiText-2 PPL | 3.32 | 3.41 | 3.39 | +0.09 |
| GSM8K (accuracy) | 56.8% | 52.1% | 53.4% | -4.7pp |
| HumanEval (pass@1) | 29.9% | 26.2% | 27.1% | -3.7pp |
| MMLU (accuracy) | 69.8% | 69.1% | 69.3% | -0.7pp |
| ARC-Challenge | 64.4% | 63.8% | 64.0% | -0.6pp |
| IFEval (strict) | 52.1% | 48.3% | 49.5% | -3.8pp |
| MT-Bench (avg) | 7.42 | 7.31 | 7.35 | -0.11 |
The pattern is consistent across models: tasks requiring exact reasoning (GSM8K), precise code generation (HumanEval), or strict format compliance (IFEval) are 3-5x more sensitive to quantization than multiple-choice knowledge tasks (MMLU, ARC) or open-ended generation (MT-Bench). If your application involves structured output, code, or math, perplexity alone is an unreliable quality indicator.
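One useful complementary view of the table above is relative degradation (worst-case drop divided by the FP16 baseline), which normalizes for the tasks' very different baseline scores. A quick sketch using the numbers from the table:

```python
# (FP16 baseline, worst-case quantized score) from the table above
scores = {
    "GSM8K": (56.8, 52.1),
    "HumanEval": (29.9, 26.2),
    "IFEval": (52.1, 48.3),
    "MMLU": (69.8, 69.1),
    "ARC-Challenge": (64.4, 63.8),
}
rel = {t: (fp16 - q) / fp16 * 100 for t, (fp16, q) in scores.items()}
# HumanEval tops the list at ~12% relative; MMLU and ARC stay near 1%
for task, pct in sorted(rel.items(), key=lambda kv: -kv[1]):
    print(f"{task}: {pct:.1f}% relative drop")
```

On a relative basis the gap is even starker: HumanEval loses over 12% of its baseline score while the multiple-choice tasks lose around 1%.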
Statistical Significance
Small evaluation sets produce noisy estimates. A 2 percentage point difference on HumanEval (164 problems) is not statistically significant:
import numpy as np
from scipy import stats
def is_significant(score_a, score_b, n_samples, confidence=0.95):
    """
    Test whether the difference between two accuracy scores
    is statistically significant using a two-proportion z-test.
    """
    p1 = score_a
    p2 = score_b
    p_pool = (p1 * n_samples + p2 * n_samples) / (2 * n_samples)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n_samples)
    z = abs(p1 - p2) / se
    p_value = 2 * (1 - stats.norm.cdf(z))
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
    significant = z > z_critical
    return {
        "z_statistic": z,
        "p_value": p_value,
        "significant": significant,
        "min_detectable_difference": z_critical * se,
    }
# HumanEval: 164 problems
result = is_significant(0.299, 0.262, 164)
# z=0.75, p=0.46 -> NOT significant
# Min detectable difference: ±9.7 percentage points at 95% confidence
# HumanEval is too small to detect typical quantization effects!

# GSM8K: 1319 problems
result = is_significant(0.568, 0.521, 1319)
# z=2.42, p=0.015 -> Significant
# Min detectable difference: ±3.8 percentage points at 95% confidence

# MMLU: ~14000 questions across all subjects
result = is_significant(0.698, 0.691, 14000)
# z=1.27, p=0.20 -> NOT significant (0.7pp is within noise)
# Min detectable difference: ±1.1 percentage points at 95% confidence
A benchmark must have enough samples to detect the expected effect size. HumanEval (164 problems) cannot reliably detect differences smaller than roughly 10 percentage points. MMLU (14K questions) can detect differences of about 1pp. Always report confidence intervals, not just point estimates. When the confidence interval for the difference includes zero, the result is inconclusive, not “near-lossless.”
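The same z-test can be inverted into a planning tool: given the effect size you expect quantization to cause, how many samples does a benchmark need? A minimal sketch using the standard two-proportion power approximation (the pooled-variance form matching the function above; z values are hardcoded to avoid a SciPy dependency, and 80% power is an assumed default):

```python
def required_samples(p_baseline, min_detectable_pp, power=0.80):
    """Samples needed for a two-proportion z-test to detect a drop of
    `min_detectable_pp` percentage points at 95% confidence.

    Pooled-variance approximation; z values hardcoded (no SciPy needed).
    """
    z_alpha = 1.96                           # two-sided, 95% confidence
    z_beta = {0.80: 0.8416, 0.90: 1.2816}[power]
    delta = min_detectable_pp / 100.0
    variance = 2 * p_baseline * (1 - p_baseline)
    return int((z_alpha + z_beta) ** 2 * variance / delta ** 2) + 1

# Detecting a 3pp drop from a 57% baseline (GSM8K-like) needs ~4,300 samples:
print(required_samples(0.57, 3.0))
```

This is why GSM8K's 1,319 problems are marginal for 3pp effects and HumanEval's 164 are hopeless: sample requirements grow with the inverse square of the effect size.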
Throughput Benchmarking
What to Measure
Throughput depends on the workload. The same quantized model shows different speedups for different request patterns:
# Throughput measurement framework
class ThroughputBenchmark:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def measure_decode_throughput(self, batch_sizes, n_warmup=10, n_measure=100):
        """
        Measure pure decode throughput (single token generation).
        This is the memory-bandwidth-bound regime.
        """
        results = {}
        for bs in batch_sizes:
            # Simulate decode: context of 512 tokens, generate 1
            dummy_input = torch.randint(0, 32000, (bs, 512)).cuda()
            # Warmup
            for _ in range(n_warmup):
                with torch.no_grad():
                    self.model(dummy_input)
            torch.cuda.synchronize()
            # Measure
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(n_measure):
                with torch.no_grad():
                    self.model.generate(
                        dummy_input,
                        max_new_tokens=1,
                        do_sample=False,
                    )
            end.record()
            torch.cuda.synchronize()
            elapsed_ms = start.elapsed_time(end)
            tokens_per_sec = (bs * n_measure) / (elapsed_ms / 1000)
            results[bs] = tokens_per_sec
        return results

    def measure_prefill_throughput(self, seq_lengths, batch_size=1):
        """
        Measure prefill throughput (processing input prompt).
        This is the compute-bound regime.
        """
        results = {}
        for seq_len in seq_lengths:
            input_ids = torch.randint(0, 32000, (batch_size, seq_len)).cuda()
            # Warmup
            for _ in range(5):
                with torch.no_grad():
                    self.model(input_ids)
            torch.cuda.synchronize()
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            n_iter = 50
            start.record()
            for _ in range(n_iter):
                with torch.no_grad():
                    self.model(input_ids)
            end.record()
            torch.cuda.synchronize()
            elapsed_ms = start.elapsed_time(end)
            tokens_per_sec = (batch_size * seq_len * n_iter) / (elapsed_ms / 1000)
            results[seq_len] = tokens_per_sec
        return results

    def measure_end_to_end(self, prompts, max_new_tokens=256):
        """
        End-to-end benchmark: prefill + decode with real prompts.
        Most representative of production performance.
        """
        latencies = []
        for prompt in prompts:
            input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.cuda()
            torch.cuda.synchronize()
            start_time = torch.cuda.Event(enable_timing=True)
            end_time = torch.cuda.Event(enable_timing=True)
            start_time.record()
            output = self.model.generate(
                input_ids,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
            )
            end_time.record()
            torch.cuda.synchronize()
            n_generated = output.shape[1] - input_ids.shape[1]
            elapsed_ms = start_time.elapsed_time(end_time)
            latencies.append({
                "prompt_tokens": input_ids.shape[1],
                "generated_tokens": n_generated,
                "total_ms": elapsed_ms,
                # Crude proportional TTFT estimate; measure with a
                # streaming callback for production-grade numbers
                "ttft_ms": elapsed_ms * (input_ids.shape[1] / output.shape[1]),
                "tps": n_generated / (elapsed_ms / 1000),
            })
        return latencies
Common Throughput Measurement Mistakes
# MISTAKE 1: Measuring throughput without warmup
# The first few iterations include JIT compilation, memory allocation,
# and CUDA context setup. They can be 2-10x slower.
# MISTAKE 2: Using torch.cuda.synchronize() incorrectly
# BAD: timing includes CPU overhead
import time
start = time.time()
model(input_ids) # This returns immediately (async!)
torch.cuda.synchronize() # This waits for GPU
end = time.time()
# The measured time includes CPU-side Python overhead
# GOOD: use CUDA events for GPU-only timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
model(input_ids)
end_event.record()
torch.cuda.synchronize()
gpu_time_ms = start_event.elapsed_time(end_event)
# MISTAKE 3: Not controlling for KV cache size
# Decode throughput at position 100 is different from position 4000
# because the KV cache read dominates memory bandwidth at long contexts
# MISTAKE 4: Reporting peak throughput instead of sustained throughput
# Peak: BS=1, short context, ideal conditions
# Sustained: continuous serving with mixed request lengths and batch sizes
# MISTAKE 5: Not controlling GPU clock speed
# GPU may be in a low-power state at the start, then boost
# Use nvidia-smi to lock clocks:
# nvidia-smi -lgc 1980,1980 # Lock graphics clock to max
# nvidia-smi -lmc 1593,1593 # Lock memory clock to max
Latency Percentile Analysis
Mean latency hides the tail. For production serving, P50, P95, and P99 latencies matter:
import numpy as np
def latency_percentile_analysis(latencies_ms):
    """Compute latency percentiles from a list of measurements."""
    arr = np.array(latencies_ms)
    return {
        "P50": np.percentile(arr, 50),
        "P90": np.percentile(arr, 90),
        "P95": np.percentile(arr, 95),
        "P99": np.percentile(arr, 99),
        "P99.9": np.percentile(arr, 99.9),
        "mean": np.mean(arr),
        "std": np.std(arr),
        "min": np.min(arr),
        "max": np.max(arr),
        "n_samples": len(arr),
    }
# Example results for quantized model serving:
# FP16 (Llama-2-7B, BS=1, decode):
# P50=8.2ms, P95=9.1ms, P99=12.8ms, P99.9=45.2ms
#
# INT4 (same model, same conditions):
# P50=4.1ms, P95=4.5ms, P99=5.8ms, P99.9=18.3ms
#
# Key observations:
# Median speedup: 2.0x (expected from 4x weight compression)
# P99 speedup: 2.2x (quantized model has tighter distribution)
# P99.9 speedup: 2.5x (fewer outliers due to less memory pressure)
# The tail latency improvement is larger than the median improvement
# because the smaller model causes less memory contention
Latency Distribution: FP16 vs INT4 (Llama-2-7B, H100, BS=1 Decode)
Cost Analysis
Cost-Per-Token Calculation
The true cost of quantization includes GPU cost, throughput, and quality:
def cost_per_million_tokens(
    gpu_hourly_cost,       # $/hour for the GPU instance
    throughput_tps,        # Tokens per second (sustained)
    quality_factor=1.0,    # Multiplier for quality-adjusted cost
):
    """Calculate the cost per million output tokens."""
    tokens_per_hour = throughput_tps * 3600
    cost_per_token = gpu_hourly_cost / tokens_per_hour
    cost_per_million = cost_per_token * 1e6
    return cost_per_million * quality_factor
# H100 SXM pricing (cloud, approximate):
# On-demand: $3.50/hour
# Reserved: $2.10/hour
# Spot: $1.20/hour
# Llama-2-70B serving cost analysis (on-demand H100):
configs = {
    "FP16, 2x H100, TP=2": {
        "gpus": 2,
        "cost_per_hour": 7.00,
        "throughput": 2150,  # Total tokens/sec
    },
    "INT8, 1x H100": {
        "gpus": 1,
        "cost_per_hour": 3.50,
        "throughput": 3400,
    },
    "INT4, 1x H100": {
        "gpus": 1,
        "cost_per_hour": 3.50,
        "throughput": 5800,
    },
    "INT4, 1x RTX 4090": {
        "gpus": 1,
        "cost_per_hour": 0.40,  # Much cheaper consumer GPU
        "throughput": 1200,
    },
}

for name, cfg in configs.items():
    cpm = cost_per_million_tokens(cfg["cost_per_hour"], cfg["throughput"])
    print(f"{name}: ${cpm:.3f} per million tokens")
# Results:
# FP16, 2x H100, TP=2: $0.905 per million tokens
# INT8, 1x H100: $0.286 per million tokens (3.2x cheaper)
# INT4, 1x H100: $0.168 per million tokens (5.4x cheaper)
# INT4, 1x RTX 4090: $0.093 per million tokens (9.7x cheaper, but lower QoS)
Cost Efficiency by Quantization Method (Llama-2-70B, H100 On-Demand)
| Configuration | GPUs | Throughput | Cost/M Tokens | Quality (MMLU) |
|---|---|---|---|---|
| FP16, TP=2 | 2x H100 | 2,150 tok/s | $0.905 | 69.8% |
| INT8 (SmoothQuant) | 1x H100 | 3,400 tok/s | $0.286 | 69.1% |
| INT4 (AWQ g128) | 1x H100 | 5,800 tok/s | $0.168 | 69.3% |
| INT4 (GPTQ g128) | 1x H100 | 5,500 tok/s | $0.177 | 69.1% |
| INT4 (AWQ), RTX 4090 | 1x 4090 | 1,200 tok/s | $0.093 | 69.3% |
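The `quality_factor` parameter of `cost_per_million_tokens` can fold the quality column into this comparison. One crude but useful heuristic, shown here as a sketch, is to inflate each config's cost by the ratio of baseline quality to its own quality; the right adjustment is ultimately application-specific:

```python
def quality_adjusted_cost(cost_per_m, quality, baseline_quality):
    """Inflate cost by the relative quality loss vs the FP16 baseline."""
    return cost_per_m * (baseline_quality / quality)

# (cost/M tokens, MMLU accuracy) from the table above:
rows = {
    "FP16, TP=2": (0.905, 69.8),
    "INT8 (SmoothQuant)": (0.286, 69.1),
    "INT4 (AWQ g128)": (0.168, 69.3),
}
for name, (cost, mmlu) in rows.items():
    adj = quality_adjusted_cost(cost, mmlu, 69.8)
    print(f"{name}: ${adj:.3f}/M tokens quality-adjusted")
# INT4 AWQ stays ~5.4x cheaper: its MMLU penalty is under 1%
```

Because the MMLU deltas are small, the adjustment barely moves the ranking here; it matters far more when the quality metric is a sensitive one like GSM8K.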
The Complete Benchmarking Framework
# Full benchmarking pipeline: quality + throughput + cost
class QuantizationBenchmark:
    def __init__(self, model_name, quantization_configs):
        self.model_name = model_name
        self.configs = quantization_configs  # Dict of quant methods to compare
        self.results = {}

    def run(self):
        for config_name, config in self.configs.items():
            print(f"Benchmarking: {config_name}")
            # 1. Load model
            model, tokenizer = load_quantized_model(self.model_name, config)
            # 2. Quality evaluation
            quality = {}
            quality["perplexity_wikitext2"] = measure_perplexity(
                model, tokenizer, load_dataset("wikitext2_test")
            )
            quality["perplexity_c4"] = measure_perplexity(
                model, tokenizer, load_dataset("c4_validation")
            )
            quality["gsm8k"] = evaluate_gsm8k(model, tokenizer)
            quality["humaneval"] = evaluate_humaneval(model, tokenizer)
            quality["mmlu"] = evaluate_mmlu(model, tokenizer)
            quality["ifeval"] = evaluate_ifeval(model, tokenizer)
            # 3. Throughput evaluation
            throughput = {}
            throughput["decode_bs1"] = measure_decode_throughput(
                model, batch_sizes=[1], context_len=512
            )
            throughput["decode_bs32"] = measure_decode_throughput(
                model, batch_sizes=[32], context_len=512
            )
            throughput["prefill_2048"] = measure_prefill_throughput(
                model, seq_lengths=[2048], batch_size=1
            )
            throughput["e2e_latency"] = measure_end_to_end(
                model, tokenizer,
                prompts=load_benchmark_prompts("sharegpt_sample_500"),
                max_new_tokens=256,
            )
            # 4. Memory measurement
            memory = {
                "model_size_gb": get_model_memory_gb(model),
                "peak_memory_gb": torch.cuda.max_memory_allocated() / 1e9,
            }
            # 5. Latency percentiles
            latencies = [r["total_ms"] for r in throughput["e2e_latency"]]
            percentiles = latency_percentile_analysis(latencies)
            self.results[config_name] = {
                "quality": quality,
                "throughput": throughput,
                "memory": memory,
                "latency_percentiles": percentiles,
            }
            # Cleanup
            del model
            torch.cuda.empty_cache()
        # 6. Statistical comparison
        self.compare_results()

    def compare_results(self):
        """Compare all configs against the FP16 baseline."""
        baseline = self.results.get("FP16", None)
        if baseline is None:
            return
        for config_name, result in self.results.items():
            if config_name == "FP16":
                continue
            print(f"\n--- {config_name} vs FP16 ---")
            for task, score in result["quality"].items():
                base_score = baseline["quality"][task]
                delta = score - base_score
                sig = is_significant(score, base_score, n_samples=get_n_samples(task))
                status = "SIGNIFICANT" if sig["significant"] else "not significant"
                print(f"  {task}: {score:.4f} ({delta:+.4f}) [{status}]")

    def generate_report(self):
        """Generate a structured comparison table."""
        # Output format suitable for a blog post or results table
        pass
Reporting Template
Required information for a quantization benchmark report:
1. Model: exact model name, parameter count, architecture
2. Quantization: method, bits, group size, calibration dataset, calibration samples, any preprocessing (e.g. SmoothQuant alpha)
3. Hardware: GPU model, count, memory, driver version, CUDA version
4. Quality:
   - Perplexity on at least 2 datasets (WikiText-2 + C4/PTB)
   - Task accuracy on at least 3 benchmarks covering reasoning + code + knowledge
   - Confidence intervals for all metrics
   - Sample sizes for each benchmark
5. Throughput:
   - Decode tokens/sec at BS=1 and at the serving batch size
   - Prefill tokens/sec for the target sequence length
   - TTFT (time to first token) at representative prompt lengths
   - P50/P95/P99 latency distributions
6. Cost:
   - GPU type and pricing
   - Cost per million tokens at the measured throughput
7. Reproducibility:
   - Exact software versions (PyTorch, transformers, vLLM, etc.)
   - Random seeds used
   - Scripts to reproduce all measurements
The single most common benchmarking mistake is comparing models using different evaluation harnesses. The lm-eval-harness, vLLM evaluation, and HuggingFace evaluate library can produce different scores for the same model on the same benchmark due to differences in prompt formatting, sampling, and tokenization. Always use the same evaluation code for all configurations in a comparison.
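Item 7 of the template is easy to automate. A small sketch (stdlib only, via `importlib.metadata`) that captures the software environment alongside the results; packages that are not installed are simply recorded as None:

```python
import json
import platform
import sys
from importlib import metadata  # stdlib since Python 3.8

def environment_manifest(packages=("torch", "transformers", "vllm")):
    """Record the software environment for the reproducibility section."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # Not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

# Store this dict next to the benchmark results
print(json.dumps(environment_manifest(), indent=2))
```

Writing this manifest into the same JSON file as the benchmark scores makes it impossible to later mix up results produced by different harness versions.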
Benchmark Anti-Patterns
# Anti-pattern 1: Cherry-picking the favorable benchmark
# "Our INT4 model achieves 99.5% of FP16 on MMLU!"
# (But they didn't measure GSM8K where it drops 8%)
# Anti-pattern 2: Measuring throughput without realistic serving load
# "Our INT4 model achieves 10,000 tokens/sec!"
# (But that is BS=256 prefill throughput, not decode at BS=1)
# Anti-pattern 3: Comparing different model sizes
# "INT4 70B matches FP16 13B on MMLU at half the memory!"
# (But this is not a fair comparison of quantization quality)
# Anti-pattern 4: Ignoring calibration time in cost analysis
# GPTQ calibration on 70B: 4 GPU-hours
# AQLM calibration on 70B: 128 GPU-hours
# The amortized calibration cost matters for frequent model updates
# Anti-pattern 5: Using temperature=0 for all evaluations
# Greedy decoding masks the quality difference between models
# Use temperature=0.7 or the task-appropriate sampling to reveal
# distributional differences
# Anti-pattern 6: Benchmarking on the calibration dataset
# If you calibrated on WikiText-2, your WikiText-2 perplexity is biased
# Always evaluate on held-out data that was not used for calibration
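Anti-pattern 4 is straightforward to quantify: amortize the one-time calibration GPU-hours over the tokens served before the next model update. A sketch using the calibration figures above, $3.50/GPU-hour, and a hypothetical 1,000M-token serving volume (the AQLM serving cost is assumed equal to the AWQ config's for illustration):

```python
def amortized_cost_per_million(
    serving_cost_per_m,      # $/M tokens while serving
    calibration_gpu_hours,   # one-time calibration cost in GPU-hours
    gpu_hourly_cost,         # $/GPU-hour
    tokens_served_millions,  # M tokens served before the next recalibration
):
    """Serving cost plus amortized one-time calibration cost, per M tokens."""
    calibration_cost = calibration_gpu_hours * gpu_hourly_cost
    return serving_cost_per_m + calibration_cost / tokens_served_millions

gptq = amortized_cost_per_million(0.177, 4, 3.50, 1000)    # ~$0.191/M tokens
aqlm = amortized_cost_per_million(0.168, 128, 3.50, 1000)  # ~$0.616/M tokens
print(f"GPTQ: ${gptq:.3f}/M  AQLM: ${aqlm:.3f}/M")
```

At this serving volume the expensive calibration dominates: AQLM's amortized cost is more than triple GPTQ's despite a slightly cheaper serving rate. The gap shrinks as the volume between model updates grows.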
What a Complete Benchmark Report Looks Like (Llama-2-70B INT4 AWQ)
| Metric | FP16 Baseline | INT4 AWQ | Delta | 95% CI |
|---|---|---|---|---|
| WikiText-2 PPL | 3.32 | 3.39 | +0.07 | N/A (PPL) |
| C4 PPL | 6.47 | 6.62 | +0.15 | N/A (PPL) |
| GSM8K (n=1319) | 56.8% | 53.4% | -3.4pp | [1.0, 5.8]pp |
| HumanEval (n=164) | 29.9% | 27.1% | -2.8pp | [-3.2, 8.8]pp NS |
| MMLU (n=14042) | 69.8% | 69.3% | -0.5pp | [-0.3, 1.3]pp NS |
| Decode tok/s (BS=1) | 45 | 98 | +2.18x | N/A |
| Cost/M tokens | $0.905 | $0.168 | -81% | N/A |
Summary
Quantization benchmarking requires measuring quality on task-specific benchmarks (not just perplexity), throughput under realistic serving conditions (not just peak numbers), and cost-per-token that accounts for GPU pricing and utilization. Report confidence intervals for quality metrics and latency percentiles for performance metrics. The most common failure modes are: evaluating only on perplexity, comparing models with different evaluation harnesses, measuring throughput at unrealistic batch sizes, and not controlling for GPU thermal state and clock speeds. A rigorous benchmark must include at least one reasoning task (GSM8K), one code task (HumanEval/MBPP), and one knowledge task (MMLU), alongside perplexity, decode throughput at production batch size, and tail latency percentiles.