Most quantization benchmarks are misleading. A common pattern: a paper reports WikiText-2 perplexity for a quantized Llama model, shows a 0.1 PPL increase, and declares the quantization “near-lossless.” But WikiText-2 perplexity measures a narrow distribution of Wikipedia text. It does not capture the model’s ability to follow instructions, generate code, perform multi-step reasoning, or handle multilingual inputs. A 0.1 PPL increase on WikiText-2 can correspond to a 5-10 percentage point drop on GSM8K (math reasoning) or HumanEval (code generation).
Quantization benchmarking requires measuring three axes: quality (does the model still produce correct outputs?), throughput (how many tokens per second per dollar?), and the interaction between them (what is the quality-throughput Pareto frontier?). This post provides a rigorous methodology for all three.
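To make the Pareto-frontier framing concrete, here is a minimal sketch (illustrative numbers, not measurements) that extracts the non-dominated configurations from a set of (quality, throughput) points:

```python
def pareto_frontier(points):
    """Return the configs not dominated on (quality, throughput).

    A config is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, q, t in points:
        dominated = any(
            q2 >= q and t2 >= t and (q2 > q or t2 > t)
            for _, q2, t2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (MMLU accuracy, decode tokens/sec) measurements:
configs = [
    ("FP16", 69.8, 2150),
    ("INT8", 69.1, 3400),
    ("INT4-AWQ", 69.3, 5800),
    ("INT4-GPTQ", 69.1, 5500),  # dominated by INT4-AWQ
]
print(pareto_frontier(configs))  # -> ['FP16', 'INT4-AWQ']
```

Note that INT8 drops off the frontier here: a config can be strictly worse on both axes than another quantized config even while being "better than FP16" on throughput.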
Quality Benchmarking
Perplexity: Necessary but Not Sufficient
Perplexity measures how well the model predicts held-out text. It is the exponential of the average cross-entropy loss:
\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i | x_{<i})\right)
# Correct perplexity measurement for quantized models
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def measure_perplexity(model, tokenizer, dataset, max_length=2048, stride=512):
    """
    Sliding-window perplexity with proper stride handling.

    Common mistakes to avoid:
    1. Not using a sliding window (truncates long documents)
    2. Using stride = max_length (misses cross-boundary predictions)
    3. Not computing per-token loss (aggregating wrong)
    4. Including padding tokens in the loss
    """
    model.eval()
    nlls = []
    n_tokens = 0
    for text in dataset:
        encodings = tokenizer(text, return_tensors="pt")
        input_ids = encodings.input_ids.to(model.device)
        seq_len = input_ids.shape[1]
        prev_end_loc = 0
        for begin_loc in range(0, seq_len, stride):
            end_loc = min(begin_loc + max_length, seq_len)
            trg_len = end_loc - prev_end_loc  # Tokens to compute loss on
            input_chunk = input_ids[:, begin_loc:end_loc]
            # Create target labels
            target_ids = input_chunk.clone()
            # Mask out tokens we already computed loss for
            target_ids[:, :-trg_len] = -100
            with torch.no_grad():
                outputs = model(input_chunk, labels=target_ids)
                # outputs.loss is averaged over non-masked tokens
                neg_log_likelihood = outputs.loss * trg_len
            nlls.append(neg_log_likelihood)
            n_tokens += trg_len
            prev_end_loc = end_loc
            if end_loc == seq_len:
                break
    ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
    return ppl.item()
# Critical: use the SAME evaluation code for both FP16 and quantized models
# Any difference in tokenization, context window, or stride invalidates comparison
Perplexity is sensitive to the evaluation dataset and methodology. Differences in stride length, context window handling, and tokenizer configuration can shift perplexity by 0.1-0.5 points, which is the same magnitude as many quantization effects. Always compare models using identical evaluation code running on the same dataset with the same hyperparameters.
Task-Specific Evaluation
Perplexity misses task-specific degradation. A quantized model may have near-identical perplexity but fail on structured tasks:
# Comprehensive task-specific evaluation suite
evaluation_suite = {
    # Reasoning
    "GSM8K": {
        "metric": "exact_match_accuracy",
        "n_shots": 8,
        "description": "Grade school math word problems",
        "sensitive_to_quantization": True,  # Reasoning chains break
    },
    "ARC-Challenge": {
        "metric": "accuracy",
        "n_shots": 25,
        "description": "Science question answering",
        "sensitive_to_quantization": False,  # Multiple choice is robust
    },
    # Code generation
    "HumanEval": {
        "metric": "pass@1",
        "n_shots": 0,
        "description": "Python function completion",
        "sensitive_to_quantization": True,  # Exact syntax matters
    },
    "MBPP": {
        "metric": "pass@1",
        "n_shots": 3,
        "description": "Mostly Basic Python Problems",
        "sensitive_to_quantization": True,
    },
    # Knowledge
    "MMLU": {
        "metric": "accuracy",
        "n_shots": 5,
        "description": "Massive multitask language understanding",
        "sensitive_to_quantization": False,  # MC robust to small noise
    },
    "TriviaQA": {
        "metric": "exact_match",
        "n_shots": 5,
        "description": "Open-domain QA",
        "sensitive_to_quantization": True,  # Exact recall matters
    },
    # Instruction following
    "IFEval": {
        "metric": "instruction_following_rate",
        "n_shots": 0,
        "description": "Instruction following evaluation",
        "sensitive_to_quantization": True,  # Format compliance breaks
    },
    "MT-Bench": {
        "metric": "gpt4_judge_score",
        "n_shots": 0,
        "description": "Multi-turn conversation quality",
        "sensitive_to_quantization": False,  # LLM judge robust to style shifts
    },
}
def run_evaluation_suite(model, tokenizer, suite):
    """Run all benchmarks and return structured results."""
    results = {}
    for task_name, config in suite.items():
        dataset = load_benchmark(task_name)
        predictions = generate_predictions(model, tokenizer, dataset, config)
        score = compute_metric(predictions, dataset, config["metric"])
        results[task_name] = {
            "score": score,
            "n_samples": len(dataset),
            "metric": config["metric"],
        }
    return results
Task-Specific Quality Impact of INT4 Quantization (Llama-2-70B)
| Benchmark | FP16 | GPTQ INT4 | AWQ INT4 | Delta (worst) |
|---|---|---|---|---|
| WikiText-2 PPL | 3.32 | 3.41 | 3.39 | +0.09 |
| GSM8K (accuracy) | 56.8% | 52.1% | 53.4% | -4.7pp |
| HumanEval (pass@1) | 29.9% | 26.2% | 27.1% | -3.7pp |
| MMLU (accuracy) | 69.8% | 69.1% | 69.3% | -0.7pp |
| ARC-Challenge | 64.4% | 63.8% | 64.0% | -0.6pp |
| IFEval (strict) | 52.1% | 48.3% | 49.5% | -3.8pp |
| MT-Bench (avg) | 7.42 | 7.31 | 7.35 | -0.11 |
The pattern is consistent across models: tasks requiring exact reasoning (GSM8K), precise code generation (HumanEval), or strict format compliance (IFEval) are 3-5x more sensitive to quantization than multiple-choice knowledge tasks (MMLU, ARC) or open-ended generation (MT-Bench). If your application involves structured output, code, or math, perplexity alone is an unreliable quality indicator.
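One useful complementary view of the table above is relative degradation (worst-case drop divided by the FP16 baseline), which normalizes for the tasks' very different baseline scores. A quick sketch using the numbers from the table:

```python
# (FP16 baseline, worst-case quantized score) from the table above
scores = {
    "GSM8K": (56.8, 52.1),
    "HumanEval": (29.9, 26.2),
    "IFEval": (52.1, 48.3),
    "MMLU": (69.8, 69.1),
    "ARC-Challenge": (64.4, 63.8),
}
rel = {t: (fp16 - q) / fp16 * 100 for t, (fp16, q) in scores.items()}
# HumanEval tops the list at ~12% relative; MMLU and ARC stay near 1%
for task, pct in sorted(rel.items(), key=lambda kv: -kv[1]):
    print(f"{task}: {pct:.1f}% relative drop")
```

On a relative basis the gap is even starker: HumanEval loses over 12% of its baseline score while the multiple-choice tasks lose around 1%.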
Statistical Significance
Small evaluation sets produce noisy estimates. A 2 percentage point difference on HumanEval (164 problems) is not statistically significant:
import numpy as np
from scipy import stats
def is_significant(score_a, score_b, n_samples, confidence=0.95):
    """
    Test whether the difference between two accuracy scores
    is statistically significant using a two-proportion z-test.
    """
    p1 = score_a
    p2 = score_b
    p_pool = (p1 * n_samples + p2 * n_samples) / (2 * n_samples)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n_samples)
    z = abs(p1 - p2) / se
    p_value = 2 * (1 - stats.norm.cdf(z))
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
    significant = z > z_critical
    return {
        "z_statistic": z,
        "p_value": p_value,
        "significant": significant,
        "min_detectable_difference": z_critical * se,
    }
# HumanEval: 164 problems
result = is_significant(0.299, 0.262, 164)
# z=0.75, p=0.46 -> NOT significant
# Min detectable difference: ±9.7 percentage points at 95% confidence
# HumanEval is too small to detect typical quantization effects!

# GSM8K: 1319 problems
result = is_significant(0.568, 0.521, 1319)
# z=2.42, p=0.015 -> Significant
# Min detectable difference: ±3.8 percentage points at 95% confidence

# MMLU: ~14000 questions across all subjects
result = is_significant(0.698, 0.691, 14000)
# z=1.27, p=0.20 -> NOT significant (0.7pp is within noise)
# Min detectable difference: ±1.1 percentage points at 95% confidence
A benchmark must have enough samples to detect the expected effect size. HumanEval (164 problems) cannot reliably detect differences smaller than roughly 10 percentage points. MMLU (14K questions) can detect differences of about 1pp. Always report confidence intervals, not just point estimates. When the confidence interval for the difference includes zero, the result is inconclusive, not “near-lossless.”
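The same z-test can be inverted into a planning tool: given the effect size you expect quantization to cause, how many samples does a benchmark need? A minimal sketch using the standard two-proportion power approximation (the pooled-variance form matching the function above; z values are hardcoded to avoid a SciPy dependency, and 80% power is an assumed default):

```python
def required_samples(p_baseline, min_detectable_pp, power=0.80):
    """Samples needed for a two-proportion z-test to detect a drop of
    `min_detectable_pp` percentage points at 95% confidence.

    Pooled-variance approximation; z values hardcoded (no SciPy needed).
    """
    z_alpha = 1.96                           # two-sided, 95% confidence
    z_beta = {0.80: 0.8416, 0.90: 1.2816}[power]
    delta = min_detectable_pp / 100.0
    variance = 2 * p_baseline * (1 - p_baseline)
    return int((z_alpha + z_beta) ** 2 * variance / delta ** 2) + 1

# Detecting a 3pp drop from a 57% baseline (GSM8K-like) needs ~4,300 samples:
print(required_samples(0.57, 3.0))
```

This is why GSM8K's 1,319 problems are marginal for 3pp effects and HumanEval's 164 are hopeless: sample requirements grow with the inverse square of the effect size.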
Throughput Benchmarking
What to Measure
Throughput depends on the workload. The same quantized model shows different speedups for different request patterns:
# Throughput measurement framework
class ThroughputBenchmark:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def measure_decode_throughput(self, batch_sizes, n_warmup=10, n_measure=100):
        """
        Measure pure decode throughput (single token generation).
        This is the memory-bandwidth-bound regime.
        """
        results = {}
        for bs in batch_sizes:
            # Simulate decode: context of 512 tokens, generate 1
            dummy_input = torch.randint(0, 32000, (bs, 512)).cuda()
            # Warmup
            for _ in range(n_warmup):
                with torch.no_grad():
                    self.model(dummy_input)
            torch.cuda.synchronize()
            # Measure
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(n_measure):
                with torch.no_grad():
                    self.model.generate(
                        dummy_input,
                        max_new_tokens=1,
                        do_sample=False,
                    )
            end.record()
            torch.cuda.synchronize()
            elapsed_ms = start.elapsed_time(end)
            tokens_per_sec = (bs * n_measure) / (elapsed_ms / 1000)
            results[bs] = tokens_per_sec
        return results

    def measure_prefill_throughput(self, seq_lengths, batch_size=1):
        """
        Measure prefill throughput (processing input prompt).
        This is the compute-bound regime.
        """
        results = {}
        for seq_len in seq_lengths:
            input_ids = torch.randint(0, 32000, (batch_size, seq_len)).cuda()
            # Warmup
            for _ in range(5):
                with torch.no_grad():
                    self.model(input_ids)
            torch.cuda.synchronize()
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            n_iter = 50
            start.record()
            for _ in range(n_iter):
                with torch.no_grad():
                    self.model(input_ids)
            end.record()
            torch.cuda.synchronize()
            elapsed_ms = start.elapsed_time(end)
            tokens_per_sec = (batch_size * seq_len * n_iter) / (elapsed_ms / 1000)
            results[seq_len] = tokens_per_sec
        return results

    def measure_end_to_end(self, prompts, max_new_tokens=256):
        """
        End-to-end benchmark: prefill + decode with real prompts.
        Most representative of production performance.
        """
        latencies = []
        for prompt in prompts:
            input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.cuda()
            torch.cuda.synchronize()
            start_time = torch.cuda.Event(enable_timing=True)
            end_time = torch.cuda.Event(enable_timing=True)
            start_time.record()
            output = self.model.generate(
                input_ids,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
            )
            end_time.record()
            torch.cuda.synchronize()
            n_generated = output.shape[1] - input_ids.shape[1]
            elapsed_ms = start_time.elapsed_time(end_time)
            latencies.append({
                "prompt_tokens": input_ids.shape[1],
                "generated_tokens": n_generated,
                "total_ms": elapsed_ms,
                # Crude proportional TTFT estimate; measure with a
                # streaming callback for production-grade numbers
                "ttft_ms": elapsed_ms * (input_ids.shape[1] / output.shape[1]),
                "tps": n_generated / (elapsed_ms / 1000),
            })
        return latencies
Common Throughput Measurement Mistakes
# MISTAKE 1: Measuring throughput without warmup
# The first few iterations include JIT compilation, memory allocation,
# and CUDA context setup. They can be 2-10x slower.
# MISTAKE 2: Using torch.cuda.synchronize() incorrectly
# BAD: timing includes CPU overhead
import time
start = time.time()
model(input_ids) # This returns immediately (async!)
torch.cuda.synchronize() # This waits for GPU
end = time.time()
# The measured time includes CPU-side Python overhead
# GOOD: use CUDA events for GPU-only timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
model(input_ids)
end_event.record()
torch.cuda.synchronize()
gpu_time_ms = start_event.elapsed_time(end_event)
# MISTAKE 3: Not controlling for KV cache size
# Decode throughput at position 100 is different from position 4000
# because the KV cache read dominates memory bandwidth at long contexts
# MISTAKE 4: Reporting peak throughput instead of sustained throughput
# Peak: BS=1, short context, ideal conditions
# Sustained: continuous serving with mixed request lengths and batch sizes
# MISTAKE 5: Not controlling GPU clock speed
# GPU may be in a low-power state at the start, then boost
# Use nvidia-smi to lock clocks:
# nvidia-smi -lgc 1980,1980 # Lock graphics clock to max
# nvidia-smi -lmc 1593,1593 # Lock memory clock to max
Latency Percentile Analysis
Mean latency hides the tail. For production serving, P50, P95, and P99 latencies matter:
import numpy as np
def latency_percentile_analysis(latencies_ms):
    """Compute latency percentiles from a list of measurements."""
    arr = np.array(latencies_ms)
    return {
        "P50": np.percentile(arr, 50),
        "P90": np.percentile(arr, 90),
        "P95": np.percentile(arr, 95),
        "P99": np.percentile(arr, 99),
        "P99.9": np.percentile(arr, 99.9),
        "mean": np.mean(arr),
        "std": np.std(arr),
        "min": np.min(arr),
        "max": np.max(arr),
        "n_samples": len(arr),
    }
# Example results for quantized model serving:
# FP16 (Llama-2-7B, BS=1, decode):
# P50=8.2ms, P95=9.1ms, P99=12.8ms, P99.9=45.2ms
#
# INT4 (same model, same conditions):
# P50=4.1ms, P95=4.5ms, P99=5.8ms, P99.9=18.3ms
#
# Key observations:
# Median speedup: 2.0x (expected from 4x weight compression)
# P99 speedup: 2.2x (quantized model has tighter distribution)
# P99.9 speedup: 2.5x (fewer outliers due to less memory pressure)
# The tail latency improvement is larger than the median improvement
# because the smaller model causes less memory contention
Latency Distribution: FP16 vs INT4 (Llama-2-7B, H100, BS=1 Decode)
Cost Analysis
Cost-Per-Token Calculation
The true cost of quantization includes GPU cost, throughput, and quality:
def cost_per_million_tokens(
    gpu_hourly_cost,       # $/hour for the GPU instance
    throughput_tps,        # Tokens per second (sustained)
    quality_factor=1.0,    # Multiplier for quality-adjusted cost
):
    """Calculate the cost per million output tokens."""
    tokens_per_hour = throughput_tps * 3600
    cost_per_token = gpu_hourly_cost / tokens_per_hour
    cost_per_million = cost_per_token * 1e6
    return cost_per_million * quality_factor
# H100 SXM pricing (cloud, approximate):
# On-demand: $3.50/hour
# Reserved: $2.10/hour
# Spot: $1.20/hour
# Llama-2-70B serving cost analysis (on-demand H100):
configs = {
    "FP16, 2x H100, TP=2": {
        "gpus": 2,
        "cost_per_hour": 7.00,
        "throughput": 2150,  # Total tokens/sec
    },
    "INT8, 1x H100": {
        "gpus": 1,
        "cost_per_hour": 3.50,
        "throughput": 3400,
    },
    "INT4, 1x H100": {
        "gpus": 1,
        "cost_per_hour": 3.50,
        "throughput": 5800,
    },
    "INT4, 1x RTX 4090": {
        "gpus": 1,
        "cost_per_hour": 0.40,  # Much cheaper consumer GPU
        "throughput": 1200,
    },
}

for name, cfg in configs.items():
    cpm = cost_per_million_tokens(cfg["cost_per_hour"], cfg["throughput"])
    print(f"{name}: ${cpm:.3f} per million tokens")
# Results:
# FP16, 2x H100, TP=2: $0.905 per million tokens
# INT8, 1x H100: $0.286 per million tokens (3.2x cheaper)
# INT4, 1x H100: $0.168 per million tokens (5.4x cheaper)
# INT4, 1x RTX 4090: $0.093 per million tokens (9.7x cheaper, but lower QoS)
Cost Efficiency by Quantization Method (Llama-2-70B, H100 On-Demand)
| Configuration | GPUs | Throughput | Cost/M Tokens | Quality (MMLU) |
|---|---|---|---|---|
| FP16, TP=2 | 2x H100 | 2,150 tok/s | $0.905 | 69.8% |
| INT8 (SmoothQuant) | 1x H100 | 3,400 tok/s | $0.286 | 69.1% |
| INT4 (AWQ g128) | 1x H100 | 5,800 tok/s | $0.168 | 69.3% |
| INT4 (GPTQ g128) | 1x H100 | 5,500 tok/s | $0.177 | 69.1% |
| INT4 (AWQ), RTX 4090 | 1x 4090 | 1,200 tok/s | $0.093 | 69.3% |
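The `quality_factor` parameter of `cost_per_million_tokens` can fold the quality column into this comparison. One crude but useful heuristic, shown here as a sketch, is to inflate each config's cost by the ratio of baseline quality to its own quality; the right adjustment is ultimately application-specific:

```python
def quality_adjusted_cost(cost_per_m, quality, baseline_quality):
    """Inflate cost by the relative quality loss vs the FP16 baseline."""
    return cost_per_m * (baseline_quality / quality)

# (cost/M tokens, MMLU accuracy) from the table above:
rows = {
    "FP16, TP=2": (0.905, 69.8),
    "INT8 (SmoothQuant)": (0.286, 69.1),
    "INT4 (AWQ g128)": (0.168, 69.3),
}
for name, (cost, mmlu) in rows.items():
    adj = quality_adjusted_cost(cost, mmlu, 69.8)
    print(f"{name}: ${adj:.3f}/M tokens quality-adjusted")
# INT4 AWQ stays ~5.4x cheaper: its MMLU penalty is under 1%
```

Because the MMLU deltas are small, the adjustment barely moves the ranking here; it matters far more when the quality metric is a sensitive one like GSM8K.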
The Complete Benchmarking Framework
# Full benchmarking pipeline: quality + throughput + cost
class QuantizationBenchmark:
    def __init__(self, model_name, quantization_configs):
        self.model_name = model_name
        self.configs = quantization_configs  # Dict of quant methods to compare
        self.results = {}

    def run(self):
        for config_name, config in self.configs.items():
            print(f"Benchmarking: {config_name}")
            # 1. Load model
            model, tokenizer = load_quantized_model(self.model_name, config)
            # 2. Quality evaluation
            quality = {}
            quality["perplexity_wikitext2"] = measure_perplexity(
                model, tokenizer, load_dataset("wikitext2_test")
            )
            quality["perplexity_c4"] = measure_perplexity(
                model, tokenizer, load_dataset("c4_validation")
            )
            quality["gsm8k"] = evaluate_gsm8k(model, tokenizer)
            quality["humaneval"] = evaluate_humaneval(model, tokenizer)
            quality["mmlu"] = evaluate_mmlu(model, tokenizer)
            quality["ifeval"] = evaluate_ifeval(model, tokenizer)
            # 3. Throughput evaluation
            throughput = {}
            throughput["decode_bs1"] = measure_decode_throughput(
                model, batch_sizes=[1], context_len=512
            )
            throughput["decode_bs32"] = measure_decode_throughput(
                model, batch_sizes=[32], context_len=512
            )
            throughput["prefill_2048"] = measure_prefill_throughput(
                model, seq_lengths=[2048], batch_size=1
            )
            throughput["e2e_latency"] = measure_end_to_end(
                model, tokenizer,
                prompts=load_benchmark_prompts("sharegpt_sample_500"),
                max_new_tokens=256,
            )
            # 4. Memory measurement
            memory = {
                "model_size_gb": get_model_memory_gb(model),
                "peak_memory_gb": torch.cuda.max_memory_allocated() / 1e9,
            }
            # 5. Latency percentiles
            latencies = [r["total_ms"] for r in throughput["e2e_latency"]]
            percentiles = latency_percentile_analysis(latencies)
            self.results[config_name] = {
                "quality": quality,
                "throughput": throughput,
                "memory": memory,
                "latency_percentiles": percentiles,
            }
            # Cleanup
            del model
            torch.cuda.empty_cache()
        # 6. Statistical comparison
        self.compare_results()

    def compare_results(self):
        """Compare all configs against the FP16 baseline."""
        baseline = self.results.get("FP16", None)
        if baseline is None:
            return
        for config_name, result in self.results.items():
            if config_name == "FP16":
                continue
            print(f"\n--- {config_name} vs FP16 ---")
            for task, score in result["quality"].items():
                base_score = baseline["quality"][task]
                delta = score - base_score
                sig = is_significant(score, base_score, n_samples=get_n_samples(task))
                status = "SIGNIFICANT" if sig["significant"] else "not significant"
                print(f"  {task}: {score:.4f} ({delta:+.4f}) [{status}]")

    def generate_report(self):
        """Generate a structured comparison table."""
        # Output format suitable for a blog post or results table
        pass
Reporting Template
Required information for a quantization benchmark report:
1. Model: exact model name, parameter count, architecture
2. Quantization: method, bits, group size, calibration dataset, calibration samples, any preprocessing (e.g. SmoothQuant alpha)
3. Hardware: GPU model, count, memory, driver version, CUDA version
4. Quality:
   - Perplexity on at least 2 datasets (WikiText-2 + C4/PTB)
   - Task accuracy on at least 3 benchmarks covering reasoning + code + knowledge
   - Confidence intervals for all metrics
   - Sample sizes for each benchmark
5. Throughput:
   - Decode tokens/sec at BS=1 and at the serving batch size
   - Prefill tokens/sec for the target sequence length
   - TTFT (time to first token) at representative prompt lengths
   - P50/P95/P99 latency distributions
6. Cost:
   - GPU type and pricing
   - Cost per million tokens at the measured throughput
7. Reproducibility:
   - Exact software versions (PyTorch, transformers, vLLM, etc.)
   - Random seeds used
   - Scripts to reproduce all measurements
The single most common benchmarking mistake is comparing models using different evaluation harnesses. The lm-eval-harness, vLLM evaluation, and HuggingFace evaluate library can produce different scores for the same model on the same benchmark due to differences in prompt formatting, sampling, and tokenization. Always use the same evaluation code for all configurations in a comparison.
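Item 7 of the template is easy to automate. A small sketch (stdlib only, via `importlib.metadata`) that captures the software environment alongside the results; packages that are not installed are simply recorded as None:

```python
import json
import platform
import sys
from importlib import metadata  # stdlib since Python 3.8

def environment_manifest(packages=("torch", "transformers", "vllm")):
    """Record the software environment for the reproducibility section."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # Not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

# Store this dict next to the benchmark results
print(json.dumps(environment_manifest(), indent=2))
```

Writing this manifest into the same JSON file as the benchmark scores makes it impossible to later mix up results produced by different harness versions.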
Benchmark Anti-Patterns
# Anti-pattern 1: Cherry-picking the favorable benchmark
# "Our INT4 model achieves 99.5% of FP16 on MMLU!"
# (But they didn't measure GSM8K where it drops 8%)
# Anti-pattern 2: Measuring throughput without realistic serving load
# "Our INT4 model achieves 10,000 tokens/sec!"
# (But that is BS=256 prefill throughput, not decode at BS=1)
# Anti-pattern 3: Comparing different model sizes
# "INT4 70B matches FP16 13B on MMLU at half the memory!"
# (But this is not a fair comparison of quantization quality)
# Anti-pattern 4: Ignoring calibration time in cost analysis
# GPTQ calibration on 70B: 4 GPU-hours
# AQLM calibration on 70B: 128 GPU-hours
# The amortized calibration cost matters for frequent model updates
# Anti-pattern 5: Using temperature=0 for all evaluations
# Greedy decoding masks the quality difference between models
# Use temperature=0.7 or the task-appropriate sampling to reveal
# distributional differences
# Anti-pattern 6: Benchmarking on the calibration dataset
# If you calibrated on WikiText-2, your WikiText-2 perplexity is biased
# Always evaluate on held-out data that was not used for calibration
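Anti-pattern 4 is straightforward to quantify: amortize the one-time calibration GPU-hours over the tokens served before the next model update. A sketch using the calibration figures above, $3.50/GPU-hour, and a hypothetical 1,000M-token serving volume (the AQLM serving cost is assumed equal to the AWQ config's for illustration):

```python
def amortized_cost_per_million(
    serving_cost_per_m,      # $/M tokens while serving
    calibration_gpu_hours,   # one-time calibration cost in GPU-hours
    gpu_hourly_cost,         # $/GPU-hour
    tokens_served_millions,  # M tokens served before the next recalibration
):
    """Serving cost plus amortized one-time calibration cost, per M tokens."""
    calibration_cost = calibration_gpu_hours * gpu_hourly_cost
    return serving_cost_per_m + calibration_cost / tokens_served_millions

gptq = amortized_cost_per_million(0.177, 4, 3.50, 1000)    # ~$0.191/M tokens
aqlm = amortized_cost_per_million(0.168, 128, 3.50, 1000)  # ~$0.616/M tokens
print(f"GPTQ: ${gptq:.3f}/M  AQLM: ${aqlm:.3f}/M")
```

At this serving volume the expensive calibration dominates: AQLM's amortized cost is more than triple GPTQ's despite a slightly cheaper serving rate. The gap shrinks as the volume between model updates grows.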
What a Complete Benchmark Report Looks Like (Llama-2-70B INT4 AWQ)
| Metric | FP16 Baseline | INT4 AWQ | Delta | 95% CI |
|---|---|---|---|---|
| WikiText-2 PPL | 3.32 | 3.39 | +0.07 | N/A (PPL) |
| C4 PPL | 6.47 | 6.62 | +0.15 | N/A (PPL) |
| GSM8K (n=1319) | 56.8% | 53.4% | -3.4pp | [1.0, 5.8]pp |
| HumanEval (n=164) | 29.9% | 27.1% | -2.8pp | [-3.2, 8.8]pp NS |
| MMLU (n=14042) | 69.8% | 69.3% | -0.5pp | [-0.3, 1.3]pp NS |
| Decode tok/s (BS=1) | 45 | 98 | +2.18x | N/A |
| Cost/M tokens | $0.905 | $0.168 | -81% | N/A |
Summary
Quantization benchmarking requires measuring quality on task-specific benchmarks (not just perplexity), throughput under realistic serving conditions (not just peak numbers), and cost-per-token that accounts for GPU pricing and utilization. Report confidence intervals for quality metrics and latency percentiles for performance metrics. The most common failure modes are: evaluating only on perplexity, comparing models with different evaluation harnesses, measuring throughput at unrealistic batch sizes, and not controlling for GPU thermal state and clock speeds. A rigorous benchmark must include at least one reasoning task (GSM8K), one code task (HumanEval/MBPP), and one knowledge task (MMLU), alongside perplexity, decode throughput at production batch size, and tail latency percentiles.