FP8 on H100 delivers 2x throughput over FP16—but only if your model is over 13B parameters and your batch size exceeds 16. Below that threshold, the FP8 tensor core dispatch overhead exceeds the compute savings and you get slower inference. INT4 via Marlin delivers 3.8x throughput on batch size 1, but requires AWQ or GPTQ quantization which adds 0.5-1.0 perplexity points—acceptable for chatbots, catastrophic for code generation. llama.cpp’s GGUF Q4_K_M format achieves 80% of Marlin’s throughput on CPU-only inference, making it the only viable option when you do not have GPUs. There is no universal best quantization method—the optimal choice depends on seven variables: model size, hardware, batch size distribution, latency SLO, quality floor, multi-GPU topology, and cost constraints.
This post is the decision framework. It maps those constraints to the optimal quantization method with worked examples for the most common deployment scenarios: single-GPU inference, multi-GPU tensor parallelism, CPU-only serving, and hybrid CPU+GPU offloading.
The Decision Variables
Seven variables determine the optimal quantization method:
class DeploymentConstraints:
    """The variables that determine quantization method selection."""

    def __init__(
        self,
        model_params_B,           # Model size in billions of parameters
        available_gpu_memory_GB,  # Per-GPU VRAM (0 for CPU-only)
        num_gpus,                 # Number of GPUs available
        gpu_type,                 # 'H100', 'A100', 'L40S', 'RTX4090', 'CPU', etc.
        target_latency_ms,        # First-token latency SLO (TTFT)
        target_throughput_tps,    # Minimum decode tokens/sec per request
        max_batch_size,           # Maximum concurrent requests
        quality_floor_ppl,        # Maximum acceptable perplexity degradation
        cost_priority,            # 'minimize_cost', 'minimize_latency', 'balanced'
    ):
        self.model_params_B = model_params_B
        self.available_gpu_memory_GB = available_gpu_memory_GB
        self.num_gpus = num_gpus
        self.gpu_type = gpu_type
        self.target_latency_ms = target_latency_ms
        self.target_throughput_tps = target_throughput_tps
        self.max_batch_size = max_batch_size
        self.quality_floor_ppl = quality_floor_ppl
        self.cost_priority = cost_priority
Step 1: Does the Model Fit?
The first constraint is memory. The model must fit in available GPU memory (or system RAM for CPU inference), leaving room for KV cache, activations, and framework overhead.
def estimate_model_memory(
    params_B,
    bits_per_weight,
    kv_cache_tokens=2048,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    kv_cache_bits=16,
    batch_size=1,
):
    """Estimate total memory for model + KV cache.

    Returns memory in GB.
    """
    # Model weights
    weight_bytes = params_B * 1e9 * bits_per_weight / 8
    weight_GB = weight_bytes / 1e9

    # KV cache per token: 2 (K and V) * num_layers * num_kv_heads * head_dim * bytes
    kv_bytes_per_token = (
        2 * num_layers * num_kv_heads * head_dim * kv_cache_bits / 8
    )
    kv_GB = kv_bytes_per_token * kv_cache_tokens * batch_size / 1e9

    # Framework overhead (buffers, CUDA context, etc.): ~1-2 GB
    overhead_GB = 1.5

    total_GB = weight_GB + kv_GB + overhead_GB
    return {
        'weights_GB': weight_GB,
        'kv_cache_GB': kv_GB,
        'overhead_GB': overhead_GB,
        'total_GB': total_GB,
    }
# Memory estimation for common configurations:
# (name, params_B, num_layers, num_kv_heads, head_dim)
configs = [
    ('Llama-2 7B', 6.7, 32, 32, 128),
    ('Llama-2 13B', 13.0, 40, 40, 128),
    ('Llama-2 70B', 65.0, 80, 8, 128),
    ('Llama-3.1 8B', 8.0, 32, 8, 128),
    ('Llama-3.1 70B', 70.0, 80, 8, 128),
    ('Llama-3.1 405B', 405.0, 126, 8, 128),
]
for name, params, layers, kv_heads, hdim in configs:
    for bits in [16, 8, 4]:
        mem = estimate_model_memory(
            params, bits, kv_cache_tokens=4096,
            num_layers=layers, num_kv_heads=kv_heads,
            head_dim=hdim, batch_size=8,
        )
        fits_80gb = "fits" if mem['total_GB'] <= 80 else "NO"
        fits_24gb = "fits" if mem['total_GB'] <= 24 else "NO"
        print(f"  {name:>20s} @ {bits:2d}b: {mem['total_GB']:6.1f} GB "
              f"[A100-80: {fits_80gb:>4s}, RTX4090-24: {fits_24gb:>4s}]")
    print()
Model Memory at Different Precisions (batch=8, 4096 context)
| Model | FP16 (GB) | INT8 (GB) | INT4 (GB) | Fits A100-80? | Fits RTX4090? |
|---|---|---|---|---|---|
| Llama-2 7B | 15.2 | 9.1 | 6.1 | Yes (all) | INT4 only |
| Llama-2 13B | 27.8 | 15.4 | 9.2 | Yes (all) | No |
| Llama-2 70B | 132.5 | 67.8 | 35.4 | INT4 only | No |
| Llama-3.1 8B | 17.8 | 10.4 | 6.7 | Yes (all) | INT4 only |
| Llama-3.1 70B | 142.5 | 72.8 | 38.0 | INT4 only | No |
| Llama-3.1 405B | 812.5 | 407.8 | 205.4 | Multi-GPU INT4 | No |
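A quick way to sanity-check the weights column of the table: billions of parameters times bits per weight, divided by 8, gives gigabytes directly. This is the weights-only footprint; KV cache and framework overhead come on top, as in the estimator above.

```python
def weights_gb(params_B, bits):
    """Weights-only footprint in GB: 1e9 params * bits/8 bytes = params_B * bits/8 GB."""
    return params_B * bits / 8

for params in (7, 70):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: {weights_gb(params, bits):.1f} GB weights")
# 70B @ 4-bit is 35.0 GB of weights -- the reason a 70B model fits a single 80 GB GPU at INT4
```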
Step 2: Batch Size Determines Compute vs Bandwidth Regime
The critical bifurcation: is your workload bandwidth-bound (small batches, decode-heavy) or compute-bound (large batches, prefill-heavy)?
def classify_workload(
    batch_size,
    hidden_dim,  # kept for API symmetry; the roofline crossover cancels it out
    gpu_type,
):
    """Classify workload as bandwidth-bound or compute-bound.

    Returns the crossover batch size and classification.
    """
    # Peak GPU specs (dense, no sparsity)
    gpu_specs = {
        'H100': {'bw_TB_s': 3.35, 'fp16_TFLOPS': 990, 'int8_TOPS': 1979},
        'A100': {'bw_TB_s': 2.0, 'fp16_TFLOPS': 312, 'int8_TOPS': 624},
        'L40S': {'bw_TB_s': 0.864, 'fp16_TFLOPS': 366, 'int8_TOPS': 733},
        'RTX4090': {'bw_TB_s': 1.008, 'fp16_TFLOPS': 330, 'int8_TOPS': 660},
    }
    spec = gpu_specs.get(gpu_type, gpu_specs['A100'])
    # Roofline crossover: compute_time == bandwidth_time. Per decode step a
    # GEMM does 2 * N * batch FLOPs and loads N * bytes_per_weight bytes, so
    # crossover_batch = FLOPS * bytes_per_weight / (2 * BW).
    # For FP16 weights (2 bytes) this reduces to FLOPS / BW. Achieved kernels
    # reach ~40-70% of peak compute, so the practical crossover is lower.
    fp16_crossover = int(
        spec['fp16_TFLOPS'] * 1e12 * 2 /
        (2 * spec['bw_TB_s'] * 1e12)
    )
    # INT8 weights (1 byte) with INT8 tensor cores: TOPS / (2 * BW)
    int8_crossover = int(
        spec['int8_TOPS'] * 1e12 * 1 /
        (2 * spec['bw_TB_s'] * 1e12)
    )
    if batch_size < fp16_crossover:
        regime = 'bandwidth_bound'
        recommendation = 'W4A16 (minimize bytes loaded)'
    elif batch_size < int8_crossover * 2:
        regime = 'transitional'
        recommendation = 'W4A16 for decode, FP8/INT8 for prefill'
    else:
        regime = 'compute_bound'
        recommendation = 'W8A8 (utilize INT8/FP8 tensor cores)'
    return {
        'regime': regime,
        'fp16_crossover': fp16_crossover,
        'int8_crossover': int8_crossover,
        'recommendation': recommendation,
    }
for gpu in ['H100', 'A100', 'RTX4090']:
    result = classify_workload(batch_size=1, hidden_dim=4096, gpu_type=gpu)
    print(f"\n{gpu}:")
    print(f"  FP16 crossover: batch={result['fp16_crossover']}")
    print(f"  INT8 crossover: batch={result['int8_crossover']}")
    for bs in [1, 8, 32, 128, 512]:
        r = classify_workload(bs, 4096, gpu)
        print(f"  batch={bs:>3d}: {r['regime']:>20s} -> {r['recommendation']}")
On H100, the theoretical crossover from bandwidth-bound to compute-bound for FP16 GEMMs is FLOPS/bandwidth, roughly batch size 300; with realistic kernel efficiency the transition shows up closer to batch size ~128 in practice. Below the crossover, W4A16 gives the best throughput (4x fewer bytes loaded). Above it, W8A8 with INT8 or FP8 tensor cores gives the best throughput (2x compute increase). Most chat/interactive workloads sit below the crossover; batch API workloads sit above it.
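The theoretical crossover can be reproduced from the peak specs alone. This is an independent roofline sketch (peak numbers from the published datasheets; achieved values are lower):

```python
# Theoretical roofline crossover batch: FLOPS * bytes_per_weight / (2 * BW).
# For FP16 weights (2 bytes, 2 FLOPs per weight per token) this is FLOPS / BW.
peak = {
    'H100': (990e12, 3.35e12),  # (FP16 FLOPS, bytes/s memory bandwidth)
    'A100': (312e12, 2.0e12),
}
for gpu, (flops, bw) in peak.items():
    print(f"{gpu}: theoretical FP16 crossover at batch ~{round(flops / bw)}")
```

Real decode kernels reach perhaps half of peak compute, which is why the practical transition sits well below these theoretical values.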
Step 3: The Decision Tree
def recommend_quantization(constraints):
    """Full decision tree for quantization method selection.

    Returns primary recommendation and alternatives.
    """
    c = constraints
    recommendations = []

    # CPU-only path
    if c.gpu_type == 'CPU' or c.available_gpu_memory_GB == 0:
        if c.quality_floor_ppl <= 0.1:
            recommendations.append(('GGUF Q5_K_M', 'Best quality for CPU'))
        elif c.quality_floor_ppl <= 0.3:
            recommendations.append(('GGUF Q4_K_M', 'Standard CPU choice'))
        else:
            recommendations.append(('GGUF Q3_K_M', 'Minimum viable'))
        return recommendations

    # Check whether the model fits at each precision
    total_memory_GB = c.available_gpu_memory_GB * c.num_gpus

    def fits(bits):
        return estimate_model_memory(
            c.model_params_B, bits, batch_size=c.max_batch_size
        )['total_GB'] <= total_memory_GB

    fits_fp16, fits_int8, fits_int4 = fits(16), fits(8), fits(4)

    if not fits_int4:
        recommendations.append((
            'Need more GPUs or smaller model',
            f'Model requires '
            f'{estimate_model_memory(c.model_params_B, 4)["total_GB"]:.0f} GB at INT4'
        ))
        return recommendations

    # Determine regime. Rough hidden_dim from params ~= 12 * layers * d^2,
    # assuming ~32 layers; only a ballpark value is needed here.
    hidden_dim = int((c.model_params_B * 1e9 / (12 * 32)) ** 0.5)
    regime = classify_workload(c.max_batch_size, hidden_dim, c.gpu_type)['regime']

    # GPU-specific paths
    if c.gpu_type in ['H100', 'H200']:
        if regime == 'bandwidth_bound':
            if fits_fp16 and c.quality_floor_ppl <= 0.01:
                recommendations.append(('FP16', 'No quantization needed'))
            recommendations.append(('AWQ INT4 g128 + Marlin', 'Best decode throughput'))
            if c.quality_floor_ppl <= 0.05:
                recommendations.append(('GPTQ INT4 g128 + Marlin', 'Alternative to AWQ'))
        elif regime == 'compute_bound':
            recommendations.append(('FP8 W8A8 (Transformer Engine)',
                                    'Best throughput for large batches'))
            recommendations.append(('SmoothQuant W8A8 INT8', 'Alternative without FP8'))
        else:
            recommendations.append(('AWQ INT4 + Marlin (decode)', 'Hybrid strategy'))
            recommendations.append(('+ FP8 W8A8 (prefill)', 'Different format per phase'))
    elif c.gpu_type == 'A100':
        if regime == 'bandwidth_bound':
            recommendations.append(('AWQ INT4 g128 + Marlin', 'Best decode throughput'))
        elif regime == 'compute_bound':
            recommendations.append(('SmoothQuant W8A8 INT8', 'Best for large batches on A100'))
        else:
            recommendations.append(('AWQ INT4 g128 + Marlin', 'Decode-optimized'))
    elif c.gpu_type in ['L40S', 'L4', 'RTX4090']:
        if not fits_int8:
            recommendations.append(('AWQ INT4 g128', 'Must use INT4 for memory'))
        else:
            recommendations.append(('AWQ INT4 g128 + ExLlamaV2', 'Best for consumer GPUs'))

    # Add quality-sensitive alternatives
    if c.quality_floor_ppl <= 0.02:
        recommendations.append(('Consider FP8 instead of INT4',
                                'Tighter quality requirement'))
    return recommendations
# Example usage
constraints = DeploymentConstraints(
    model_params_B=70,
    available_gpu_memory_GB=80,
    num_gpus=1,
    gpu_type='H100',
    target_latency_ms=100,
    target_throughput_tps=30,
    max_batch_size=8,
    quality_floor_ppl=0.3,
    cost_priority='minimize_cost',
)
recs = recommend_quantization(constraints)
for method, reason in recs:
    print(f"  {method}: {reason}")
Step 4: Kernel and Framework Selection
Quantization Method to Kernel/Framework Mapping
| Method | Primary Kernel | Framework | Compatibility Notes |
|---|---|---|---|
| AWQ INT4 g128 | Marlin | vLLM | No act_order, symmetric only |
| GPTQ INT4 g128 | Marlin | vLLM | No act_order for Marlin |
| GPTQ INT4 act_order | ExLlamaV2 | vLLM | Slower than Marlin, better quality |
| FP8 W8A8 | cuBLAS FP8 | TensorRT-LLM, vLLM | H100+ only |
| SmoothQuant INT8 | CUTLASS INT8 | vLLM | A100+, per-token scaling |
| GGUF Q4_K_M | llama.cpp | llama.cpp / ollama | CPU: AVX2/NEON, GPU: CUDA |
| GGUF Q5_K_M | llama.cpp | llama.cpp / ollama | Higher quality, slightly slower |
def select_serving_framework(method, gpu_type, deployment_type):
    """Select the optimal serving framework for the chosen quantization."""
    framework_map = {
        ('AWQ INT4', 'H100', 'cloud'): 'vLLM + Marlin',
        ('AWQ INT4', 'A100', 'cloud'): 'vLLM + Marlin',
        ('GPTQ INT4', 'H100', 'cloud'): 'vLLM + Marlin (if no act_order)',
        ('FP8 W8A8', 'H100', 'cloud'): 'TensorRT-LLM (fastest FP8)',
        ('INT8 W8A8', 'A100', 'cloud'): 'vLLM + SmoothQuant',
        ('GGUF', 'CPU', 'edge'): 'llama.cpp or ollama',
        ('GGUF', 'RTX4090', 'local'): 'llama.cpp (metal/cuda)',
        ('AWQ INT4', 'RTX4090', 'local'): 'vLLM or ExLlamaV2',
    }
    # Key on the base method name, dropping any '+ kernel' suffix
    base_method = method.split(' + ')[0]
    return framework_map.get((base_method, gpu_type, deployment_type),
                             'vLLM (general purpose)')
Step 5: Quality Validation
Never deploy a quantized model without measuring quality on your target distribution:
import torch

def validate_quantized_model(
    original_model,
    quantized_model,
    eval_dataset,
    metrics,
):
    """Validate quantized model quality.

    Minimum validation:
      1. Perplexity on domain-specific holdout set
      2. Task-specific accuracy (if applicable)
      3. Output distribution comparison (KL divergence)
      4. Edge case testing (long context, rare tokens, etc.)
    """
    results = {}

    # Perplexity comparison (compute_perplexity is assumed to be defined elsewhere)
    ppl_original = compute_perplexity(original_model, eval_dataset)
    ppl_quantized = compute_perplexity(quantized_model, eval_dataset)
    results['perplexity_original'] = ppl_original
    results['perplexity_quantized'] = ppl_quantized
    results['perplexity_degradation'] = ppl_quantized - ppl_original

    # KL divergence of output distributions
    kl_divs = []
    for batch in eval_dataset:
        with torch.no_grad():
            logits_orig = original_model(batch).logits
            logits_quant = quantized_model(batch).logits
        # KL(orig || quant) per token, via log_softmax for numerical stability
        logp = torch.log_softmax(logits_orig, dim=-1)
        logq = torch.log_softmax(logits_quant, dim=-1)
        kl = (logp.exp() * (logp - logq)).sum(dim=-1).mean()
        kl_divs.append(kl.item())
    results['mean_kl_divergence'] = sum(kl_divs) / len(kl_divs)

    # Pass/fail criteria
    results['ppl_pass'] = results['perplexity_degradation'] < metrics['max_ppl_degradation']
    results['kl_pass'] = results['mean_kl_divergence'] < metrics['max_kl_divergence']
    return results

# Validation thresholds by use case
VALIDATION_THRESHOLDS = {
    'chat_general': {'max_ppl_degradation': 0.3, 'max_kl_divergence': 0.1},
    'code_generation': {'max_ppl_degradation': 0.2, 'max_kl_divergence': 0.05},
    'medical_legal': {'max_ppl_degradation': 0.05, 'max_kl_divergence': 0.01},
    'summarization': {'max_ppl_degradation': 0.5, 'max_kl_divergence': 0.2},
}
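Turning the thresholds into a go/no-go decision can be a one-liner. A hypothetical helper (the `measured` dict mirrors what `validate_quantized_model` returns):

```python
def deploy_gate(measured, thresholds):
    """Pass only if every measured degradation is within its threshold."""
    return (measured['perplexity_degradation'] <= thresholds['max_ppl_degradation']
            and measured['mean_kl_divergence'] <= thresholds['max_kl_divergence'])

# Example: an INT4 model evaluated against the general-chat thresholds
measured = {'perplexity_degradation': 0.04, 'mean_kl_divergence': 0.02}
chat = {'max_ppl_degradation': 0.3, 'max_kl_divergence': 0.1}
print(deploy_gate(measured, chat))  # True: both metrics within budget
```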
Worked Examples
Example 1: 7B Chat Model, Single A100-80GB, Interactive Use
# Deployment: Single A100-80GB, Llama-2 7B chat, 1-8 concurrent users
scenario_1 = DeploymentConstraints(
    model_params_B=7,
    available_gpu_memory_GB=80,
    num_gpus=1,
    gpu_type='A100',
    target_latency_ms=50,        # TTFT < 50 ms
    target_throughput_tps=40,    # 40 tok/s per user
    max_batch_size=8,
    quality_floor_ppl=0.3,
    cost_priority='minimize_cost',
)
# Analysis:
# - Model fits at FP16 (15.2 GB) with plenty of room for KV cache
# - Batch size 8 -> bandwidth-bound regime on A100
# - Quality floor 0.3 ppl -> INT4 is acceptable
# Recommendation: AWQ INT4 g128 + Marlin in vLLM
# - Model size: ~4 GB -> leaves 76 GB for KV cache
# - Decode throughput: ~850 tok/s total across batch
# - Per-user throughput: ~106 tok/s (well above 40 tok/s target)
# - Quality: +0.04 ppl degradation (well within 0.3 floor)
# Alternative: FP16 (no quantization)
# - Model size: ~15 GB -> leaves 65 GB for KV cache
# - Decode throughput: ~258 tok/s total
# - Per-user: ~32 tok/s (BELOW 40 tok/s target!)
# -> FP16 FAILS the throughput SLO. Quantization is required.
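The per-user arithmetic above is simply aggregate decode throughput divided by batch size; checking it against the SLO (throughput figures are the ones quoted in the example):

```python
def meets_slo(total_tps, batch_size, per_user_target_tps):
    """Does the per-user share of aggregate decode throughput meet the SLO?"""
    return total_tps / batch_size >= per_user_target_tps

print(meets_slo(850, 8, 40))  # AWQ INT4: ~106 tok/s per user -> True
print(meets_slo(258, 8, 40))  # FP16: ~32 tok/s per user -> False
```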
Example 2: 70B Model, 2x H100, Batch API
# Deployment: 2x H100-80GB, Llama-2 70B, batch processing
scenario_2 = DeploymentConstraints(
    model_params_B=70,
    available_gpu_memory_GB=80,
    num_gpus=2,
    gpu_type='H100',
    target_latency_ms=5000,      # 5 s TTFT acceptable for batch
    target_throughput_tps=10,    # 10 tok/s per request minimum
    max_batch_size=128,
    quality_floor_ppl=0.1,
    cost_priority='minimize_cost',
)
# Analysis:
# - FP16: 132.5 GB -> needs 2x H100 (fits, but no room for large batch KV)
# - INT4: 35.4 GB -> fits on 1x H100 with room for KV cache!
# - Batch 128 -> compute-bound on H100
# - Quality floor 0.1 ppl -> need high-quality quantization
# Recommendation: FP8 W8A8 + TensorRT-LLM on 2x H100
# - Model size (FP8): ~70 GB across 2 GPUs
# - Throughput: ~1.5x FP16 from FP8 tensor cores
# - Quality: +0.02 ppl (within 0.1 floor)
# Alternative: AWQ INT4 on 1x H100
# - Saves one GPU (cost reduction!)
# - But: batch 128 is compute-bound, INT4 does not help compute
# - Throughput would be similar to FP16 (no compute benefit)
# -> FP8 on 2 GPUs is better for high-throughput batch processing
Example 3: 7B Model on MacBook Pro (CPU/Metal)
# Deployment: MacBook Pro M2 Max, 32GB RAM, personal use
scenario_3 = DeploymentConstraints(
    model_params_B=7,
    available_gpu_memory_GB=0,   # CPU inference
    num_gpus=0,
    gpu_type='CPU',
    target_latency_ms=500,
    target_throughput_tps=15,    # 15 tok/s for interactive use
    max_batch_size=1,
    quality_floor_ppl=0.2,
    cost_priority='minimize_latency',
)
# Analysis:
# - CPU/Metal inference only -> llama.cpp with GGUF
# - 32 GB RAM -> model + KV cache must fit
# - Quality floor 0.2 ppl -> Q4_K_M or Q5_K_M
# Recommendation: GGUF Q5_K_M via llama.cpp
# - Model size: 4.5 GB (fits easily in 32 GB)
# - Throughput on M2 Max: ~24 tok/s (above 15 target)
# - Quality: +0.05 ppl (within 0.2 floor)
# Alternative: GGUF Q4_K_M
# - Model size: 3.8 GB
# - Throughput: ~28 tok/s (faster)
# - Quality: +0.16 ppl (still within floor)
# -> Q5_K_M is preferred for quality, Q4_K_M if speed matters more
Cost Optimization
The cost of serving a quantized model depends on GPU-hours. Lower precision = fewer GPUs = lower cost:
import math

def serving_cost_estimate(
    model_params_B,
    bits_per_weight,
    gpu_type,
    gpu_hourly_cost,
    target_total_throughput_tps,
    avg_batch_size,
):
    """Estimate serving cost per 1M tokens."""
    # Peak memory bandwidth in bytes/s
    gpu_bw = {
        'H100': 3.35e12, 'A100': 2.0e12, 'L40S': 0.864e12,
        'RTX4090': 1.008e12,
    }
    bw = gpu_bw.get(gpu_type, 2.0e12)
    bytes_per_weight = bits_per_weight / 8
    model_bytes = model_params_B * 1e9 * bytes_per_weight

    # Decode throughput: limited by bandwidth (for small batches)
    time_per_token = model_bytes / bw  # seconds per decode step
    tokens_per_sec_per_gpu = avg_batch_size / time_per_token

    # Number of GPUs needed to hit the aggregate throughput target
    num_gpus = max(1, math.ceil(
        target_total_throughput_tps / tokens_per_sec_per_gpu
    ))

    # Cost per 1M tokens
    seconds_per_1M = 1e6 / target_total_throughput_tps
    hours_per_1M = seconds_per_1M / 3600
    cost_per_1M = hours_per_1M * gpu_hourly_cost * num_gpus
    return {
        'num_gpus': num_gpus,
        'per_gpu_tps': tokens_per_sec_per_gpu,
        'cost_per_1M_tokens': cost_per_1M,
    }
# Compare costs: 70B model on H100s
print("70B Model Serving Cost Comparison (H100 @ $3/hr, target: 100 tok/s)")
for label, bits in [('FP16', 16), ('INT8', 8), ('INT4', 4)]:
    result = serving_cost_estimate(
        model_params_B=70,
        bits_per_weight=bits,
        gpu_type='H100',
        gpu_hourly_cost=3.0,
        target_total_throughput_tps=100,
        avg_batch_size=4,
    )
    print(f"  {label:>5s}: {result['num_gpus']} GPUs, "
          f"${result['cost_per_1M_tokens']:.2f}/1M tokens")
Serving Cost: 70B Model on H100 ($3/hr per GPU)
| Precision | GPUs Needed | Cost per 1M Tokens | vs FP16 |
|---|---|---|---|
| FP16 | 4 | $33.33 | 1.0x |
| INT8 (W8A8) | 2 | $16.67 | 0.5x |
| INT4 (W4A16) | 1 | $8.33 | 0.25x |
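The dollar figures follow directly from GPU-hours: 1M tokens at 100 tok/s takes about 2.78 hours, times $3/hr, times the GPU count for each precision:

```python
def cost_per_million_tokens(num_gpus, tps=100, hourly_usd=3.0):
    """Serving cost for 1M tokens at a fixed aggregate throughput."""
    hours = 1e6 / tps / 3600
    return round(hours * hourly_usd * num_gpus, 2)

for label, gpus in [('FP16', 4), ('INT8', 2), ('INT4', 1)]:
    print(f"{label}: ${cost_per_million_tokens(gpus):.2f} per 1M tokens")
```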
[Chart: Serving Cost per 1M Tokens, 70B model (USD per 1M tokens)]

The Master Decision Table
Quantization Decision Matrix
| Scenario | GPU | Batch | Method | Quality | Speed vs FP16 |
|---|---|---|---|---|---|
| 7B chat, 1 user | A100 | 1 | AWQ INT4 + Marlin | +0.04 ppl | 3.3x |
| 7B chat, 1 user | RTX4090 | 1 | AWQ INT4 + ExLlama | +0.04 ppl | 3.0x |
| 7B chat, 1 user | CPU (M2) | 1 | GGUF Q4_K_M | +0.16 ppl | N/A |
| 13B API, 32 users | A100 | 32 | AWQ INT4 + Marlin | +0.04 ppl | 2.5x |
| 70B chat, 8 users | H100 | 8 | AWQ INT4 + Marlin | +0.04 ppl | 3.8x |
| 70B batch, 128 req | 2xH100 | 128 | FP8 W8A8 (TRT-LLM) | +0.02 ppl | 1.5x |
| 70B quality-critical | 2xH100 | 8 | AWQ INT4 g128 | +0.04 ppl | 3.5x |
| 405B any | 8xH100 | varies | FP8 W8A8 (TP=8) | +0.02 ppl | 1.5x |
| 7B edge device | None | 1 | GGUF Q4_K_M | +0.16 ppl | N/A |
Deployment Checklist
DEPLOYMENT_CHECKLIST = [
    "1. Profile production traffic: batch size distribution, prompt/completion ratio",
    "2. Estimate memory: model weights + KV cache for peak batch * peak context",
    "3. Determine regime: bandwidth-bound (W4A16) or compute-bound (W8A8/FP8)",
    "4. Select method: AWQ/GPTQ for W4A16, SmoothQuant for INT8, TE for FP8",
    "5. Quantize model with calibration data from your domain",
    "6. Validate: perplexity on domain holdout set",
    "7. Validate: task-specific accuracy if applicable",
    "8. Validate: latency/throughput meets SLOs at peak load",
    "9. Monitor: track output quality metrics in production",
    "10. Re-quantize when model is updated (quantization is not transferable)",
]

COMMON_MISTAKES = [
    "Using per-tensor activation scaling for W8A8 (use per-token)",
    "Using GPTQ with act_order and expecting Marlin compatibility",
    "Quantizing to INT4 for a batch-API workload (no throughput benefit)",
    "Not leaving memory for KV cache (model fits but inference OOMs)",
    "Using FP8 on A100 (no FP8 tensor cores)",
    "Not validating on domain-specific data (Wiki perplexity != your task)",
    "Mixing AWQ quantized model with non-AWQ inference kernel",
]
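The first mistake on the list is easy to demonstrate with synthetic data: when one token produces outlier activations, a single per-tensor INT8 scale is dominated by that outlier and wastes the quantization range for every other token, while per-token scales do not (a NumPy sketch; the data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(8, 64))  # 8 tokens x 64 channels
acts[0] *= 50.0  # one outlier token, as happens in real LLM activations

def int8_mse(x, scale):
    """Mean squared error of symmetric INT8 round-trip at a given scale."""
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.mean((q * scale - x) ** 2))

per_tensor_err = int8_mse(acts, np.abs(acts).max() / 127)
per_token_err = float(np.mean(
    [int8_mse(row, np.abs(row).max() / 127) for row in acts]
))
print(per_tensor_err > per_token_err)  # True: per-token scaling is more accurate
```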