Part 20 of 30 in the Quantization Masterclass series

FP8 on H100 delivers 2x throughput over FP16—but only if your model is over 13B parameters and your batch size exceeds 16. Below that threshold, the FP8 tensor core dispatch overhead exceeds the compute savings and you get slower inference. INT4 via Marlin delivers 3.8x throughput on batch size 1, but requires AWQ or GPTQ quantization which adds 0.5-1.0 perplexity points—acceptable for chatbots, catastrophic for code generation. llama.cpp’s GGUF Q4_K_M format achieves 80% of Marlin’s throughput on CPU-only inference, making it the only viable option when you do not have GPUs. There is no universal best quantization method—the optimal choice depends on seven variables: model size, hardware, batch size distribution, latency SLO, quality floor, multi-GPU topology, and cost constraints.

This post is the decision framework. It maps those constraints to the optimal quantization method with worked examples for the most common deployment scenarios: single-GPU inference, multi-GPU tensor parallelism, CPU-only serving, and hybrid CPU+GPU offloading.

The Decision Variables

Seven variables determine the optimal quantization method:

from dataclasses import dataclass

@dataclass
class DeploymentConstraints:
    """The variables that determine quantization method selection."""

    model_params_B: float           # Model size in billions of parameters
    available_gpu_memory_GB: float  # Per-GPU VRAM (0 for CPU-only)
    num_gpus: int                   # Number of GPUs available
    gpu_type: str                   # 'H100', 'A100', 'L40S', 'RTX4090', 'CPU', etc.
    target_latency_ms: float        # First-token latency SLO (TTFT)
    target_throughput_tps: float    # Minimum decode tokens/sec per request
    max_batch_size: int             # Maximum concurrent requests
    quality_floor_ppl: float        # Maximum acceptable perplexity degradation
    cost_priority: str              # 'minimize_cost', 'minimize_latency', 'balanced'

Step 1: Does the Model Fit?

The first constraint is memory. The model must fit in available GPU memory (or system RAM for CPU inference), leaving room for KV cache, activations, and framework overhead.

def estimate_model_memory(
    params_B,
    bits_per_weight,
    kv_cache_tokens=2048,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    kv_cache_bits=16,
    batch_size=1,
):
    """Estimate total memory for model + KV cache.

    Returns memory in GB.
    """
    # Model weights
    weight_bytes = params_B * 1e9 * bits_per_weight / 8
    weight_GB = weight_bytes / 1e9

    # KV cache per token: 2 * num_layers * num_kv_heads * head_dim * bytes
    kv_bytes_per_token = (
        2 * num_layers * num_kv_heads * head_dim * kv_cache_bits / 8
    )
    kv_GB = (
        kv_bytes_per_token * kv_cache_tokens * batch_size / 1e9
    )

    # Framework overhead (buffers, CUDA context, etc): ~1-2 GB
    overhead_GB = 1.5

    total_GB = weight_GB + kv_GB + overhead_GB

    return {
        'weights_GB': weight_GB,
        'kv_cache_GB': kv_GB,
        'overhead_GB': overhead_GB,
        'total_GB': total_GB,
    }

# Memory estimation for common configurations
configs = [
    ('Llama-2 7B', 6.7, 32, 32, 128),
    ('Llama-2 13B', 13.0, 40, 40, 128),
    ('Llama-2 70B', 65.0, 80, 8, 128),
    ('Llama-3.1 8B', 8.0, 32, 8, 128),
    ('Llama-3.1 70B', 70.0, 80, 8, 128),
    ('Llama-3.1 405B', 405.0, 126, 8, 128),
]

for name, params, layers, kv_heads, hdim in configs:
    for bits in [16, 8, 4]:
        mem = estimate_model_memory(
            params, bits, kv_cache_tokens=4096,
            num_layers=layers, num_kv_heads=kv_heads,
            head_dim=hdim, batch_size=8
        )
        fits_80gb = "fits" if mem['total_GB'] <= 80 else "NO"
        fits_24gb = "fits" if mem['total_GB'] <= 24 else "NO"
        print(f"  {name:>20s} @ {bits:2d}b: {mem['total_GB']:6.1f} GB "
              f"[A100-80: {fits_80gb:>4s}, RTX4090-24: {fits_24gb:>4s}]")
    print()

Model Memory at Different Precisions (batch=8, 4096 context)

| Model | FP16 (GB) | INT8 (GB) | INT4 (GB) | Fits A100-80? | Fits RTX4090-24? |
|---|---|---|---|---|---|
| Llama-2 7B | 15.2 | 9.1 | 6.1 | Yes (all) | Yes (all) |
| Llama-2 13B | 27.8 | 15.4 | 9.2 | Yes (all) | INT8/INT4 only |
| Llama-2 70B | 132.5 | 67.8 | 35.4 | INT8/INT4 only | No |
| Llama-3.1 8B | 17.8 | 10.4 | 6.7 | Yes (all) | Yes (all) |
| Llama-3.1 70B | 142.5 | 72.8 | 38.0 | INT8/INT4 only | No |
| Llama-3.1 405B | 812.5 | 407.8 | 205.4 | Multi-GPU INT4 only | No |

Note: Memory includes weights + KV cache for batch=8 at 4096 tokens + 1.5 GB overhead. The fits columns follow directly from the totals: INT8 is what first gets a 70B model onto a single A100-80GB (INT4 leaves far more KV-cache headroom), and INT8/INT4 are what get 13B onto an RTX 4090.
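The 70B row is the one that matters most in practice, and the weight term alone explains it: this is just the weight arithmetic from estimate_model_memory, before KV cache and overhead.

```python
def weight_gb(params_b: float, bits: int) -> float:
    """Weight memory alone, before KV cache and framework overhead."""
    return params_b * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    gb = weight_gb(70, bits)
    print(f"70B @ {bits:>2}-bit: {gb:.0f} GB of weights "
          f"({'under' if gb < 80 else 'over'} 80 GB)")
```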

Step 2: Batch Size Determines Compute vs Bandwidth Regime

The critical bifurcation: is your workload bandwidth-bound (small batches, decode-heavy) or compute-bound (large batches, prefill-heavy)?

def classify_workload(
    batch_size,
    hidden_dim,
    gpu_type,
):
    """Classify workload as bandwidth-bound or compute-bound.

    Returns the crossover batch size and classification.
    """
    # GPU specs
    gpu_specs = {
        'H100': {'bw_TB_s': 3.35, 'fp16_TFLOPS': 990, 'int8_TOPS': 1979},
        'A100': {'bw_TB_s': 2.0, 'fp16_TFLOPS': 312, 'int8_TOPS': 624},
        'L40S': {'bw_TB_s': 0.864, 'fp16_TFLOPS': 366, 'int8_TOPS': 733},
        'RTX4090': {'bw_TB_s': 1.008, 'fp16_TFLOPS': 330, 'int8_TOPS': 660},
    }

    spec = gpu_specs.get(gpu_type, gpu_specs['A100'])

    # Roofline crossover: the batch size where a weight-dominated GEMM
    # flips from bandwidth-bound to compute-bound. For a [B, K] x [K, N]
    # GEMM, compute_time = 2*B*K*N / FLOPS and weight_load_time =
    # K*N*bytes_per_weight / BW, so crossover B = FLOPS * bytes / (2 * BW);
    # hidden_dim cancels out of this estimate entirely.
    # UTILIZATION is a rough empirical derate (an assumption; tune it for
    # your stack): real decode kernels run well below peak tensor
    # throughput, which pulls the effective crossover under the peak ratio.
    UTILIZATION = 0.43
    fp16_crossover = int(
        UTILIZATION * spec['fp16_TFLOPS'] * 2 / (2 * spec['bw_TB_s'])
    )

    # INT8 weights are 1 byte but tensor-core throughput doubles, so the
    # crossover lands in roughly the same place
    int8_crossover = int(
        UTILIZATION * spec['int8_TOPS'] * 1 / (2 * spec['bw_TB_s'])
    )

    if batch_size < fp16_crossover:
        regime = 'bandwidth_bound'
        recommendation = 'W4A16 (minimize bytes loaded)'
    elif batch_size < int8_crossover * 2:
        regime = 'transitional'
        recommendation = 'W4A16 for decode, FP8/INT8 for prefill'
    else:
        regime = 'compute_bound'
        recommendation = 'W8A8 (utilize INT8/FP8 tensor cores)'

    return {
        'regime': regime,
        'fp16_crossover': fp16_crossover,
        'int8_crossover': int8_crossover,
        'recommendation': recommendation,
    }

for gpu in ['H100', 'A100', 'RTX4090']:
    result = classify_workload(batch_size=1, hidden_dim=4096, gpu_type=gpu)
    print(f"\n{gpu}:")
    print(f"  FP16 crossover: batch={result['fp16_crossover']}")
    print(f"  INT8 crossover: batch={result['int8_crossover']}")
    for bs in [1, 8, 32, 128, 512]:
        r = classify_workload(bs, 4096, gpu)
        print(f"  batch={bs:>3d}: {r['regime']:>20s} -> {r['recommendation']}")
The Batch Size Threshold

On H100, the crossover from bandwidth-bound to compute-bound occurs at a batch size of roughly 128 for FP16 GEMMs. (The theoretical roofline ridge, peak FLOPS divided by memory bandwidth, sits closer to 300; kernel overheads and attention bandwidth pull the effective crossover lower in practice.) Below the crossover, W4A16 gives the best throughput (4x fewer weight bytes loaded). Above it, W8A8 with INT8 or FP8 tensor cores gives the best throughput (2x compute). Most chat/interactive workloads sit below the crossover; batch API workloads sit above it.
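The theoretical end of that claim is easy to check from the same datasheet peaks used in classify_workload; the printed figure is the upper bound before any real-world derate.

```python
# Theoretical crossover batch: peak FLOPS / memory bandwidth. A weight-
# dominated FP16 GEMM loads 2 bytes per weight and performs 2*batch FLOPs
# per weight, so the two factors of 2 cancel.
PEAKS = {
    'H100':    {'bw_TB_s': 3.35,  'fp16_TFLOPS': 990},
    'A100':    {'bw_TB_s': 2.0,   'fp16_TFLOPS': 312},
    'RTX4090': {'bw_TB_s': 1.008, 'fp16_TFLOPS': 330},
}

def roofline_crossover(gpu: str) -> int:
    spec = PEAKS[gpu]
    return int(spec['fp16_TFLOPS'] / spec['bw_TB_s'])

for gpu in PEAKS:
    print(f"{gpu}: compute-bound above batch ~{roofline_crossover(gpu)} (theoretical)")
```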

Step 3: The Decision Tree

def recommend_quantization(constraints):
    """Full decision tree for quantization method selection.

    Returns primary recommendation and alternatives.
    """
    c = constraints
    recommendations = []

    # CPU-only path
    if c.gpu_type == 'CPU' or c.available_gpu_memory_GB == 0:
        if c.quality_floor_ppl <= 0.1:
            recommendations.append(('GGUF Q5_K_M', 'Best quality for CPU'))
        elif c.quality_floor_ppl <= 0.3:
            recommendations.append(('GGUF Q4_K_M', 'Standard CPU choice'))
        else:
            recommendations.append(('GGUF Q3_K_M', 'Minimum viable'))
        return recommendations

    # Check if model fits at different precisions
    fits_fp16 = estimate_model_memory(
        c.model_params_B, 16,
        batch_size=c.max_batch_size
    )['total_GB'] <= c.available_gpu_memory_GB * c.num_gpus

    fits_int8 = estimate_model_memory(
        c.model_params_B, 8,
        batch_size=c.max_batch_size
    )['total_GB'] <= c.available_gpu_memory_GB * c.num_gpus

    fits_int4 = estimate_model_memory(
        c.model_params_B, 4,
        batch_size=c.max_batch_size
    )['total_GB'] <= c.available_gpu_memory_GB * c.num_gpus

    if not fits_int4:
        recommendations.append((
            'Need more GPUs or smaller model',
            f'Model requires {estimate_model_memory(c.model_params_B, 4)["total_GB"]:.0f} GB at INT4'
        ))
        return recommendations

    # Determine regime; a nominal hidden_dim is fine for this rough estimate
    regime = classify_workload(
        c.max_batch_size,
        4096,  # nominal hidden_dim
        c.gpu_type
    )['regime']

    # GPU-specific paths
    if c.gpu_type in ['H100', 'H200']:
        if regime == 'bandwidth_bound':
            if fits_fp16 and c.quality_floor_ppl <= 0.01:
                recommendations.append(('FP16', 'No quantization needed'))
            recommendations.append(('AWQ INT4 g128 + Marlin', 'Best decode throughput'))
            if c.quality_floor_ppl <= 0.05:
                recommendations.append(('GPTQ INT4 g128 + Marlin', 'Alternative to AWQ'))
        elif regime == 'compute_bound':
            recommendations.append(('FP8 W8A8 (Transformer Engine)', 'Best throughput for large batches'))
            recommendations.append(('SmoothQuant W8A8 INT8', 'Alternative without FP8'))
        else:
            recommendations.append(('AWQ INT4 + Marlin (decode)', 'Hybrid strategy'))
            recommendations.append(('+ FP8 W8A8 (prefill)', 'Different format per phase'))

    elif c.gpu_type == 'A100':
        if regime == 'bandwidth_bound':
            recommendations.append(('AWQ INT4 g128 + Marlin', 'Best decode throughput'))
        elif regime == 'compute_bound':
            recommendations.append(('SmoothQuant W8A8 INT8', 'Best for large batches on A100'))
        else:
            recommendations.append(('AWQ INT4 g128 + Marlin', 'Decode-optimized'))

    elif c.gpu_type in ['L40S', 'L4', 'RTX4090']:
        if not fits_int8:
            recommendations.append(('AWQ INT4 g128', 'Must use INT4 for memory'))
        else:
            recommendations.append(('AWQ INT4 g128 + ExLlamaV2', 'Best for consumer GPUs'))

    # Add quality-sensitive alternatives
    if c.quality_floor_ppl <= 0.02:
        recommendations.append(('Consider FP8 instead of INT4', 'Tighter quality requirement'))

    return recommendations

# Example usage
constraints = DeploymentConstraints(
    model_params_B=70,
    available_gpu_memory_GB=80,
    num_gpus=1,
    gpu_type='H100',
    target_latency_ms=100,
    target_throughput_tps=30,
    max_batch_size=8,
    quality_floor_ppl=0.3,
    cost_priority='minimize_cost',
)

recs = recommend_quantization(constraints)
for method, reason in recs:
    print(f"  {method}: {reason}")

Step 4: Kernel and Framework Selection


Quantization Method to Kernel/Framework Mapping

| Method | Primary Kernel | Framework | Compatibility Notes |
|---|---|---|---|
| AWQ INT4 g128 | Marlin | vLLM | No act_order, symmetric only |
| GPTQ INT4 g128 | Marlin | vLLM | No act_order for Marlin |
| GPTQ INT4 act_order | ExLlamaV2 | vLLM | Slower than Marlin, better quality |
| FP8 W8A8 | cuBLAS FP8 | TensorRT-LLM, vLLM | H100+ only |
| SmoothQuant INT8 | CUTLASS INT8 | vLLM | A100+, per-token scaling |
| GGUF Q4_K_M | llama.cpp | llama.cpp / ollama | CPU: AVX2/NEON, GPU: CUDA |
| GGUF Q5_K_M | llama.cpp | llama.cpp / ollama | Higher quality, slightly slower |
Note: Marlin (vLLM) is the fastest W4A16 kernel on NVIDIA GPUs. ExLlamaV2 supports act_order but is ~15% slower. TensorRT-LLM has the fastest FP8 inference. llama.cpp is the only option for CPU.
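The compatibility notes above are worth encoding as a preflight check before you ship a checkpoint. This is a sketch against the constraints named in the table (4-bit, symmetric, no act_order, group size 128); `is_marlin_compatible` and the config keys are illustrative names, not a vLLM API.

```python
def is_marlin_compatible(quant_config: dict) -> tuple[bool, str]:
    """Check a GPTQ/AWQ config against Marlin's constraints as listed in
    the table above: 4-bit, symmetric, no act_order, group size 128/-1."""
    if quant_config.get('bits') != 4:
        return False, 'Marlin requires 4-bit weights'
    if quant_config.get('desc_act', False):
        return False, 'act_order (desc_act) not supported; fall back to ExLlamaV2'
    if not quant_config.get('sym', True):
        return False, 'asymmetric quantization not supported'
    if quant_config.get('group_size', 128) not in (128, -1):
        return False, 'group size must be 128 or -1 (per-channel)'
    return True, 'ok'

# A GPTQ act_order checkpoint fails the check and needs the slower kernel
ok, why = is_marlin_compatible(
    {'bits': 4, 'sym': True, 'desc_act': True, 'group_size': 128}
)
print(ok, why)
```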
def select_serving_framework(method, gpu_type, deployment_type):
    """Select the optimal serving framework for the chosen quantization."""

    framework_map = {
        ('AWQ INT4', 'H100', 'cloud'): 'vLLM + Marlin',
        ('AWQ INT4', 'A100', 'cloud'): 'vLLM + Marlin',
        ('GPTQ INT4', 'H100', 'cloud'): 'vLLM + Marlin (if no act_order)',
        ('FP8 W8A8', 'H100', 'cloud'): 'TensorRT-LLM (fastest FP8)',
        ('INT8 W8A8', 'A100', 'cloud'): 'vLLM + SmoothQuant',
        ('GGUF', 'CPU', 'edge'): 'llama.cpp or ollama',
        ('GGUF', 'RTX4090', 'local'): 'llama.cpp (CUDA backend)',
        ('AWQ INT4', 'RTX4090', 'local'): 'vLLM or ExLlamaV2',
    }

    # Normalize 'AWQ INT4 + Marlin' -> 'AWQ INT4'; split(' + ') is a
    # no-op when the separator is absent
    key = (method.split(' + ')[0], gpu_type, deployment_type)

    return framework_map.get(key, 'vLLM (general purpose)')

Step 5: Quality Validation

Never deploy a quantized model without measuring quality on your target distribution:

import torch

def validate_quantized_model(
    original_model,
    quantized_model,
    eval_dataset,
    metrics,
):
    """Validate quantized model quality.

    Minimum validation:
    1. Perplexity on domain-specific holdout set
    2. Task-specific accuracy (if applicable)
    3. Output distribution comparison (KL divergence)
    4. Edge case testing (long context, rare tokens, etc)
    """
    results = {}

    # Perplexity comparison (compute_perplexity assumed defined elsewhere)
    ppl_original = compute_perplexity(original_model, eval_dataset)
    ppl_quantized = compute_perplexity(quantized_model, eval_dataset)
    results['perplexity_original'] = ppl_original
    results['perplexity_quantized'] = ppl_quantized
    results['perplexity_degradation'] = ppl_quantized - ppl_original

    # KL divergence of output distributions
    kl_divs = []
    for batch in eval_dataset:
        with torch.no_grad():
            logits_orig = original_model(batch).logits
            logits_quant = quantized_model(batch).logits

        # KL(orig || quant) per token, via log_softmax for numerical
        # stability (avoids log(0) on near-zero probabilities)
        logp = torch.log_softmax(logits_orig, dim=-1)
        logq = torch.log_softmax(logits_quant, dim=-1)
        kl = (logp.exp() * (logp - logq)).sum(dim=-1).mean()
        kl_divs.append(kl.item())

    results['mean_kl_divergence'] = sum(kl_divs) / len(kl_divs)

    # Pass/fail criteria
    results['ppl_pass'] = results['perplexity_degradation'] < metrics['max_ppl_degradation']
    results['kl_pass'] = results['mean_kl_divergence'] < metrics['max_kl_divergence']

    return results

# Validation thresholds by use case
VALIDATION_THRESHOLDS = {
    'chat_general': {'max_ppl_degradation': 0.3, 'max_kl_divergence': 0.1},
    'code_generation': {'max_ppl_degradation': 0.2, 'max_kl_divergence': 0.05},
    'medical_legal': {'max_ppl_degradation': 0.05, 'max_kl_divergence': 0.01},
    'summarization': {'max_ppl_degradation': 0.5, 'max_kl_divergence': 0.2},
}
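As a sanity check on the KL gate, here is the same KL(p || q) computed for a toy pair of next-token distributions in plain Python; the numbers are made up for illustration.

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A quantized model whose output distribution barely moves passes a
# 0.1-nat gate comfortably...
p = [0.70, 0.20, 0.10]
q = [0.68, 0.21, 0.11]
print(round(kl_divergence(p, q), 4))

# ...while one that reorders the top tokens blows through it
q_bad = [0.30, 0.60, 0.10]
print(kl_divergence(p, q_bad) < 0.1)  # False: fails the chat_general gate
```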

Worked Examples

Example 1: 7B Chat Model, Single A100-80GB, Interactive Use

# Deployment: Single A100-80GB, Llama-2 7B chat, 1-8 concurrent users
scenario_1 = DeploymentConstraints(
    model_params_B=7,
    available_gpu_memory_GB=80,
    num_gpus=1,
    gpu_type='A100',
    target_latency_ms=50,     # TTFT < 50ms
    target_throughput_tps=40,  # 40 tok/s per user
    max_batch_size=8,
    quality_floor_ppl=0.3,
    cost_priority='minimize_cost',
)

# Analysis:
# - Model fits at FP16 (15.2 GB) with plenty of room for KV cache
# - Batch size 8 -> bandwidth-bound regime on A100
# - Quality floor 0.3 ppl -> INT4 is acceptable

# Recommendation: AWQ INT4 g128 + Marlin in vLLM
# - Model size: ~4 GB -> leaves 76 GB for KV cache
# - Decode throughput: ~850 tok/s total across batch
# - Per-user throughput: ~106 tok/s (well above 40 tok/s target)
# - Quality: +0.04 ppl degradation (well within 0.3 floor)

# Alternative: FP16 (no quantization)
# - Model size: ~15 GB -> leaves 65 GB for KV cache
# - Decode throughput: ~258 tok/s total
# - Per-user: ~32 tok/s (BELOW 40 tok/s target!)
# -> FP16 FAILS the throughput SLO. Quantization is required.

Example 2: 70B Model, 2x H100, Batch API

# Deployment: 2x H100-80GB, Llama-2 70B, batch processing
scenario_2 = DeploymentConstraints(
    model_params_B=70,
    available_gpu_memory_GB=80,
    num_gpus=2,
    gpu_type='H100',
    target_latency_ms=5000,     # 5s TTFT acceptable for batch
    target_throughput_tps=10,    # 10 tok/s per request minimum
    max_batch_size=128,
    quality_floor_ppl=0.1,
    cost_priority='minimize_cost',
)

# Analysis:
# - FP16: 132.5 GB -> needs 2x H100 (fits, but no room for large batch KV)
# - INT4: 35.4 GB -> fits on 1x H100 with room for KV cache!
# - Batch 128 -> compute-bound on H100
# - Quality floor 0.1 ppl -> need high-quality quantization

# Recommendation: FP8 W8A8 + TensorRT-LLM on 2x H100
# - Model size (FP8): ~70 GB across 2 GPUs
# - Throughput: ~1.5x FP16 from FP8 tensor cores
# - Quality: +0.02 ppl (within 0.1 floor)

# Alternative: AWQ INT4 on 1x H100
# - Saves one GPU (cost reduction!)
# - But: batch 128 is compute-bound, INT4 does not help compute
# - Throughput would be similar to FP16 (no compute benefit)
# -> FP8 on 2 GPUs is better for high-throughput batch processing

Example 3: 7B Model on MacBook Pro (CPU/Metal)

# Deployment: MacBook Pro M2 Max, 32GB RAM, personal use
scenario_3 = DeploymentConstraints(
    model_params_B=7,
    available_gpu_memory_GB=0,  # CPU inference
    num_gpus=0,
    gpu_type='CPU',
    target_latency_ms=500,
    target_throughput_tps=15,   # 15 tok/s for interactive use
    max_batch_size=1,
    quality_floor_ppl=0.2,
    cost_priority='minimize_latency',
)

# Analysis:
# - CPU/Metal inference only -> llama.cpp with GGUF
# - 32 GB RAM -> model + KV cache must fit
# - Quality floor 0.2 ppl -> Q4_K_M or Q5_K_M

# Recommendation: GGUF Q5_K_M via llama.cpp
# - Model size: 4.5 GB (fits easily in 32 GB)
# - Throughput on M2 Max: ~24 tok/s (above 15 target)
# - Quality: +0.05 ppl (within 0.2 floor)

# Alternative: GGUF Q4_K_M
# - Model size: 3.8 GB
# - Throughput: ~28 tok/s (faster)
# - Quality: +0.16 ppl (still within floor)
# -> Q5_K_M is preferred for quality, Q4_K_M if speed matters more
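The GGUF file sizes quoted above follow from effective bits-per-weight. The figures below are rough community approximations (an assumption; exact sizes depend on llama.cpp's per-tensor type mix, so check the real file), which is enough to ballpark whether a quant fits in RAM.

```python
# Approximate effective bits-per-weight for common GGUF types (rough
# assumed figures, not llama.cpp constants)
GGUF_BPW = {'Q8_0': 8.5, 'Q5_K_M': 5.7, 'Q4_K_M': 4.85, 'Q3_K_M': 3.9}

def gguf_size_gb(params_b: float, quant_type: str) -> float:
    """Rough GGUF file size for a model with params_b billion parameters."""
    return params_b * 1e9 * GGUF_BPW[quant_type] / 8 / 1e9

for qt in ('Q5_K_M', 'Q4_K_M'):
    print(f"7B {qt}: ~{gguf_size_gb(6.7, qt):.1f} GB")
```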

Cost Optimization

The cost of serving a quantized model depends on GPU-hours. Lower precision = fewer GPUs = lower cost:

import math

def serving_cost_estimate(
    model_params_B,
    bits_per_weight,
    gpu_type,
    gpu_hourly_cost,
    target_total_throughput_tps,
    avg_batch_size,
):
    """Estimate serving cost per 1M tokens.

    Bandwidth-only model: assumes decode throughput is limited by weight
    loading. A real estimate must also respect the memory floor (e.g. a
    70B FP16 model needs at least two 80 GB GPUs regardless of throughput).
    """

    # Peak memory bandwidth in bytes/s
    gpu_bw = {
        'H100': 3.35e12, 'A100': 2.0e12, 'L40S': 0.864e12,
        'RTX4090': 1.008e12,
    }

    bw = gpu_bw.get(gpu_type, 2.0e12)
    bytes_per_weight = bits_per_weight / 8
    model_bytes = model_params_B * 1e9 * bytes_per_weight

    # Decode throughput: limited by BW (for small batches)
    time_per_token = model_bytes / bw  # seconds per decode step
    tokens_per_sec_per_gpu = avg_batch_size / time_per_token

    # Number of GPUs needed to hit the aggregate target (proper ceiling,
    # not the int(x + 0.99) approximation)
    num_gpus = max(1, math.ceil(
        target_total_throughput_tps / tokens_per_sec_per_gpu
    ))

    # Cost per 1M tokens
    seconds_per_1M = 1e6 / target_total_throughput_tps
    hours_per_1M = seconds_per_1M / 3600
    cost_per_1M = hours_per_1M * gpu_hourly_cost * num_gpus

    return {
        'num_gpus': num_gpus,
        'per_gpu_tps': tokens_per_sec_per_gpu,
        'cost_per_1M_tokens': cost_per_1M,
    }

# Compare costs: 70B model on H100s
print("70B Model Serving Cost Comparison (H100 @ $3/hr, target: 100 tok/s)")
for label, bits in [('FP16', 16), ('INT8', 8), ('INT4', 4)]:
    result = serving_cost_estimate(
        model_params_B=70,
        bits_per_weight=bits,
        gpu_type='H100',
        gpu_hourly_cost=3.0,
        target_total_throughput_tps=100,
        avg_batch_size=4,
    )
    print(f"  {label:>5s}: {result['num_gpus']} GPUs, "
          f"${result['cost_per_1M_tokens']:.2f}/1M tokens")

Serving Cost: 70B Model on H100 ($3/hr per GPU)

| Precision | GPUs Needed | Cost per 1M Tokens | vs FP16 |
|---|---|---|---|
| FP16 | 4 | $33.33 | 1.0x |
| INT8 (W8A8) | 2 | $16.67 | 0.5x |
| INT4 (W4A16) | 1 | $8.33 | 0.25x |
Note: INT4 quantization reduces serving cost by 4x for a 70B model by fitting it on a single GPU. This is the primary economic motivation for quantization in production.
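The cost figures above are straightforward to reproduce: at a fixed 100 tok/s aggregate target, cost per 1M tokens is just the wall-clock GPU-hours times the hourly rate times the GPU count.

```python
def cost_per_1m_tokens(num_gpus: int, hourly_usd: float, agg_tps: float) -> float:
    """Cost of emitting 1M tokens at a fixed aggregate throughput."""
    hours_per_1m = 1e6 / agg_tps / 3600  # wall-clock hours for 1M tokens
    return hours_per_1m * hourly_usd * num_gpus

for label, gpus in (('FP16', 4), ('INT8', 2), ('INT4', 1)):
    print(f"{label}: ${cost_per_1m_tokens(gpus, 3.0, 100):.2f} per 1M tokens")
```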


The Master Decision Table


Quantization Decision Matrix

| Scenario | GPU | Batch | Method | Quality | Speed vs FP16 |
|---|---|---|---|---|---|
| 7B chat, 1 user | A100 | 1 | AWQ INT4 + Marlin | +0.04 ppl | 3.3x |
| 7B chat, 1 user | RTX4090 | 1 | AWQ INT4 + ExLlama | +0.04 ppl | 3.0x |
| 7B chat, 1 user | CPU (M2) | 1 | GGUF Q4_K_M | +0.16 ppl | N/A |
| 13B API, 32 users | A100 | 32 | AWQ INT4 + Marlin | +0.04 ppl | 2.5x |
| 70B chat, 8 users | H100 | 8 | AWQ INT4 + Marlin | +0.04 ppl | 3.8x |
| 70B batch, 128 req | 2x H100 | 128 | FP8 W8A8 (TRT-LLM) | +0.02 ppl | 1.5x |
| 70B quality-critical | 2x H100 | 8 | AWQ INT4 g128 | +0.04 ppl | 3.5x |
| 405B, any workload | 8x H100 | varies | FP8 W8A8 (TP=8) | +0.02 ppl | 1.5x |
| 7B edge device | None | 1 | GGUF Q4_K_M | +0.16 ppl | N/A |
Note: W4A16 with AWQ/GPTQ + Marlin is the default for single-GPU, low-batch serving. FP8 W8A8 is the default for multi-GPU, high-batch serving. GGUF is the only option for CPU.

Deployment Checklist

DEPLOYMENT_CHECKLIST = [
    "1. Profile production traffic: batch size distribution, prompt/completion ratio",
    "2. Estimate memory: model weights + KV cache for peak batch * peak context",
    "3. Determine regime: bandwidth-bound (W4A16) or compute-bound (W8A8/FP8)",
    "4. Select method: AWQ/GPTQ for W4A16, SmoothQuant for INT8, TE for FP8",
    "5. Quantize model with calibration data from your domain",
    "6. Validate: perplexity on domain holdout set",
    "7. Validate: task-specific accuracy if applicable",
    "8. Validate: latency/throughput meets SLOs at peak load",
    "9. Monitor: track output quality metrics in production",
    "10. Re-quantize when model is updated (quantization is not transferable)",
]

COMMON_MISTAKES = [
    "Using per-tensor activation scaling for W8A8 (use per-token)",
    "Using GPTQ with act_order and expecting Marlin compatibility",
    "Quantizing to INT4 for a batch-API workload (no throughput benefit)",
    "Not leaving memory for KV cache (model fits but inference OOMs)",
    "Using FP8 on A100 (no FP8 tensor cores)",
    "Not validating on domain-specific data (Wiki perplexity != your task)",
    "Mixing AWQ quantized model with non-AWQ inference kernel",
]
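Mistake four on the list is worth automating. A minimal guard, using the same rough accounting as estimate_model_memory above (illustrative helper, not a framework API), fails the deployment when the quantized weights fit but peak-load KV cache does not:

```python
def check_kv_headroom(weights_gb: float, gpu_gb: float,
                      batch: int, context: int,
                      layers: int, kv_heads: int, head_dim: int,
                      kv_bytes: int = 2, overhead_gb: float = 1.5) -> dict:
    """Fail fast when the model fits but peak-load KV cache does not."""
    kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V
    kv_gb = kv_per_token * context * batch / 1e9
    total = weights_gb + kv_gb + overhead_gb
    return {'kv_gb': round(kv_gb, 1), 'total_gb': round(total, 1),
            'ok': total <= gpu_gb}

# 70B AWQ INT4 (~35 GB of weights) on one 80 GB GPU: fine at batch 8,
# OOMs at batch 64 even though the weights fit with room to spare
print(check_kv_headroom(35.0, 80, batch=8, context=4096,
                        layers=80, kv_heads=8, head_dim=128))
print(check_kv_headroom(35.0, 80, batch=64, context=4096,
                        layers=80, kv_heads=8, head_dim=128))
```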