Part 27 of 30 in the Quantization Masterclass series

You quantized your 70B model to INT4 using GPTQ, dropped perplexity by 0.4 points, and the checkpoint is 35 GB. Now you need to serve it. vLLM supports GPTQ but uses the Marlin kernel, which requires a specific weight layout that your checkpoint might not have. TensorRT-LLM supports GPTQ natively but achieves higher throughput if you convert to AWQ format first. llama.cpp does not support GPTQ at all; it requires GGUF conversion, which means re-quantizing. The quantization format, the serving engine, and the kernel implementation form a three-way dependency that determines whether your quantized model runs 2x faster than FP16 or 20% slower.

This post covers the practical integration path for the three dominant serving engines: what quantization formats each supports, how to convert between them, which kernels each engine uses under the hood, what configuration knobs matter, and where each engine excels.

Quantization Format Landscape

Before diving into engines, here is the format compatibility matrix.


Quantization Format Support by Serving Engine (as of March 2025)

| Format | Precision | vLLM | TRT-LLM | llama.cpp | Key Feature |
|---|---|---|---|---|---|
| GPTQ | INT4/INT8 | Yes (Marlin kernel) | Yes (via conversion) | No | Post-training, per-group |
| AWQ | INT4 | Yes (Marlin kernel) | Yes (native) | No | Activation-aware scaling |
| FP8 (per-tensor) | FP8 E4M3 | Yes (native) | Yes (native) | No | H100/B200 only |
| FP8 (per-channel) | FP8 E4M3 | Yes | Yes | No | Higher quality than per-tensor |
| GGUF Q4_K_M | Mixed INT4/INT6 | No | No | Yes (native) | CPU+GPU, k-quant groups |
| GGUF Q5_K_M | Mixed INT5/INT6 | No | No | Yes (native) | Higher quality k-quant |
| GGUF Q8_0 | INT8 | No | No | Yes (native) | Minimal quality loss |
| SmoothQuant | INT8 | Yes | Yes (native) | No | Per-channel + migration |
| bitsandbytes NF4 | NF4 | Yes (limited) | No | No | Normal-float 4-bit |
Note: vLLM and TRT-LLM target datacenter GPUs (A100/H100). llama.cpp targets consumer hardware and CPU inference.
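
The matrix above can be encoded as a small lookup helper for deployment scripts. This is an illustrative sketch, not any engine's official API; the format and engine strings are ad hoc names.

```python
# Illustrative sketch: encode the format/engine compatibility matrix
# from the table above. Names are ad hoc, not any engine's official API.
FORMAT_SUPPORT = {
    "gptq":        ["vllm", "trt-llm"],
    "awq":         ["vllm", "trt-llm"],
    "fp8":         ["vllm", "trt-llm"],
    "smoothquant": ["vllm", "trt-llm"],
    "gguf-q4_k_m": ["llama.cpp"],
    "gguf-q5_k_m": ["llama.cpp"],
    "gguf-q8_0":   ["llama.cpp"],
    "nf4":         ["vllm"],  # limited support
}

def engines_for(fmt: str) -> list[str]:
    """Return the serving engines that can load the given format."""
    return FORMAT_SUPPORT.get(fmt.lower(), [])
```

A pipeline can call `engines_for("gptq")` before attempting a deploy and fail fast when a format/engine pair is unsupported.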

vLLM Quantized Model Serving

vLLM is the most widely deployed open-source LLM serving engine. Its quantization support is kernel-driven: the engine selects the appropriate dequantization kernel based on the model’s quantization format and the available hardware.

Loading a GPTQ Model

# Install vLLM with quantization support
pip install vllm

# Serve a GPTQ-quantized model
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-GPTQ \
    --quantization gptq \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90

vLLM automatically selects the Marlin kernel for GPTQ models when running on SM 8.0+ (A100/H100). The kernel selection logic:

# vLLM kernel selection (simplified from vllm/model_executor/layers/quantization/)
def select_gemm_kernel(quant_config, device_capability):
    if quant_config.method == "gptq":
        if device_capability >= (8, 0) and quant_config.bits == 4:
            if quant_config.group_size in [64, 128] and not quant_config.desc_act:
                return MarlinKernel  # Fastest path
            return ExllamaV2Kernel   # Fallback for desc_act=True
        return GPTQCudaKernel        # Pre-Ampere fallback

    elif quant_config.method == "awq":
        if device_capability >= (8, 0) and quant_config.bits == 4:
            return MarlinKernel      # AWQ uses same Marlin kernel
        return AWQCudaKernel

    elif quant_config.method == "fp8":
        if device_capability >= (8, 9):  # SM 8.9+ (Ada Lovelace, Hopper)
            return Fp8Kernel          # Native FP8 Tensor Cores
        raise ValueError("FP8 requires SM 8.9 (Ada Lovelace) or newer")

Loading an AWQ Model

python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-AWQ \
    --quantization awq \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --enforce-eager  # Disable CUDA graphs for debugging

FP8 on vLLM

FP8 quantization on vLLM requires a GPU with native FP8 Tensor Cores (SM 8.9+: Ada Lovelace or, in the datacenter, H100 and newer). The model can be quantized online (during loading) or loaded from a pre-quantized checkpoint:

# Online FP8 quantization (quantize FP16 model on load)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    quantization="fp8",
    tensor_parallel_size=4,
    # FP8 reduces weights from 140GB to 70GB
    # Fits on 4x H100 with room for large KV cache
)

# Or load pre-quantized FP8 checkpoint
llm = LLM(
    model="neuralmagic/Meta-Llama-3-70B-FP8",
    tensor_parallel_size=4,
)
vLLM Marlin Kernel Performance

The Marlin kernel in vLLM achieves 3.5-3.8x speedup over FP16 for decode (batch=1) on H100. For prefill with large batches, the speedup drops to 1.2-1.5x because the compute-bound GEMM is limited by dequantization overhead. vLLM automatically uses FP16 cuBLAS for prefill when the batch size exceeds a threshold (typically 32-64).
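
The prefill/decode dispatch described above amounts to a batch-size gate. A minimal sketch, with an illustrative threshold (vLLM's actual cutoff varies by version and hardware):

```python
# Sketch of the batch-size-gated kernel dispatch described above.
# MARLIN_BATCH_THRESHOLD is an illustrative value somewhere in the
# 32-64 range cited in the text, not vLLM's actual constant.
MARLIN_BATCH_THRESHOLD = 48

def pick_gemm_path(num_tokens: int) -> str:
    """Memory-bound decode favors Marlin's fused dequant-GEMM;
    compute-bound prefill favors dequantizing once and using cuBLAS."""
    if num_tokens <= MARLIN_BATCH_THRESHOLD:
        return "marlin_w4a16"   # dequant overhead hidden behind memory loads
    return "cublas_fp16"        # dequant overhead dominates at large batch
```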

vLLM Decode Throughput by Quantization Format (Llama 70B, H100, batch=1)

| Configuration | Decode throughput |
|---|---|
| FP16 (2x H100, TP=2) | 26 tok/s |
| FP8 (1x H100) | 52 tok/s |
| GPTQ INT4 (Marlin, 1x H100) | 68 tok/s |
| AWQ INT4 (Marlin, 1x H100) | 66 tok/s |

TensorRT-LLM Quantized Model Serving

TensorRT-LLM (TRT-LLM) is NVIDIA’s optimized inference engine. It compiles models into TensorRT engines with fused kernels and supports INT4, INT8, and FP8 quantization natively.

Building a Quantized Engine

TRT-LLM requires an explicit build step that generates a serialized engine file:

# Step 1: Convert HuggingFace model to TRT-LLM checkpoint format
python convert_checkpoint.py \
    --model_dir /models/Llama-2-70b-hf \
    --output_dir /checkpoints/llama-70b-int4 \
    --dtype float16 \
    --quant_ckpt_path /models/Llama-2-70B-GPTQ \
    --use_weight_only \
    --weight_only_precision int4 \
    --per_group \
    --group_size 128 \
    --tp_size 2

# Step 2: Build TRT engine
trtllm-build \
    --checkpoint_dir /checkpoints/llama-70b-int4 \
    --output_dir /engines/llama-70b-int4 \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_output_len 2048 \
    --max_beam_width 1 \
    --workers 2

# Step 3: Serve with Triton
python launch_triton_server.py \
    --model_repo /triton_model_repo \
    --tensorrt_llm_model_name llama-70b-int4

TRT-LLM FP8 Quantization

TRT-LLM has the most mature FP8 support. It uses per-tensor or per-channel FP8 scales calibrated on a small dataset:

# TRT-LLM FP8 quantization with AMMO (NVIDIA's quantization toolkit)
import ammo.torch.quantization as atq
from ammo.torch.export import export_model_config

# Load FP16 model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")

# Calibrate FP8 scales
quant_config = atq.FP8_DEFAULT_CFG
atq.quantize(model, quant_config, forward_loop=calibration_loop)

# Export to TRT-LLM checkpoint
export_model_config(
    model,
    decoder_type="llama",
    dtype="float16",
    quantization="fp8",
    export_dir="/checkpoints/llama-70b-fp8"
)

TRT-LLM INT4 AWQ

TRT-LLM has a dedicated AWQ quantization path that uses NVIDIA’s custom kernels:

# AWQ quantization within TRT-LLM
from tensorrt_llm.quantization import quantize_awq

quantized_model = quantize_awq(
    model_dir="/models/Llama-2-70b-hf",
    output_dir="/checkpoints/llama-70b-awq",
    quant_config={
        "bits": 4,
        "group_size": 128,
        "zero_point": True,
        "calib_size": 512,
        "calib_dataset": "c4",
    },
    tensor_parallel_size=2,
)

TRT-LLM Build Time and Engine Size (Llama 70B)

| Quantization | Build Time | Engine Size | GPU RAM (runtime) | Max Batch (1x H100) |
|---|---|---|---|---|
| FP16 | 45 min | 140 GB | ~145 GB (TP=2) | 32 |
| FP8 | 50 min | 70 GB | ~75 GB | 64 |
| INT8 (SmoothQuant) | 55 min | 70 GB | ~78 GB | 64 |
| INT4 (AWQ) | 60 min | 37 GB | ~42 GB | 128 |
| INT4 (GPTQ) | 60 min | 37 GB | ~42 GB | 128 |
Note: Build time includes weight conversion and TensorRT optimization passes. Engine size is the serialized file on disk.
⚠️ TRT-LLM Engine Portability

TRT-LLM engines are NOT portable across GPU architectures. An engine built for H100 will not run on A100. You must rebuild the engine for each target GPU. The engine is also tied to the specific TRT-LLM version and CUDA version used during build.
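
A deployment pipeline can guard against this pitfall by stamping each engine with its build environment and refusing mismatched loads. This is a hypothetical sketch; the field names are illustrative, though TRT-LLM records similar metadata internally.

```python
# Hypothetical guard against the portability pitfall above: record the
# build environment alongside the engine and compare before loading.
# Field names and version strings are illustrative examples.
from dataclasses import dataclass

@dataclass(frozen=True)
class EngineStamp:
    sm: str              # compute capability, e.g. "90" for H100
    trtllm_version: str
    cuda_version: str

def can_load(built_with: EngineStamp, runtime: EngineStamp) -> bool:
    # Engines are tied to GPU architecture, TRT-LLM version, and CUDA
    # version, so anything short of an exact match needs a rebuild.
    return built_with == runtime

h100_build = EngineStamp("90", "0.17.0", "12.6")
a100_host = EngineStamp("80", "0.17.0", "12.6")
assert not can_load(h100_build, a100_host)  # must rebuild for A100
```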

llama.cpp Quantized Model Serving

llama.cpp uses the GGUF format with its own quantization types (Q4_0, Q4_K_M, Q5_K_M, Q8_0, etc.). The “K-quant” formats use mixed precision within each group, storing more important weights at higher precision.

Converting to GGUF

# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py \
    /models/Llama-2-70b-hf \
    --outfile llama-70b-f16.gguf \
    --outtype f16

# Quantize to INT4 (K-quant medium); in recent llama.cpp builds the
# binary is named llama-quantize
./quantize llama-70b-f16.gguf llama-70b-q4_k_m.gguf Q4_K_M

# Available quantization types (by quality):
# Q2_K:   2.5 bits/weight, lowest quality
# Q3_K_M: 3.4 bits/weight
# Q4_0:   4.0 bits/weight, basic INT4
# Q4_K_M: 4.8 bits/weight, mixed precision (recommended)
# Q5_K_M: 5.5 bits/weight, higher quality
# Q6_K:   6.5 bits/weight
# Q8_0:   8.0 bits/weight, minimal quality loss

K-Quant Internals

The K-quant format uses a “super-block” structure where different sub-groups within a block use different bit widths:

// Q4_K_M block structure (from ggml-quants.h)
typedef struct {
    ggml_half d;          // Super-block scale (FP16)
    ggml_half dmin;       // Super-block minimum (FP16)
    uint8_t scales[12];   // Sub-block scales and mins (6-bit each)
    uint8_t qs[128];      // 256 INT4 quantized values packed as 128 bytes
} block_q4_K;
// sizeof(block_q4_K) = 144 bytes for 256 weights
// Effective: 144 * 8 / 256 = 4.5 bits per weight

// Q5_K_M adds a high-bit plane
typedef struct {
    ggml_half d;
    ggml_half dmin;
    uint8_t scales[12];
    uint8_t qh[32];       // High bits (1 bit per weight, 256/8 = 32 bytes)
    uint8_t qs[128];      // Low 4 bits (same as Q4_K)
} block_q5_K;
// sizeof(block_q5_K) = 176 bytes for 256 weights
// Effective: 176 * 8 / 256 = 5.5 bits per weight

Serving with llama.cpp Server

# Start llama.cpp server with GPU offloading (the binary is named
# llama-server in recent llama.cpp builds)
./server \
    --model llama-70b-q4_k_m.gguf \
    --n-gpu-layers 80 \
    --ctx-size 4096 \
    --batch-size 512 \
    --parallel 4 \
    --threads 16 \
    --host 0.0.0.0 \
    --port 8080 \
    --flash-attn

# GPU layer offloading:
# -ngl 0:  All layers on CPU (slowest)
# -ngl 40: Half layers on GPU (mixed CPU/GPU)
# -ngl 80: All layers on GPU (fastest, requires enough VRAM)

llama.cpp Quantization Type vs Quality (Llama 70B, Perplexity on WikiText-2)

| Type | Perplexity (lower is better) | vs FP16 |
|---|---|---|
| FP16 (baseline) | 3.32 | baseline |
| Q8_0 | 3.33 | +0.3% |
| Q5_K_M | 3.42 | +3.0% |
| Q4_K_M | 3.58 | +7.8% |
| Q4_0 | 3.74 | +12.7% |
| Q3_K_M | 4.12 | +24.1% |
| Q2_K | 5.81 | +75.0% |

Cross-Engine Performance Comparison

Comparing engines requires controlling for hardware, model, and workload. Here is a standardized comparison on Llama 70B with INT4 quantization.


Serving Engine Comparison: Llama 70B INT4, H100 80GB

| Metric | vLLM (AWQ+Marlin) | TRT-LLM (AWQ) | llama.cpp (Q4_K_M) | Notes |
|---|---|---|---|---|
| Decode throughput (batch=1) | 68 tok/s | 74 tok/s | 45 tok/s | TRT-LLM has the most fused kernels |
| Decode throughput (batch=32) | 1850 tok/s | 2100 tok/s | N/A | llama.cpp batching limited to --parallel slots |
| Prefill latency (512 tokens) | 28 ms | 22 ms | 42 ms | TRT-LLM's aggressive kernel fusion |
| Time to first token (batch=1) | 32 ms | 26 ms | 48 ms | Includes preprocessing |
| Max concurrent requests | 128 | 64 (engine limit) | 4 (--parallel) | vLLM handles concurrency best |
| Setup complexity | Low (pip install) | High (build engine) | Low (compile binary) | TRT-LLM requires expertise |
| GPU memory efficiency | 90%+ | Fixed at build time | 85-90% | vLLM's PagedAttention is best |
Note: llama.cpp excels at single-user interactive use. vLLM excels at multi-user serving. TRT-LLM excels at peak throughput when properly configured.
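
Throughput numbers like these translate directly into serving cost. A back-of-envelope helper; the $/hour figure below is an assumed rental price, not a quote:

```python
# Back-of-envelope cost per million output tokens from sustained decode
# throughput. The GPU $/hour figure is an assumption for illustration.
def usd_per_million_tokens(tokens_per_sec: float, gpu_usd_per_hour: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

# At batch=32 on one H100 (assume ~$2.50/hr): 1850 vs 2100 tok/s
vllm_cost = usd_per_million_tokens(1850, 2.50)   # ~ $0.38 / Mtok
trt_cost = usd_per_million_tokens(2100, 2.50)    # ~ $0.33 / Mtok
assert trt_cost < vllm_cost  # higher throughput -> lower cost per token
```

The ~12% throughput gap between engines compounds into the same 12% difference in cost per token at full utilization.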

Format Conversion Between Engines

Models quantized for one engine often need conversion for another.

GPTQ to AWQ (via re-quantization)

There is no direct GPTQ-to-AWQ conversion because the quantization algorithms produce different weight values. You must re-quantize from the original FP16 model.
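
A toy example of why chaining quantizers loses information: the two methods choose different scale grids, and re-quantizing an already-rounded value compounds rounding error versus quantizing the FP16 original directly. The scales below are arbitrary illustrative values, not real GPTQ/AWQ outputs.

```python
# Toy illustration of why GPTQ -> AWQ conversion must start from FP16.
# Scales are arbitrary examples, not values either algorithm would pick.
def quant_dequant(w: float, scale: float, bits: int = 4) -> float:
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(lo, min(hi, round(w / scale)))
    return q * scale

w_fp16 = 0.22
on_grid_a = quant_dequant(w_fp16, 0.05)     # "GPTQ-like" grid -> 0.20
chained = quant_dequant(on_grid_a, 0.06)    # re-quantize the rounded value
direct = quant_dequant(w_fp16, 0.06)        # "AWQ-like" grid from FP16

assert chained != direct                              # 0.18 vs 0.24
assert abs(chained - w_fp16) > abs(direct - w_fp16)   # chaining is worse
```

The same effect, summed over billions of weights, is why every cross-format path in this section routes through the original FP16 checkpoint.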

HuggingFace GPTQ to vLLM

# vLLM loads HuggingFace GPTQ models directly
# No conversion needed -- just point to the model directory
from vllm import LLM
llm = LLM(model="TheBloke/Llama-2-70B-GPTQ", quantization="gptq")

HuggingFace to GGUF

# Use the convert script from llama.cpp
# Handles weight name mapping and format conversion
import subprocess

subprocess.run([
    "python", "convert_hf_to_gguf.py",
    "/models/Llama-2-70b-hf",
    "--outfile", "llama-70b-f16.gguf",
    "--outtype", "f16"
])

# Then quantize to desired GGUF format
subprocess.run([
    "./quantize",
    "llama-70b-f16.gguf",
    "llama-70b-q4_k_m.gguf",
    "Q4_K_M"
])

GPTQ/AWQ to TRT-LLM

# TRT-LLM has dedicated conversion scripts
# From HuggingFace GPTQ checkpoint:
# python convert_checkpoint.py --model_dir <gptq_model> --output_dir <trt_ckpt>

# The conversion extracts:
# Packed INT4 weights (qweight)
# Scales (scales)
# Zero-points (qzeros)
# Group size from config.json
# And repacks them into TRT-LLM's internal layout
ℹ️ FP8 Is the Simplest Path

FP8 quantization is the easiest to deploy across engines because it does not require offline calibration datasets (for per-tensor quantization) and both vLLM and TRT-LLM support it natively on H100. The quality loss is minimal (typically less than 0.5% on benchmarks). If you have H100s, start with FP8 before trying INT4.
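
Per-tensor FP8 scaling needs only the tensor's absolute maximum, which is why no calibration dataset is required. A minimal sketch, assuming E4M3's maximum finite magnitude of 448; the actual rounding to the E4M3 grid happens in the cast kernel and is not modeled here:

```python
# Minimal per-tensor FP8 (E4M3) scaling sketch: one scale per tensor,
# chosen so the largest weight maps to E4M3's max representable value.
E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def fp8_per_tensor_scale(weights: list[float]) -> float:
    amax = max(abs(w) for w in weights)
    return amax / E4M3_MAX if amax > 0 else 1.0

def to_fp8_domain(w: float, scale: float) -> float:
    # Scale into E4M3's dynamic range and clamp; rounding to the actual
    # E4M3 grid is done by the hardware cast and is omitted here.
    x = w / scale
    return max(-E4M3_MAX, min(E4M3_MAX, x))

ws = [0.01, -2.24, 0.5]
s = fp8_per_tensor_scale(ws)                 # 2.24 / 448 = 0.005
assert abs(abs(to_fp8_domain(-2.24, s)) - E4M3_MAX) < 1e-9
```

Because the scale is a single max-reduction over the weights, it can be computed at load time, which is exactly what vLLM's online `quantization="fp8"` path does.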

Production Configuration Recipes

vLLM Production Config (INT4 AWQ)

python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-AWQ \
    --quantization awq \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --speculative-model TheBloke/Llama-2-7B-AWQ \
    --num-speculative-tokens 5 \
    --swap-space 16 \
    --disable-log-requests \
    --uvicorn-log-level warning

TRT-LLM Production Config (FP8)

# Build engine with production settings
trtllm-build \
    --checkpoint_dir /checkpoints/llama-70b-fp8 \
    --output_dir /engines/llama-70b-fp8 \
    --gemm_plugin fp8 \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_output_len 4096 \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable \
    --workers 4 \
    --multiple_profiles enable \
    --reduce_fusion enable

llama.cpp Production Config (Q4_K_M)

./server \
    --model llama-70b-q4_k_m.gguf \
    --n-gpu-layers 80 \
    --ctx-size 8192 \
    --batch-size 2048 \
    --ubatch-size 512 \
    --parallel 8 \
    --threads 8 \
    --flash-attn \
    --mlock \
    --cont-batching \
    --metrics \
    --host 0.0.0.0 --port 8080

Troubleshooting Common Issues

vLLM: “Marlin kernel not supported”

# Marlin requires:
# SM >= 8.0 (A100 or newer)
# group_size in {64, 128}
# desc_act = False (no activation reordering)
# bits = 4

# Check your model's quantize_config.json:
# {"bits": 4, "group_size": 128, "desc_act": false}

# If desc_act=true, vLLM falls back to ExLlama v2 (slower)
# Solution: re-quantize with desc_act=false
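
The checks above can be automated against a model's quantize_config.json before deployment. A sketch using the config keys shown above; the function itself is illustrative, not part of vLLM:

```python
# Pre-flight check for Marlin eligibility, using the quantize_config.json
# fields shown above. Returns the reasons a model would fall back to a
# slower kernel; an empty list means the fast path should apply.
def marlin_blockers(cfg: dict, sm: tuple) -> list:
    reasons = []
    if sm < (8, 0):
        reasons.append("requires SM >= 8.0 (Ampere or newer)")
    if cfg.get("bits") != 4:
        reasons.append("requires 4-bit weights")
    if cfg.get("group_size") not in (64, 128):
        reasons.append("requires group_size 64 or 128")
    if cfg.get("desc_act", False):
        reasons.append("desc_act must be false (re-quantize to fix)")
    return reasons

good = {"bits": 4, "group_size": 128, "desc_act": False}
assert marlin_blockers(good, (9, 0)) == []

bad = {"bits": 4, "group_size": 128, "desc_act": True}
assert marlin_blockers(bad, (9, 0)) == [
    "desc_act must be false (re-quantize to fix)"
]
```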

TRT-LLM: Build OOM

# TRT-LLM build can require 2-3x the model size in CPU RAM
# For 70B model: need ~400GB CPU RAM during build

# Solution 1: Use --workers to parallelize and reduce per-worker memory
# Solution 2: Build on a high-memory machine, deploy engine elsewhere
# Solution 3: Use --strongly_typed to reduce optimization search space

llama.cpp: GPU Offloading Fails

# If -ngl 80 causes OOM:
# Check actual VRAM with nvidia-smi
# Reduce layers: -ngl 60 (keep some on CPU)
# Reduce context: --ctx-size 2048
# Use smaller quantization: Q3_K_M instead of Q4_K_M

# Memory estimation:
# Q4_K_M model size + KV cache + compute buffers
# Llama 70B Q4_K_M: ~38 GB model + ~4 GB KV (ctx=4096) = ~42 GB
# Fits on 1x 48GB GPU (A6000, L40S) or 1x 80GB (A100, H100)
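
The KV-cache term in that estimate can be computed exactly from the model shape. A sketch assuming the standard Llama-2-70B configuration (80 layers, 8 KV heads under GQA, head_dim 128) and an FP16 cache; note this counts pure KV storage only, while llama.cpp's runtime footprint also includes compute buffers and scratch space on top.

```python
# KV-cache size estimator for planning -ngl offloading. Shapes default
# to the standard Llama-2-70B config (80 layers, 8 KV heads under GQA,
# head_dim 128); FP16 cache assumed (2 bytes per element).
def kv_cache_bytes(ctx: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem

gib = kv_cache_bytes(4096) / 2**30
assert 1.2 < gib < 1.3   # ~1.25 GiB of pure KV at ctx=4096; buffers
                         # and scratch add several GB on top of this
```

Doubling `--ctx-size` doubles this term linearly, which is the main lever when a model almost fits in VRAM.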

Time to Deploy (From FP16 Checkpoint to Serving First Request)

| Path | Time | Notes |
|---|---|---|
| vLLM (GPTQ, pre-quantized) | ~5 min | Just load and serve |
| vLLM (FP8, online quant) | ~8 min | Quantizes during model load |
| llama.cpp (GGUF convert + quantize) | ~45 min | Two-step conversion |
| TRT-LLM (convert + build + deploy) | ~120 min | Engine build is slow |

When to Use Which Engine


Engine Selection Guide

| Scenario | Recommended Engine | Quantization | Why |
|---|---|---|---|
| Multi-user API serving, datacenter GPU | vLLM | AWQ INT4 or FP8 | Best batching, PagedAttention, easy setup |
| Maximum single-stream throughput | TRT-LLM | FP8 | Most aggressive kernel fusion |
| Consumer GPU (RTX 4090) | llama.cpp | Q4_K_M | Best memory efficiency, no CUDA toolkit needed |
| CPU-only server | llama.cpp | Q4_K_M | Only engine with competitive CPU perf |
| Edge deployment (Jetson) | TRT-LLM | INT8 | TensorRT optimized for NVIDIA edge |
| Rapid prototyping | vLLM | FP8 online | No pre-quantization step needed |
| Apple Silicon Mac | llama.cpp | Q4_K_M | Metal GPU support built-in |
Note: vLLM is the default choice for most production deployments. TRT-LLM for peak performance when build complexity is acceptable. llama.cpp for consumer/edge devices.

Summary

Serving quantized models requires matching the quantization format to the serving engine’s supported kernels. vLLM provides the easiest path with direct HuggingFace model loading, automatic Marlin kernel selection, and PagedAttention for efficient memory management. TRT-LLM delivers the highest raw throughput but requires an explicit engine build step and GPU-specific compilation. llama.cpp offers the broadest hardware support (NVIDIA, AMD, Apple, CPU) with its GGUF format and K-quant mixed-precision types. For datacenter deployments on NVIDIA hardware, start with vLLM + AWQ/FP8. Upgrade to TRT-LLM when you need every last percent of throughput and can invest in the build pipeline.