You quantized your 70B model to INT4 using GPTQ, dropped perplexity by 0.4 points, and the checkpoint is 35 GB. Now you need to serve it. vLLM supports GPTQ but uses the Marlin kernel, which requires a specific weight layout that your checkpoint might not have. TensorRT-LLM supports GPTQ natively but achieves higher throughput if you re-quantize to AWQ first. llama.cpp does not support GPTQ at all—it requires GGUF conversion, which means re-quantizing. The quantization format, the serving engine, and the kernel implementation form a three-way dependency that determines whether your quantized model delivers 2x speedup or 0.8x slowdown versus FP16.
This post covers the practical integration path for the three dominant serving engines: what quantization formats each supports, how to convert between them, which kernels each engine uses under the hood, what configuration knobs matter, and where each engine excels.
Quantization Format Landscape
Before diving into engines, here is the format compatibility matrix.
Quantization Format Support by Serving Engine (as of March 2025)
| Format | Precision | vLLM | TRT-LLM | llama.cpp | Key Feature |
|---|---|---|---|---|---|
| GPTQ | INT4/INT8 | Yes (Marlin kernel) | Yes (via conversion) | No | Post-training, per-group |
| AWQ | INT4 | Yes (Marlin kernel) | Yes (native) | No | Activation-aware scaling |
| FP8 (per-tensor) | FP8 E4M3 | Yes (native) | Yes (native) | No | H100/B200 only |
| FP8 (per-channel) | FP8 E4M3 | Yes | Yes | No | Higher quality than per-tensor |
| GGUF Q4_K_M | Mixed INT4/INT6 | No | No | Yes (native) | CPU+GPU, k-quant groups |
| GGUF Q5_K_M | Mixed INT5/INT6 | No | No | Yes (native) | Higher quality k-quant |
| GGUF Q8_0 | INT8 | No | No | Yes (native) | Minimal quality loss |
| SmoothQuant | INT8 | Yes | Yes (native) | No | Per-channel + migration |
| bitsandbytes NF4 | NF4 | Yes (limited) | No | No | Normal float 4-bit |
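As a quick sanity check before committing to an engine, the quantization method can usually be read straight from the checkpoint's config (modern HuggingFace checkpoints store it under `quantization_config.quant_method` in `config.json`; older GPTQ exports use a separate `quantize_config.json`). A minimal sketch, with a lookup table that simply mirrors the matrix above:

```python
# Map a checkpoint's declared quantization method to engines that can
# serve it directly, mirroring the compatibility matrix above.
ENGINE_SUPPORT = {
    "gptq": ["vllm", "trt-llm"],          # llama.cpp needs GGUF re-quantization
    "awq": ["vllm", "trt-llm"],
    "fp8": ["vllm", "trt-llm"],           # H100/B200-class GPUs only
    "gguf": ["llama.cpp"],
    "smoothquant": ["vllm", "trt-llm"],
    "bitsandbytes": ["vllm"],             # limited support
}

def compatible_engines(hf_config: dict) -> list[str]:
    """Return serving engines for a HuggingFace-style config dict."""
    quant = hf_config.get("quantization_config")
    if quant is None:
        # Unquantized FP16/BF16 checkpoints load everywhere
        return ["vllm", "trt-llm", "llama.cpp"]
    method = quant.get("quant_method", "").lower()
    return ENGINE_SUPPORT.get(method, [])

# Example: a GPTQ checkpoint's config
gptq_config = {"quantization_config": {"quant_method": "gptq", "bits": 4}}
print(compatible_engines(gptq_config))  # ['vllm', 'trt-llm']
```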
vLLM Quantized Model Serving
vLLM is the most widely deployed open-source LLM serving engine. Its quantization support is kernel-driven: the engine selects the appropriate dequantization kernel based on the model’s quantization format and the available hardware.
Loading a GPTQ Model
```bash
# Install vLLM with quantization support
pip install vllm

# Serve a GPTQ-quantized model
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-GPTQ \
  --quantization gptq \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```
vLLM automatically selects the Marlin kernel for GPTQ models when running on SM 8.0+ (A100/H100). The kernel selection logic:
```python
# vLLM kernel selection (simplified from vllm/model_executor/layers/quantization/)
def select_gemm_kernel(quant_config, device_capability):
    if quant_config.method == "gptq":
        if device_capability >= (8, 0) and quant_config.bits == 4:
            if quant_config.group_size in [64, 128] and not quant_config.desc_act:
                return MarlinKernel  # Fastest path
            return ExllamaV2Kernel  # Fallback for desc_act=True
        return GPTQCudaKernel  # Pre-Ampere fallback
    elif quant_config.method == "awq":
        if device_capability >= (8, 0) and quant_config.bits == 4:
            return MarlinKernel  # AWQ uses the same Marlin kernel
        return AWQCudaKernel
    elif quant_config.method == "fp8":
        if device_capability >= (8, 9):  # Ada (SM 8.9) / Hopper (SM 9.0) and newer
            return Fp8Kernel  # Native FP8 Tensor Cores
        raise ValueError("FP8 requires SM 8.9 (Ada/Hopper) or newer")
```
Loading an AWQ Model
```bash
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --enforce-eager  # Disable CUDA graphs for debugging
```
FP8 on vLLM
FP8 quantization on vLLM requires H100 or newer. The model can be quantized online (during loading) or loaded from a pre-quantized checkpoint:
```python
# Online FP8 quantization (quantize the FP16 model on load)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    quantization="fp8",
    tensor_parallel_size=4,
    # FP8 reduces weights from 140 GB to 70 GB,
    # fitting on 4x H100 with room for a large KV cache
)

# Or load a pre-quantized FP8 checkpoint
llm = LLM(
    model="neuralmagic/Meta-Llama-3-70B-FP8",
    tensor_parallel_size=4,
)
```
The Marlin kernel in vLLM achieves 3.5-3.8x speedup over FP16 for decode (batch=1) on H100. For prefill with large batches, the speedup drops to 1.2-1.5x because the compute-bound GEMM is limited by dequantization overhead. vLLM automatically uses FP16 cuBLAS for prefill when the batch size exceeds a threshold (typically 32-64).
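The decode speedup is easy to sanity-check from first principles: batch-1 decode is memory-bandwidth-bound, so the ideal speedup is roughly the ratio of bytes streamed per token. A back-of-the-envelope sketch (assuming group size 128 with one FP16 scale per group; the gap between the ideal and the measured 3.5-3.8x is dequantization and kernel overhead):

```python
def quant_weight_bytes(n_params: float, bits: float,
                       group_size: int = 128, scale_bits: int = 16) -> float:
    """Bytes streamed to read all quantized weights once (one decode step):
    packed values plus one FP16 scale per group."""
    return n_params * bits / 8 + (n_params / group_size) * scale_bits / 8

fp16_bytes = 70e9 * 2                      # plain FP16: no scales
int4_bytes = quant_weight_bytes(70e9, 4)   # 4-bit weights + group scales
ideal = fp16_bytes / int4_bytes
print(f"ideal decode speedup: {ideal:.2f}x")  # ideal decode speedup: 3.88x
```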
*[Chart omitted: vLLM decode throughput by quantization format, Llama 70B on H100, batch=1, in tokens/sec.]*

TensorRT-LLM Quantized Model Serving
TensorRT-LLM (TRT-LLM) is NVIDIA’s optimized inference engine. It compiles models into TensorRT engines with fused kernels and supports INT4, INT8, and FP8 quantization natively.
Building a Quantized Engine
TRT-LLM requires an explicit build step that generates a serialized engine file:
```bash
# Step 1: Convert HuggingFace model to TRT-LLM checkpoint format
python convert_checkpoint.py \
  --model_dir /models/Llama-2-70b-hf \
  --output_dir /checkpoints/llama-70b-int4 \
  --dtype float16 \
  --quant_ckpt_path /models/Llama-2-70B-GPTQ \
  --use_weight_only \
  --weight_only_precision int4 \
  --per_group \
  --group_size 128 \
  --tp_size 2

# Step 2: Build the TRT engine
trtllm-build \
  --checkpoint_dir /checkpoints/llama-70b-int4 \
  --output_dir /engines/llama-70b-int4 \
  --gemm_plugin float16 \
  --max_batch_size 64 \
  --max_input_len 2048 \
  --max_output_len 2048 \
  --max_beam_width 1 \
  --workers 2

# Step 3: Serve with Triton
python launch_triton_server.py \
  --model_repo /triton_model_repo \
  --tensorrt_llm_model_name llama-70b-int4
```
TRT-LLM FP8 Quantization
TRT-LLM has the most mature FP8 support. It uses per-tensor or per-channel FP8 scales calibrated on a small dataset:
```python
# TRT-LLM FP8 quantization with AMMO
# (NVIDIA's quantization toolkit, since renamed TensorRT Model Optimizer)
import ammo.torch.quantization as atq
from ammo.torch.export import export_model_config
from transformers import AutoModelForCausalLM

# Load the FP16 model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")

# Calibrate FP8 scales; calibration_loop is a user-supplied function
# that runs a few hundred samples through the model's forward pass
quant_config = atq.FP8_DEFAULT_CFG
atq.quantize(model, quant_config, forward_loop=calibration_loop)

# Export to a TRT-LLM checkpoint
export_model_config(
    model,
    decoder_type="llama",
    dtype="float16",
    quantization="fp8",
    export_dir="/checkpoints/llama-70b-fp8",
)
```
TRT-LLM INT4 AWQ
TRT-LLM has a dedicated AWQ quantization path that uses NVIDIA’s custom kernels:
```python
# AWQ quantization within TRT-LLM
from tensorrt_llm.quantization import quantize_awq

quantized_model = quantize_awq(
    model_dir="/models/Llama-2-70b-hf",
    output_dir="/checkpoints/llama-70b-awq",
    quant_config={
        "bits": 4,
        "group_size": 128,
        "zero_point": True,
        "calib_size": 512,
        "calib_dataset": "c4",
    },
    tensor_parallel_size=2,
)
```
TRT-LLM Build Time and Engine Size (Llama 70B)
| Quantization | Build Time | Engine Size | GPU RAM (runtime) | Max Batch (1x H100) |
|---|---|---|---|---|
| FP16 | 45 min | 140 GB | ~145 GB (TP=2) | 32 |
| FP8 | 50 min | 70 GB | ~75 GB | 64 |
| INT8 (SmoothQuant) | 55 min | 70 GB | ~78 GB | 64 |
| INT4 (AWQ) | 60 min | 37 GB | ~42 GB | 128 |
| INT4 (GPTQ) | 60 min | 37 GB | ~42 GB | 128 |
TRT-LLM engines are NOT portable across GPU architectures. An engine built for H100 will not run on A100. You must rebuild the engine for each target GPU. The engine is also tied to the specific TRT-LLM version and CUDA version used during build.
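Since nothing in the engine file itself stops you from copying it to the wrong machine, it is worth recording the build environment in a sidecar manifest and verifying it at load time. A minimal sketch — the manifest format and field names here are this post's own convention, not anything TRT-LLM produces:

```python
import json

def build_manifest(gpu_arch: str, trtllm_version: str, cuda_version: str) -> str:
    """Serialize the build environment; store next to the engine file."""
    return json.dumps({
        "gpu_arch": gpu_arch,
        "trtllm_version": trtllm_version,
        "cuda_version": cuda_version,
    })

def engine_is_loadable(manifest_json: str, gpu_arch: str,
                       trtllm_version: str, cuda_version: str) -> bool:
    """True only if the runtime matches the recorded build environment."""
    m = json.loads(manifest_json)
    return (m["gpu_arch"] == gpu_arch
            and m["trtllm_version"] == trtllm_version
            and m["cuda_version"] == cuda_version)

# An engine built on H100 (SM 9.0) must not load on A100 (SM 8.0)
manifest = build_manifest("sm90", "0.17.0", "12.4")
print(engine_is_loadable(manifest, "sm80", "0.17.0", "12.4"))  # False
```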
llama.cpp Quantized Model Serving
llama.cpp uses the GGUF format with its own quantization types (Q4_0, Q4_K_M, Q5_K_M, Q8_0, etc.). The “K-quant” formats use mixed precision within each group, storing more important weights at higher precision.
Converting to GGUF
```bash
# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py \
  /models/Llama-2-70b-hf \
  --outfile llama-70b-f16.gguf \
  --outtype f16

# Quantize to INT4 (K-quant medium); newer llama.cpp builds
# name this binary llama-quantize
./quantize llama-70b-f16.gguf llama-70b-q4_k_m.gguf Q4_K_M

# Available quantization types (by quality):
# Q2_K:   2.5 bits/weight, lowest quality
# Q3_K_M: 3.4 bits/weight
# Q4_0:   4.0 bits/weight, basic INT4
# Q4_K_M: 4.8 bits/weight, mixed precision (recommended)
# Q5_K_M: 5.5 bits/weight, higher quality
# Q6_K:   6.5 bits/weight
# Q8_0:   8.0 bits/weight, minimal quality loss
```
K-Quant Internals
The K-quant format uses a “super-block” structure where different sub-groups within a block use different bit widths:
```c
// Q4_K_M block structure (from ggml-quants.h)
typedef struct {
    ggml_half d;         // Super-block scale (FP16)
    ggml_half dmin;      // Super-block minimum (FP16)
    uint8_t scales[12];  // Sub-block scales and mins (6-bit each)
    uint8_t qs[128];     // 256 INT4 quantized values packed as 128 bytes
} block_q4_K;
// sizeof(block_q4_K) = 144 bytes for 256 weights
// Effective: 144 * 8 / 256 = 4.5 bits per weight

// Q5_K_M adds a high-bit plane
typedef struct {
    ggml_half d;
    ggml_half dmin;
    uint8_t scales[12];
    uint8_t qh[32];      // High bits (1 bit per weight, 256/8 = 32 bytes)
    uint8_t qs[128];     // Low 4 bits (same as Q4_K)
} block_q5_K;
// sizeof(block_q5_K) = 176 bytes for 256 weights
// Effective: 176 * 8 / 256 = 5.5 bits per weight
```
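The effective bits-per-weight figures fall straight out of the struct layouts. A quick arithmetic check (QK_K = 256 weights per super-block, FP16 fields are 2 bytes each):

```python
QK_K = 256  # weights per GGUF super-block

def bits_per_weight(*field_bytes: int) -> float:
    """Effective bits/weight for a block built from the given byte fields."""
    return sum(field_bytes) * 8 / QK_K

# block_q4_K: d (2) + dmin (2) + scales[12] + qs[128] = 144 bytes
q4_k = bits_per_weight(2, 2, 12, 128)
# block_q5_K adds qh[32], the extra high-bit plane = 176 bytes
q5_k = bits_per_weight(2, 2, 12, 32, 128)
print(q4_k, q5_k)  # 4.5 5.5
```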
Serving with llama.cpp Server
```bash
# Start the llama.cpp server with GPU offloading
# (newer llama.cpp builds name this binary llama-server)
./server \
  --model llama-70b-q4_k_m.gguf \
  --n-gpu-layers 80 \
  --ctx-size 4096 \
  --batch-size 512 \
  --parallel 4 \
  --threads 16 \
  --host 0.0.0.0 \
  --port 8080 \
  --flash-attn

# GPU layer offloading:
# -ngl 0:  all layers on CPU (slowest)
# -ngl 40: half the layers on GPU (mixed CPU/GPU)
# -ngl 80: all layers on GPU (fastest, requires enough VRAM)
```
*[Chart omitted: llama.cpp quantization type vs. quality, Llama 70B perplexity on WikiText-2 (lower is better).]*

Cross-Engine Performance Comparison
Comparing engines requires controlling for hardware, model, and workload. Here is a standardized comparison on Llama 70B with INT4 quantization.
Serving Engine Comparison: Llama 70B INT4, H100 80GB
| Metric | vLLM (AWQ+Marlin) | TRT-LLM (AWQ) | llama.cpp (Q4_K_M) | Notes |
|---|---|---|---|---|
| Decode throughput (batch=1) | 68 tok/s | 74 tok/s | 45 tok/s | TRT-LLM has most fused kernels |
| Decode throughput (batch=32) | 1850 tok/s | 2100 tok/s | N/A | llama.cpp lacks continuous batching |
| Prefill latency (512 tokens) | 28 ms | 22 ms | 42 ms | TRT-LLM aggressive kernel fusion |
| Time to first token (batch=1) | 32 ms | 26 ms | 48 ms | Includes preprocessing |
| Max concurrent requests | 128 | 64 (engine limit) | 4 (--parallel) | vLLM best multi-request |
| Setup complexity | Low (pip install) | High (build engine) | Low (compile binary) | TRT-LLM requires expertise |
| GPU memory efficiency | 90%+ | Fixed at build time | 85-90% | vLLM PagedAttention best |
Format Conversion Between Engines
Models quantized for one engine often need conversion for another.
GPTQ to AWQ (via re-quantization)
There is no direct GPTQ-to-AWQ conversion because the quantization algorithms produce different weight values. You must re-quantize from the original FP16 model.
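To see why the codes are not interconvertible, compare what the two algorithms actually store. AWQ scales salient channels by activation magnitude before rounding, so even at identical bit width and group size the stored INT4 codes differ from plain round-to-nearest (GPTQ additionally applies Hessian-based error compensation, which this toy sketch omits):

```python
def quantize_group(weights, pre_scale=None):
    """Asymmetric INT4 round-to-nearest over one weight group.
    pre_scale mimics AWQ's activation-aware per-channel scaling."""
    w = [x * s for x, s in zip(weights, pre_scale)] if pre_scale else list(weights)
    lo, hi = min(w), max(w)
    step = (hi - lo) / 15 or 1.0          # 16 levels for 4 bits
    return [round((x - lo) / step) for x in w]  # INT4 codes in 0..15

w = [0.31, -0.8, 1.2, 0.05, -0.44, 0.9, -1.1, 0.6]
act_scale = [1.0, 1.0, 2.0, 1.0, 1.0, 1.5, 1.0, 1.0]  # hypothetical |activation| stats

plain = quantize_group(w)            # plain round-to-nearest grid
awq = quantize_group(w, act_scale)   # AWQ-style grid
print(plain != awq)  # True -- the stored INT4 codes disagree
```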
HuggingFace GPTQ to vLLM
```python
# vLLM loads HuggingFace GPTQ models directly --
# no conversion needed, just point at the model directory
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-70B-GPTQ", quantization="gptq")
```
HuggingFace to GGUF
```python
# Use the conversion script from llama.cpp; it handles weight-name
# mapping and format conversion
import subprocess

subprocess.run([
    "python", "convert_hf_to_gguf.py",
    "/models/Llama-2-70b-hf",
    "--outfile", "llama-70b-f16.gguf",
    "--outtype", "f16",
])

# Then quantize to the desired GGUF format
subprocess.run([
    "./quantize",
    "llama-70b-f16.gguf",
    "llama-70b-q4_k_m.gguf",
    "Q4_K_M",
])
```
GPTQ/AWQ to TRT-LLM
```bash
# TRT-LLM has dedicated conversion scripts.
# From a HuggingFace GPTQ checkpoint:
# python convert_checkpoint.py --model_dir <gptq_model> --output_dir <trt_ckpt>

# The conversion extracts:
#   - packed INT4 weights (qweight)
#   - scales (scales)
#   - zero-points (qzeros)
#   - the group size from config.json
# and repacks them into TRT-LLM's internal layout
```
FP8 quantization is the easiest to deploy across engines because it does not require offline calibration datasets (for per-tensor quantization) and both vLLM and TRT-LLM support it natively on H100. The quality loss is minimal (typically less than 0.5% on benchmarks). If you have H100s, start with FP8 before trying INT4.
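The "no calibration dataset" claim for per-tensor weight FP8 comes down to a single number: the scale is the tensor's absolute maximum divided by the largest finite E4M3 value (448), computable from the weights alone. A pure-Python sketch of the idea (real kernels also round the scaled values onto the E4M3 grid; here we only compute and apply the scale):

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def fp8_per_tensor_scale(weights) -> float:
    """Scale so that weights / scale fills the E4M3 dynamic range."""
    return max(abs(w) for w in weights) / E4M3_MAX

w = [0.02, -3.1, 0.75, 1.9, -0.4]
scale = fp8_per_tensor_scale(w)
scaled = [x / scale for x in w]
print(round(max(abs(x) for x in scaled), 6))  # 448.0 -- exactly at the FP8 limit
```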
Production Configuration Recipes
vLLM Production Config (INT4 AWQ)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --speculative-model TheBloke/Llama-2-7B-AWQ \
  --num-speculative-tokens 5 \
  --swap-space 16 \
  --disable-log-requests \
  --uvicorn-log-level warning
```
TRT-LLM Production Config (FP8)
```bash
# Build the engine with production settings
trtllm-build \
  --checkpoint_dir /checkpoints/llama-70b-fp8 \
  --output_dir /engines/llama-70b-fp8 \
  --gemm_plugin fp8 \
  --max_batch_size 128 \
  --max_input_len 4096 \
  --max_output_len 4096 \
  --use_paged_context_fmha enable \
  --use_fp8_context_fmha enable \
  --workers 4 \
  --multiple_profiles enable \
  --reduce_fusion enable
```
llama.cpp Production Config (Q4_K_M)
```bash
./server \
  --model llama-70b-q4_k_m.gguf \
  --n-gpu-layers 80 \
  --ctx-size 8192 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --parallel 8 \
  --threads 8 \
  --flash-attn \
  --mlock \
  --cont-batching \
  --metrics \
  --host 0.0.0.0 --port 8080
```
Troubleshooting Common Issues
vLLM: “Marlin kernel not supported”
Marlin requires all of the following:

- SM >= 8.0 (A100 or newer)
- `group_size` in {64, 128}
- `desc_act = false` (no activation reordering)
- `bits = 4`

Check your model's `quantize_config.json` for `{"bits": 4, "group_size": 128, "desc_act": false}`. If `desc_act` is `true`, vLLM falls back to ExLlama v2 (slower); the fix is to re-quantize with `desc_act=false`.
TRT-LLM: Build OOM
The TRT-LLM build can require 2-3x the model size in CPU RAM; for a 70B model that is roughly 400 GB during the build.

- Solution 1: use `--workers` to parallelize and reduce per-worker memory
- Solution 2: build on a high-memory machine, then deploy the engine elsewhere
- Solution 3: use `--strongly_typed` to shrink the optimization search space
llama.cpp: GPU Offloading Fails
If `-ngl 80` causes OOM:

- Check actual VRAM usage with `nvidia-smi`
- Reduce offloaded layers: `-ngl 60` (keep some on CPU)
- Reduce context: `--ctx-size 2048`
- Use a smaller quantization: Q3_K_M instead of Q4_K_M

Memory estimation: Q4_K_M model size + KV cache + compute buffers. Llama 70B Q4_K_M is ~38 GB of weights plus ~4 GB of KV cache at ctx=4096, about 42 GB total. That fits on one 48 GB GPU (A6000, L40S) or one 80 GB GPU (A100, H100).
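The same arithmetic generalizes to a small helper. A sketch under the usual assumptions: FP16 KV cache and grouped-query attention, with Llama 2 70B's 80 layers and 8 KV heads of dimension 128 (measured totals run higher once compute buffers and per-slot overhead are added):

```python
GIB = 1024 ** 3

def model_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory for a quantized model."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: K and V per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / GIB

weights = model_gib(70e9, 4.5)          # Q4_K_M averages ~4.5 bits/weight
kv = kv_cache_gib(80, 8, 128, ctx=4096)
print(f"{weights:.1f} GiB weights, {kv:.2f} GiB KV cache")
```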
*[Chart omitted: time to deploy, in minutes, from FP16 checkpoint to serving the first request.]*

When to Use Which Engine
Engine Selection Guide
| Scenario | Recommended Engine | Quantization | Why |
|---|---|---|---|
| Multi-user API serving, datacenter GPU | vLLM | AWQ INT4 or FP8 | Best batching, PagedAttention, easy setup |
| Maximum single-stream throughput | TRT-LLM | FP8 | Most aggressive kernel fusion |
| Consumer GPU (RTX 4090) | llama.cpp | Q4_K_M | Best memory efficiency, no CUDA toolkit needed |
| CPU-only server | llama.cpp | Q4_K_M | Only engine with competitive CPU perf |
| Edge deployment (Jetson) | TRT-LLM | INT8 | TensorRT optimized for NVIDIA edge |
| Rapid prototyping | vLLM | FP8 online | No pre-quantization step needed |
| Apple Silicon Mac | llama.cpp | Q4_K_M | Metal GPU support built-in |
Summary
Serving quantized models requires matching the quantization format to the serving engine’s supported kernels. vLLM provides the easiest path with direct HuggingFace model loading, automatic Marlin kernel selection, and PagedAttention for efficient memory management. TRT-LLM delivers the highest raw throughput but requires an explicit engine build step and GPU-specific compilation. llama.cpp offers the broadest hardware support (NVIDIA, AMD, Apple, CPU) with its GGUF format and K-quant mixed-precision types. For datacenter deployments on NVIDIA hardware, start with vLLM + AWQ/FP8. Upgrade to TRT-LLM when you need every last percent of throughput and can invest in the build pipeline.