Quantizing Llama 70B from FP16 to INT4 cuts memory from 140 GB to 35 GB and doubles throughput — but only if you use the right kernel. A GPTQ-INT4 weight matrix needs a Marlin kernel with async weight decompression. An AWQ-INT4 matrix needs a different kernel with group-wise dequantization. An FP8 matrix can run on Tensor Cores directly. Pick the wrong kernel and your INT4 model runs slower than FP16. vLLM v1 abstracts this complexity behind a unified QuantizedLinear interface that detects the quantization method at model load time and dispatches to specialized kernels underneath. This post traces the complete pipeline from config.json detection through weight loading, kernel selection, and fused dequantization.
Quantization Method Detection
When vLLM loads a model, it inspects the quantization_config field in the HuggingFace config.json. The detection path is straightforward:
# vllm/config.py - simplified
class ModelConfig:
def _get_quantization(self) -> Optional[str]:
        quant_cfg = getattr(self.hf_config, "quantization_config", None)
if quant_cfg is None:
return None
quant_method = quant_cfg.get("quant_method", "")
if quant_method in ("gptq", "awq", "fp8", "squeezellm", "marlin"):
return quant_method
# Check for bits field as fallback
bits = quant_cfg.get("bits", None)
if bits is not None:
return f"unknown_{bits}bit"
return None
The quant_method string is the primary dispatch key. Here is how each method stores its configuration:
// GPTQ config.json excerpt
{
"quantization_config": {
"quant_method": "gptq",
"bits": 4,
"group_size": 128,
"desc_act": false,
"sym": true
}
}
// AWQ config.json excerpt
{
"quantization_config": {
"quant_method": "awq",
"bits": 4,
"group_size": 128,
"zero_point": true,
"version": "gemm"
}
}
// FP8 config.json excerpt
{
"quantization_config": {
"quant_method": "fp8",
"activation_scheme": "dynamic"
}
}
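As a sanity check, the detection logic can be exercised against these excerpts with a standalone sketch (detect_quant_method is an illustrative name, not vLLM's API):

```python
def detect_quant_method(quant_cfg):
    """Return the dispatch key for a quantization_config dict, or None."""
    if quant_cfg is None:
        return None
    quant_method = quant_cfg.get("quant_method", "")
    if quant_method in ("gptq", "awq", "fp8", "squeezellm", "marlin"):
        return quant_method
    # Fall back to the bits field for unrecognized methods
    bits = quant_cfg.get("bits")
    if bits is not None:
        return f"unknown_{bits}bit"
    return None

gptq_cfg = {"quant_method": "gptq", "bits": 4, "group_size": 128,
            "desc_act": False, "sym": True}
fp8_cfg = {"quant_method": "fp8", "activation_scheme": "dynamic"}

print(detect_quant_method(gptq_cfg))     # gptq
print(detect_quant_method(fp8_cfg))      # fp8
print(detect_quant_method({"bits": 3}))  # unknown_3bit
```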
The desc_act field in GPTQ determines whether activations are reordered by descending activation magnitude before quantization. When desc_act=true, an additional permutation index must be applied during inference, which prevents Marlin kernel usage and forces the slower GPTQ CUDA kernel path.
Weight Packing Formats
Each quantization method packs weights differently in memory. Understanding the bit layout is essential for debugging kernel mismatches.
GPTQ INT4 Packing
GPTQ packs 8 INT4 values into a single int32. The layout for a weight matrix of shape [out_features, in_features] becomes [out_features, in_features // 8] with dtype int32:
def pack_gptq_int4(weight_int8: torch.Tensor) -> torch.Tensor:
    """Pack already-quantized unsigned INT4 values (stored as int8) into int32."""
    assert weight_int8.dtype == torch.int8  # already mapped to the [0, 15] range
    out_features, in_features = weight_int8.shape
    assert in_features % 8 == 0
    packed = torch.zeros(
        out_features, in_features // 8,
        dtype=torch.int32, device=weight_int8.device
    )
    for i in range(8):
        # Element i of each group of 8 columns lands in nibble i of the int32
        packed |= (weight_int8[:, i::8].to(torch.int32) & 0xF) << (4 * i)
    return packed
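The nibble layout is easiest to verify torch-free on plain ints. The sketch below packs one group of 8 values in the same ascending nibble order and unpacks it again (pack8/unpack8 are illustrative helpers, not part of vLLM):

```python
def pack8(vals):
    """Pack 8 unsigned 4-bit values into one 32-bit int, nibble i = element i."""
    assert len(vals) == 8 and all(0 <= v <= 15 for v in vals)
    packed = 0
    for i, v in enumerate(vals):
        packed |= (v & 0xF) << (4 * i)
    return packed & 0xFFFFFFFF

def unpack8(packed):
    """Recover the 8 unsigned 4-bit values from a packed 32-bit int."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 4]
assert unpack8(pack8(vals)) == vals  # lossless roundtrip for values in [0, 15]
```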
The corresponding scales and zeros tensors have shape [out_features, in_features // group_size]. For a 7B model linear layer with out_features=4096, in_features=4096, group_size=128:
- FP16 weight: 33.55 MB
- INT4 packed weight: 8.39 MB
- Scales (FP16): 0.26 MB
- Zeros (INT32): 0.52 MB
- Total: 9.17 MB (3.66x compression)
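These figures follow directly from the tensor shapes (decimal MB, matching the table below):

```python
out_f, in_f, group_size = 4096, 4096, 128
fp16_mb   = out_f * in_f * 2 / 1e6                   # 2 bytes per FP16 weight
packed_mb = out_f * (in_f // 8) * 4 / 1e6            # 8 INT4 values per int32
scales_mb = out_f * (in_f // group_size) * 2 / 1e6   # FP16 scale per group
zeros_mb  = out_f * (in_f // group_size) * 4 / 1e6   # INT32 zero per group
total_mb  = packed_mb + scales_mb + zeros_mb
print(round(fp16_mb, 2), round(packed_mb, 2), round(scales_mb, 2),
      round(zeros_mb, 2), round(fp16_mb / total_mb, 2))
# 33.55 8.39 0.26 0.52 3.66
```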
AWQ INT4 Packing
AWQ uses a different packing order. Where GPTQ places the 8 values of a group into nibbles in ascending order [0,1,2,...,7], AWQ writes them in the interleaved order [0,2,4,6,1,3,5,7] within each int32:
AWQ_PACK_ORDER = (0, 2, 4, 6, 1, 3, 5, 7)

def pack_awq_int4(weight_int8: torch.Tensor) -> torch.Tensor:
    """AWQ packs each group of 8 INT4 values in interleaved nibble order."""
    out_features, in_features = weight_int8.shape
    assert in_features % 8 == 0
    packed = torch.zeros(
        out_features, in_features // 8,
        dtype=torch.int32, device=weight_int8.device
    )
    for nibble, i in enumerate(AWQ_PACK_ORDER):
        # Element i of each group of 8 lands at nibble position `nibble`
        packed |= (weight_int8[:, i::8].to(torch.int32) & 0xF) << (4 * nibble)
    return packed
FP8 Format
FP8 (E4M3) weights are stored as torch.float8_e4m3fn tensors with per-tensor or per-channel scale factors. No packing is needed since each element occupies exactly one byte:
# FP8 weight loading via safetensors
from safetensors import safe_open

with safe_open(path, framework="pt") as f:
    weight_fp8 = f.get_tensor("model.layers.0.mlp.gate_proj.weight")
    # Shape: [intermediate_size, hidden_size], dtype: float8_e4m3fn
    scale = f.get_tensor("model.layers.0.mlp.gate_proj.weight_scale")
    # Shape: scalar or [intermediate_size], dtype: float32
Weight Memory per Linear Layer (4096x4096)
| Format | Weight Size (MB) | Metadata Size (MB) | Total (MB) | Compression vs FP16 |
|---|---|---|---|---|
| FP16 | 33.55 | 0.00 | 33.55 | 1.00x |
| FP8 E4M3 | 16.78 | 0.01 | 16.79 | 2.00x |
| GPTQ INT4 g128 | 8.39 | 0.78 | 9.17 | 3.66x |
| AWQ INT4 g128 | 8.39 | 0.52 | 8.91 | 3.77x |
| GPTQ INT4 g32 | 8.39 | 3.15 | 11.54 | 2.91x |
Kernel Selection Logic
vLLM v1 maintains a registry of quantized GEMM kernels. The selection follows a priority order based on hardware capability and quantization parameters.
# vllm/model_executor/layers/quantization/__init__.py - simplified
QUANTIZATION_METHODS = {
"gptq": GPTQLinearMethod,
"gptq_marlin": GPTQMarlinLinearMethod,
"awq": AWQLinearMethod,
"awq_marlin": AWQMarlinLinearMethod,
"fp8": Fp8LinearMethod,
}
def get_quantization_method(quant_config: dict,
capability: int) -> LinearMethodBase:
method = quant_config["quant_method"]
bits = quant_config.get("bits", 8)
desc_act = quant_config.get("desc_act", False)
group_size = quant_config.get("group_size", -1)
# Marlin kernel check: requires SM80+, 4-bit, no desc_act,
# group_size in {-1, 32, 64, 128}
marlin_compatible = (
capability >= 80 and
bits == 4 and
not desc_act and
group_size in (-1, 32, 64, 128)
)
if method == "gptq" and marlin_compatible:
return GPTQMarlinLinearMethod(quant_config)
elif method == "awq" and marlin_compatible:
return AWQMarlinLinearMethod(quant_config)
elif method == "fp8" and capability >= 89:
return Fp8LinearMethod(quant_config) # Native FP8 on Ada/Hopper
elif method == "fp8":
return Fp8CastedLinearMethod(quant_config) # Emulated FP8
else:
return QUANTIZATION_METHODS[method](quant_config)
The critical decision point is Marlin eligibility. The Marlin kernel is a highly optimized INT4 GEMM written in CUDA that achieves near-FP16 throughput on Ampere and newer GPUs. It requires:
- Compute capability >= 80 (Ampere: A100, A10G, etc.)
- 4-bit quantization (not 3-bit or 8-bit)
- desc_act=false (no activation reordering)
- group_size in {-1, 32, 64, 128} (channelwise or standard groups)
If you load a GPTQ model with desc_act=true, vLLM falls back to the generic GPTQ CUDA kernel, which can be 2-3x slower than Marlin. Always prefer desc_act=false models when targeting vLLM deployment.
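The four conditions reduce to a simple predicate that is worth running against a model's quantization_config before deployment (marlin_compatible is an illustrative helper mirroring the selection code above, not vLLM's API):

```python
def marlin_compatible(capability, bits, desc_act, group_size):
    """True when a GPTQ/AWQ config can take the Marlin fast path."""
    return (capability >= 80
            and bits == 4
            and not desc_act
            and group_size in (-1, 32, 64, 128))

assert marlin_compatible(80, 4, False, 128)      # A100, standard GPTQ
assert not marlin_compatible(80, 4, True, 128)   # desc_act forces fallback
assert not marlin_compatible(75, 4, False, 128)  # Turing: no Marlin
assert not marlin_compatible(80, 3, False, 128)  # 3-bit unsupported
```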
The Marlin Kernel Deep Dive
Marlin (Mixed Auto-Regressive LINear) is the workhorse kernel for INT4 inference in vLLM. It achieves high utilization through several techniques.
Weight Repacking for Marlin
Marlin expects weights in a specific layout optimized for its tile structure. When vLLM detects a Marlin-eligible model, it repacks the GPTQ/AWQ weights at load time:
class GPTQMarlinLinearMethod(LinearMethodBase):
def create_weights(self, input_size: int, output_size: int,
params_dtype: torch.dtype):
# Marlin uses a different packing layout
# Original GPTQ: [out, in // 8] int32
# Marlin: [in // 16, out * 16 // 8] int32 (transposed + retiled)
qweight = torch.empty(
input_size // 16, output_size * 16 // 8,
dtype=torch.int32
)
scales = torch.empty(
input_size // self.group_size, output_size,
dtype=params_dtype
)
return {"qweight": qweight, "scales": scales}
def apply(self, x: torch.Tensor, qweight: torch.Tensor,
scales: torch.Tensor) -> torch.Tensor:
# Single fused kernel: dequant + GEMM
return marlin_gemm(x, qweight, scales, self.workspace)
The repacking transforms the weight from GPTQ’s row-major INT4 layout into Marlin’s tile-optimized layout. This is a one-time cost at model load, typically taking 2-5 seconds for a 70B model.
Marlin Tile Structure
Marlin uses 16x16 tiles matched to the Tensor Core MMA (Matrix Multiply-Accumulate) instruction size. Each warp processes a 16x16 output tile, with 4 warps cooperating on a 64x16 thread-block tile:
Thread Block (128 threads = 4 warps):
Warp 0: rows [0:16], cols [0:16]
Warp 1: rows [16:32], cols [0:16]
Warp 2: rows [32:48], cols [0:16]
Warp 3: rows [48:64], cols [0:16]
Each warp executes: mma.sync.aligned.m16n8k16.f16
- Loads 16 INT4 values (8 bytes) per thread for B matrix
- Dequantizes to FP16 in registers
- Executes FP16 Tensor Core MMA
- Accumulates in FP32
The dequantization happens in registers, fused with the GEMM. There is no separate dequantization pass that writes FP16 weights to global memory. This fusion is the primary source of Marlin’s performance advantage.
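Numerically, the in-register dequantization is just (q - zero) * scale applied to each nibble of a packed word. A scalar sketch (dequant8 is an illustrative helper, not the CUDA code):

```python
def dequant8(packed, zero, scale):
    """Dequantize 8 packed unsigned INT4 values sharing one zero and scale."""
    return [(((packed >> (4 * i)) & 0xF) - zero) * scale for i in range(8)]

# Pack the quantized values 0..7 into one int32, nibble i = value i
packed = 0
for i, q in enumerate([0, 1, 2, 3, 4, 5, 6, 7]):
    packed |= q << (4 * i)

print(dequant8(packed, zero=8, scale=0.5))
# [-4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5]
```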
Pipeline Structure
Marlin uses a double-buffered pipeline to overlap global memory loads with computation:
// Marlin kernel pseudocode (simplified)
__global__ void marlin_gemm_kernel(
const int32_t* __restrict__ B, // packed INT4 weights
const half* __restrict__ A, // FP16 activations
const half* __restrict__ scales, // per-group scales
half* __restrict__ C, // output
int M, int N, int K, int group_size
) {
// Shared memory double buffer
__shared__ half smem_A[2][TILE_M][TILE_K];
__shared__ int32_t smem_B[2][TILE_K / 8][TILE_N];
int stage = 0;
// Prologue: load first tile
load_A_tile(smem_A[0], A, /*...*/);
load_B_tile(smem_B[0], B, /*...*/);
__syncthreads();
for (int k = 0; k < K; k += TILE_K) {
int next = 1 - stage;
// Async load next tile while computing current
if (k + TILE_K < K) {
load_A_tile(smem_A[next], A, /*...*/);
load_B_tile(smem_B[next], B, /*...*/);
}
// Dequantize B from INT4 to FP16 in registers
half frag_B[FRAG_N];
dequant_int4_to_fp16(smem_B[stage], scales, frag_B);
// Tensor Core MMA
mma_sync(frag_C, frag_A, frag_B);
stage = next;
__syncthreads();
}
// Epilogue: store results
store_C_tile(C, frag_C);
}
FP8 Kernel Path
FP8 inference on Hopper (SM90) and Ada Lovelace (SM89) GPUs uses native FP8 Tensor Core instructions. The kernel path is simpler because no bit unpacking is needed.
class Fp8LinearMethod(LinearMethodBase):
def __init__(self, quant_config: dict):
self.activation_scheme = quant_config.get(
"activation_scheme", "dynamic"
)
def apply(self, x: torch.Tensor, weight: torch.Tensor,
weight_scale: torch.Tensor,
input_scale: Optional[torch.Tensor] = None) -> torch.Tensor:
if self.activation_scheme == "dynamic":
# Compute per-tensor scale for activations on the fly
input_scale = x.abs().max() / 448.0 # E4M3 max value
x_fp8 = (x / input_scale).to(torch.float8_e4m3fn)
else:
# Static scale from calibration
x_fp8 = (x / input_scale).to(torch.float8_e4m3fn)
# FP8 GEMM using cuBLAS or cutlass
output = torch._scaled_mm(
x_fp8, weight.t(),
scale_a=input_scale, scale_b=weight_scale,
out_dtype=torch.float16
)
return output
The dynamic activation scheme computes the scaling factor at runtime by finding the absolute maximum of the input tensor. This adds a reduction operation but avoids the need for calibration data. The static scheme uses pre-computed scales from a calibration pass.
Dynamic FP8 quantization adds approximately 15-20 microseconds per linear layer for the abs().max() reduction on typical batch sizes. For a 70B model with 560 linear layers, this totals roughly 8-11 ms per forward pass. Static quantization eliminates this overhead entirely.
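The total follows from the layer count: a 70B Llama has 80 transformer blocks, each with 7 linear layers (q, k, v, o projections plus gate, up, down):

```python
layers = 80 * 7          # 80 blocks, 7 linear layers per block
per_layer_us = (15, 20)  # measured reduction overhead range from above
total_ms = [layers * us / 1000 for us in per_layer_us]
print(layers, total_ms)  # 560 [8.4, 11.2]
```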
FP8 on Pre-Ada Hardware
On Ampere GPUs (A100), FP8 Tensor Cores are not available. vLLM falls back to casting FP8 weights to FP16 at runtime:
class Fp8CastedLinearMethod(LinearMethodBase):
def apply(self, x: torch.Tensor, weight: torch.Tensor,
weight_scale: torch.Tensor) -> torch.Tensor:
# Dequantize weight to FP16
weight_fp16 = weight.to(torch.float16) * weight_scale
return torch.nn.functional.linear(x, weight_fp16)
This loses the compute advantage of FP8 but retains the 2x memory reduction. The dequantization adds overhead, making it generally slower than running with FP16 weights directly for compute-bound shapes.
GPTQ Fallback Kernel
When Marlin is not available (desc_act=true, non-standard group sizes, or pre-Ampere hardware), vLLM uses the ExLlama v2 GPTQ kernel:
class GPTQLinearMethod(LinearMethodBase):
def apply(self, x: torch.Tensor, qweight: torch.Tensor,
qzeros: torch.Tensor, scales: torch.Tensor,
g_idx: Optional[torch.Tensor] = None) -> torch.Tensor:
if g_idx is not None:
# desc_act=true: reorder input by activation magnitude index
x = x[:, g_idx]
output = exllama_v2_gemm(x, qweight, qzeros, scales, self.bits)
return output
The g_idx tensor is the activation reordering permutation. When present, it permutes the columns of the input activation matrix before the GEMM, and the weight matrix was quantized in this permuted order. This reordering prevents the use of Marlin’s fixed tile layout.
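A scalar sketch shows why the permuted GEMM stays correct: if the stored weights were quantized in g_idx order, permuting the activations by the same g_idx preserves the dot product:

```python
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0, 0.25, 2.0]      # one output row, original column order
g_idx = [2, 0, 3, 1]            # activation reordering permutation

w_perm = [w[j] for j in g_idx]  # weights quantized/stored in permuted order
x_perm = [x[j] for j in g_idx]  # the x = x[:, g_idx] step from apply() above

ref = sum(xi * wi for xi, wi in zip(x, w))
perm = sum(xi * wi for xi, wi in zip(x_perm, w_perm))
assert ref == perm  # same permutation on both operands: identical result
```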
Kernel Dispatch at Runtime
During inference, each LinearMethodBase.apply() call dispatches to the appropriate CUDA kernel. The dispatch is effectively resolved at model load time and stored as a method pointer:
class QuantizedLinearLayer(torch.nn.Module):
def __init__(self, linear_method: LinearMethodBase,
input_size: int, output_size: int):
super().__init__()
self.linear_method = linear_method
self.params = linear_method.create_weights(input_size, output_size)
        for name, param in self.params.items():
            # Integer-packed weight tensors cannot carry gradients
            self.register_parameter(
                name, torch.nn.Parameter(param, requires_grad=False))
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.linear_method.apply(x, **self.params)
There is no runtime branching on quantization method during forward passes. The linear_method object is bound at initialization and its apply method is a direct function call.
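The binding reduces to a plain strategy object; a minimal torch-free sketch with illustrative class names:

```python
class MarlinMethod:
    def apply(self, x):
        return f"marlin_gemm({x})"

class ExLlamaMethod:
    def apply(self, x):
        return f"exllama_gemm({x})"

class Layer:
    def __init__(self, method):
        self.linear_method = method  # chosen once at model load

    def forward(self, x):
        # Direct attribute call: no dispatch table, no per-call branching
        return self.linear_method.apply(x)

print(Layer(MarlinMethod()).forward("x"))  # marlin_gemm(x)
```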
Mixed Quantization Support
vLLM v1 supports mixed quantization where different layers use different methods. This is useful for models where attention layers are kept at higher precision while MLP layers are aggressively quantized:
# Example: mixed quantization config
mixed_config = {
"default": {"method": "gptq", "bits": 4, "group_size": 128},
"overrides": {
"model.layers.*.self_attn.*": {"method": "fp8"},
"model.layers.*.mlp.*": {"method": "gptq", "bits": 4}
}
}
Each layer gets its own LinearMethodBase instance based on the override pattern. The model loader applies glob matching on parameter names to determine which quantization method to use.
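This resolution can be sketched with Python's stdlib fnmatch; the resolve helper is illustrative, not vLLM's loader code:

```python
from fnmatch import fnmatch

mixed_config = {
    "default": {"method": "gptq", "bits": 4, "group_size": 128},
    "overrides": {
        "model.layers.*.self_attn.*": {"method": "fp8"},
        "model.layers.*.mlp.*": {"method": "gptq", "bits": 4},
    },
}

def resolve(param_name, cfg):
    """Return the first matching override, else the default config."""
    for pattern, override in cfg["overrides"].items():
        if fnmatch(param_name, pattern):
            return override
    return cfg["default"]

print(resolve("model.layers.0.self_attn.q_proj", mixed_config)["method"])  # fp8
print(resolve("model.embed_tokens", mixed_config)["method"])               # gptq
```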
Throughput Benchmarks
The following benchmarks compare kernel performance on an A100-80GB with Llama 70B at various batch sizes.
Llama 70B Decoding Throughput (tokens/sec) — A100-80GB
| Method | Batch 1 | Batch 8 | Batch 32 | Batch 128 | Batch 256 |
|---|---|---|---|---|---|
| FP16 | 38 | 295 | 1,140 | 3,820 | 5,210 |
| FP8 Dynamic | 36 | 280 | 1,180 | 4,450 | 7,680 |
| FP8 Static | 38 | 290 | 1,210 | 4,580 | 7,920 |
| GPTQ-INT4 Marlin | 42 | 330 | 1,320 | 4,890 | 8,150 |
| AWQ-INT4 Marlin | 41 | 325 | 1,310 | 4,860 | 8,090 |
| GPTQ-INT4 ExLlama | 35 | 260 | 980 | 3,410 | 5,560 |
Key observations:
- At batch size 1, the gap between methods is small (about 10%) because kernel launch overhead and non-GEMM operations mask much of the INT4 bandwidth advantage.
- At batch size 128+, INT4 Marlin kernels outperform FP16 by 1.3-1.6x due to reduced memory bandwidth requirements.
- The ExLlama GPTQ kernel trails Marlin at every batch size, by about 17% at batch 1 and 26-32% at batch 32 and above, making Marlin compatibility essential for throughput-critical deployments.
- FP8 static quantization is within 3% of FP8 dynamic while avoiding the per-layer reduction overhead.
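The multipliers can be re-derived from the batch-256 column of the table:

```python
tput = {"fp16": 5210, "gptq_marlin": 8150, "gptq_exllama": 5560}
print(round(tput["gptq_marlin"] / tput["fp16"], 2))          # 1.56 (Marlin vs FP16)
print(round(tput["gptq_exllama"] / tput["gptq_marlin"], 2))  # 0.68 (ExLlama fraction of Marlin)
```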
Prefill vs Decode Kernel Behavior
Quantized kernels behave differently during prefill (processing the input prompt) and decode (generating tokens one at a time).
During prefill, the activation matrix x has shape [seq_len, hidden_size] where seq_len can be thousands. This is a large GEMM that is compute-bound and benefits from Tensor Core utilization:
# Prefill: large M dimension
# x shape: [2048, 4096] @ W shape: [4096, 11008] -> [2048, 11008]
# Marlin kernel uses full tile occupancy
# FP8 kernel uses cuBLAS GEMM with high efficiency
During decode, x has shape [batch_size, hidden_size] where batch_size is typically 1-256. Small batch GEMMs are memory-bandwidth-bound:
# Decode: small M dimension
# x shape: [1, 4096] @ W shape: [4096, 11008] -> [1, 11008]
# Memory-bandwidth bound: loading weights dominates
# INT4 loads 4x fewer bytes than FP16 -> up to 4x speedup potential
The memory bandwidth advantage of quantized weights is most pronounced during decode. For a single-token decode of the full 70B model on an A100 (about 2 TB/s HBM bandwidth), streaming the FP16 weights (roughly 140 GB) takes about 70 ms, while the INT4 weights plus scales and zeros (roughly 37 GB) take about 18.5 ms.
The resulting 3.8x ratio approaches the theoretical 4x bandwidth reduction for INT4; the gap is the metadata overhead.
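The arithmetic behind that ratio, assuming roughly 140 GB of FP16 weights and roughly 37 GB for INT4 plus metadata:

```python
hbm_gb_per_s = 2000       # A100 HBM2e, about 2 TB/s
fp16_gb, int4_gb = 140, 37
fp16_ms = fp16_gb / hbm_gb_per_s * 1000
int4_ms = int4_gb / hbm_gb_per_s * 1000
print(round(fp16_ms, 1), round(int4_ms, 1), round(fp16_ms / int4_ms, 1))
# 70.0 18.5 3.8
```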
Practical Configuration Guide
When deploying quantized models with vLLM, these configuration decisions matter:
# Maximum throughput: GPTQ-INT4 with Marlin on Ampere+
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-70B-GPTQ \
--quantization gptq \
--dtype float16 \
--max-model-len 4096 \
--tensor-parallel-size 4
# Minimum quality loss: FP8 static on Hopper
python -m vllm.entrypoints.openai.api_server \
--model neuralmagic/Llama-2-70B-FP8 \
--quantization fp8 \
--dtype float16 \
--max-model-len 4096 \
--tensor-parallel-size 2
# Force specific kernel backend
export VLLM_GPTQ_KERNEL=marlin # or exllama
For production deployments targeting both quality and throughput, FP8 static quantization on H100 GPUs provides the best trade-off. Quality degradation is typically less than 0.1% on standard benchmarks while throughput nearly doubles versus FP16. INT4 methods (GPTQ/AWQ) offer higher throughput but with measurable quality impact on reasoning tasks.
Debugging Quantization Issues
Common issues and their diagnostic commands:
# Check which kernel vLLM selected
import vllm
from vllm import LLM
llm = LLM(model="TheBloke/Llama-2-70B-GPTQ")
model = llm.llm_engine.model_executor.driver_worker.model_runner.model
for name, module in model.named_modules():
if hasattr(module, 'linear_method'):
print(f"{name}: {type(module.linear_method).__name__}")
break
# Expected output for Marlin-eligible model:
# model.layers.0.self_attn.qkv_proj: GPTQMarlinLinearMethod
# Check weight shapes after repacking
for name, param in model.named_parameters():
if "qweight" in name:
print(f"{name}: {param.shape} {param.dtype}")
break
If you see GPTQLinearMethod instead of GPTQMarlinLinearMethod, the model is using the slower ExLlama path. Check:
- GPU compute capability: torch.cuda.get_device_capability() must return at least (8, 0)
- Model config: desc_act must be false
- Group size: must be in {-1, 32, 64, 128}
Model Load Time by Quantization Method — Llama 70B, 4x A100
| Method | Weight Load (s) | Repack Time (s) | Total Init (s) |
|---|---|---|---|
| FP16 | 45 | 0 | 45 |
| FP8 | 24 | 0 | 24 |
| GPTQ Marlin | 12 | 3.2 | 15.2 |
| GPTQ ExLlama | 12 | 0 | 12 |
| AWQ Marlin | 12 | 2.8 | 14.8 |
The Marlin repack step adds 2-4 seconds to model initialization but is amortized over the lifetime of the serving process. For long-running inference servers, this is negligible.
Summary
vLLM v1’s quantized inference pipeline resolves the method-kernel mapping at model load time, avoiding any runtime branching. GPTQ and AWQ INT4 models are routed through Marlin when hardware and configuration allow, achieving near-FP16 Tensor Core utilization with 4x less memory bandwidth. FP8 models use native Tensor Core instructions on Hopper/Ada and fall back to cast-based dequantization on Ampere. The kernel selection hierarchy is: Marlin (preferred) then ExLlama (GPTQ fallback) then cuBLAS FP8 then cast-to-FP16 (last resort). Always verify your deployment hits the Marlin or native FP8 path for optimal throughput.