INT4 quantization stores each weight in 4 bits instead of 16 (FP16) or 32 (FP32). This 4x (or 8x) reduction in model size is what makes it possible to run a model like Llama 70B on a single GPU. But 4-bit values do not align to byte boundaries, and no mainstream hardware natively loads 4-bit integers from memory. Every INT4 inference engine must solve the packing problem: how to store two 4-bit values in a single byte, load them efficiently, unpack them into a compute-friendly format, apply dequantization (scale and zero-point), and feed the result to matrix multiplication, all without the unpacking overhead becoming the bottleneck.
This post covers the bit-level details of INT4 weight packing, the memory layouts used by GPTQ, AWQ, and Marlin, dequantization kernel implementations, and the performance characteristics of each approach.
INT4 Representation
A signed 4-bit integer represents values from -8 to +7. An unsigned 4-bit integer represents 0 to 15. Most INT4 quantization schemes use unsigned INT4 with a zero-point offset:
w = s · (q − z)

where q ∈ [0, 15] is the stored unsigned INT4 value, z is the zero-point, and s is the scale factor. The scale and zero-point are stored per group of weights (typically 128 weights share one scale/zero-point pair).
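As a concrete sketch, here is the quantize/dequantize round trip for one group in NumPy (function names are illustrative, not from any particular library):

```python
import numpy as np

np.random.seed(0)

def quantize_group(w, n_bits=4):
    """Asymmetric per-group quantization: w ~= s * (q - z)."""
    qmax = 2**n_bits - 1                     # 15 for INT4
    scale = (w.max() - w.min()) / qmax
    zero_point = np.round(-w.min() / scale)  # maps w.min() to q = 0
    q = np.clip(np.round(w / scale + zero_point), 0, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize_group(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

group = np.random.randn(128).astype(np.float32)   # one group of 128 weights
q, s, z = quantize_group(group)
recon = dequantize_group(q, s, z)
max_err = np.abs(group - recon).max()
assert max_err <= s / 2 + 1e-5    # error bounded by half a quantization step
```

The reconstruction error never exceeds half a quantization step, which is why group size matters: smaller groups mean a smaller scale per group and therefore a tighter error bound, at the cost of more scale/zero-point storage.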
// INT4 value ranges
// Unsigned INT4: [0, 15], stored as upper or lower nibble of a uint8
// Signed INT4: [-8, 7], stored as 2's complement in 4 bits
// Packing two INT4 values into one byte
uint8_t pack_uint4(uint8_t low, uint8_t high) {
// low goes in bits [3:0], high goes in bits [7:4]
return (high << 4) | (low & 0x0F);
}
// Unpacking
uint8_t unpack_low(uint8_t packed) {
return packed & 0x0F;
}
uint8_t unpack_high(uint8_t packed) {
return (packed >> 4) & 0x0F;
}
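The same nibble logic is easy to mirror in Python for quick sanity checks (names follow the C snippet above):

```python
def pack_uint4(low: int, high: int) -> int:
    # low goes in bits [3:0], high goes in bits [7:4]
    return ((high & 0x0F) << 4) | (low & 0x0F)

def unpack_low(packed: int) -> int:
    return packed & 0x0F

def unpack_high(packed: int) -> int:
    return (packed >> 4) & 0x0F

b = pack_uint4(3, 12)     # 0xC3 == 195
assert (unpack_low(b), unpack_high(b)) == (3, 12)
```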
Memory Layout Strategies
The choice of memory layout determines how efficiently the GPU can load and unpack INT4 values during matrix multiplication. There are three primary layouts.
Column-Major Packed (GPTQ Default)
GPTQ packs weights along the input dimension (K). For a weight matrix of shape [K, N] in INT4, the packed matrix has shape [K/2, N] in uint8, or equivalently [K/8, N] in uint32 (packing 8 INT4 values per 32-bit word).
Original FP16 weight matrix W[K=128, N=64]:
Row 0: w[0,0] w[0,1] ... w[0,63] (64 FP16 values = 128 bytes)
Row 1: w[1,0] w[1,1] ... w[1,63]
...
Packed INT4 matrix (GPTQ, column-major packing along K):
uint32 packed[K/8][N]:
packed[0][j] = pack(w[0,j], w[1,j], w[2,j], w[3,j],
w[4,j], w[5,j], w[6,j], w[7,j])
// 8 INT4 values from 8 consecutive rows in column j
This layout is natural for the GEMV (matrix-vector multiply) used in autoregressive decoding, where the activation vector is multiplied by each column of the weight matrix.
Row-Major Packed (AWQ)
AWQ packs along the output dimension (N). The packed matrix has shape [K, N/8] in uint32:
Packed INT4 matrix (AWQ, row-major packing along N):
uint32 packed[K][N/8]:
packed[i][0] = pack(w[i,0], w[i,1], w[i,2], w[i,3],
w[i,4], w[i,5], w[i,6], w[i,7])
// 8 INT4 values from 8 consecutive columns in row i
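A NumPy sketch of packing in the AWQ direction (this shows only the nibble packing along N; real AWQ kernels additionally interleave column order in an implementation-specific way):

```python
import numpy as np

def pack_along_n(weights_int4):
    """Pack [K, N] INT4 values (0-15) into [K, N//8] uint32 along N."""
    K, N = weights_int4.shape
    assert N % 8 == 0, "N must be divisible by 8"
    packed = np.zeros((K, N // 8), dtype=np.uint32)
    for i in range(8):
        # column j*8 + i lands in bits [4*i+3 : 4*i] of packed[:, j]
        packed |= (weights_int4[:, i::8].astype(np.uint32) & 0xF) << (i * 4)
    return packed

w = np.random.randint(0, 16, size=(16, 64), dtype=np.uint8)
p = pack_along_n(w)
assert p.shape == (16, 8)
# Round-trip check for the first word: nibble i of p[0, 0] is w[0, i]
assert all(((int(p[0, 0]) >> (i * 4)) & 0xF) == w[0, i] for i in range(8))
```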
Interleaved (Marlin)
The Marlin kernel uses a custom interleaved layout designed for optimal Tensor Core utilization. Weights are reordered to match the access pattern of the mma.sync instruction:
// Marlin interleaved layout for Tensor Core GEMM
// Weights are packed into 128-bit (16-byte) fragments that align
// with the Tensor Core's 16x8x16 matrix multiply shape.
//
// Each 128-bit fragment contains 32 INT4 values arranged as:
// Fragment[i] = { w[r0,c0..c3], w[r1,c0..c3], ..., w[r7,c0..c3] }
// where r0..r7 are 8 rows and c0..c3 are 4 columns,
// matching the Tensor Core's expected operand layout.
INT4 Weight Layout Comparison
| Layout | Packing Direction | Packed Shape (K=4096, N=4096) | Bytes | Best For | Used By |
|---|---|---|---|---|---|
| Column-major | Along K (rows) | [512, 4096] uint32 | 8 MB | GEMV (decode) | GPTQ, ExLlama |
| Row-major | Along N (cols) | [4096, 512] uint32 | 8 MB | GEMM (prefill) | AWQ |
| Interleaved | TC-aligned | [512, 1024] uint128 | 8 MB | Tensor Core GEMM | Marlin, CUTLASS |
| FP16 (reference) | N/A | [4096, 4096] fp16 | 32 MB | N/A | Baseline |
Bit Manipulation for Packing and Unpacking
The core operation in INT4 inference is unpacking: extracting two 4-bit values from a byte and converting them to a compute-friendly type (FP16 or BF16) for matrix multiplication.
Packing (Offline, During Quantization)
import numpy as np
def pack_int4_to_uint32(weights_int4):
"""Pack INT4 weights (0-15) into uint32, 8 values per uint32.
Args:
weights_int4: np.array of shape [K, N] with values in [0, 15]
Returns:
packed: np.array of shape [K//8, N] with dtype uint32
"""
K, N = weights_int4.shape
assert K % 8 == 0, "K must be divisible by 8"
packed = np.zeros((K // 8, N), dtype=np.uint32)
for i in range(8):
# Each INT4 value occupies 4 bits at position i*4
packed |= (weights_int4[i::8, :].astype(np.uint32) & 0xF) << (i * 4)
return packed
# Example: pack a 128x64 INT4 matrix
K, N = 128, 64
weights = np.random.randint(0, 16, size=(K, N), dtype=np.uint8)
packed = pack_int4_to_uint32(weights)
print(f"Original: {weights.shape} ({weights.nbytes} bytes)")
print(f"Packed: {packed.shape} ({packed.nbytes} bytes)")
# Original: (128, 64), 8192 bytes (one INT4 per uint8, half of every byte wasted)
# Packed:   (16, 64),  4096 bytes (8 INT4 values per uint32, no waste)
# 128 * 64 values at 0.5 bytes each = 4096 bytes, so the packing is tight
Unpacking (Online, During Inference)
The GPU must unpack INT4 values at full speed during matrix multiplication. The critical path is: load uint32, extract 8 INT4 values, convert to FP16, multiply by scale, subtract zero-point.
// CUDA kernel: unpack uint32 to 8 half-precision values
__device__ void unpack_uint4x8_to_half8(
uint32_t packed,
half* output, // Must have space for 8 half values
half scale,
half zero_point
) {
#pragma unroll
for (int i = 0; i < 8; i++) {
uint8_t val = (packed >> (i * 4)) & 0xF;
output[i] = __hmul(scale, __hsub(__uint2half_rn(val), zero_point));
}
}
// Optimized sketch: extract pairs of nibbles with masks instead of
// per-element shifts, then dequantize two values at a time with half2
__device__ void unpack_uint4x8_to_half8_fast(
uint32_t packed,
half2* output4, // 4 half2 values = 8 halfs
half scale,
half zero_point
) {
// Each mask pulls out two INT4 values at once:
uint32_t even = packed & 0x000F000F; // values 0 and 4 (bits [3:0] and [19:16])
uint32_t odd = (packed >> 4) & 0x000F000F; // values 1 and 5 (bits [7:4] and [23:20])
// __byte_perm (hardware PRMT) then repositions the nibble-bearing bytes
// so each uint32 can be reinterpreted directly as a half2 after the
// FP16 exponent-bias ("magic number") trick
half2 scale2 = __half2half2(scale);
half2 zp2 = __half2half2(zero_point);
// ... (full implementations finish with __hsub2/__hmul2 on half2 pairs
//      for 2x throughput; see the GPTQ and AWQ kernels below)
}
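A scalar NumPy reference for the unpack-and-dequantize step is useful for unit-testing kernels like the ones above (the name is illustrative):

```python
import numpy as np

def unpack_uint4x8_ref(packed_u32, scale, zero_point):
    """Reference: one uint32 -> 8 dequantized floats (nibble i = value i)."""
    vals = np.array([(packed_u32 >> (i * 4)) & 0xF for i in range(8)],
                    dtype=np.float32)
    return scale * (vals - zero_point)

out = unpack_uint4x8_ref(0x76543210, scale=0.5, zero_point=8.0)
# Nibble i of 0x76543210 holds the value i, so out[i] = 0.5 * (i - 8)
assert np.allclose(out, 0.5 * (np.arange(8) - 8.0))
```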
The __byte_perm Trick
NVIDIA GPUs have a hardware byte-permute instruction, PRMT, exposed via __byte_perm(). It treats its two 32-bit operands as a single 8-byte array and selects any 4 of those bytes in any order. This is heavily used in INT4 unpacking:
// __byte_perm(a, b, selector)
// Treats {b, a} as an 8-byte array and selects 4 bytes based on selector
// Each nibble in selector is an index (0-7) into the 8-byte array
// Example: extract the low nibble of each byte in a uint32
__device__ uint32_t extract_low_nibbles(uint32_t packed) {
// Mask to get low nibbles: 0x0F0F0F0F
return packed & 0x0F0F0F0F;
}
// Example: extract the high nibble of each byte, shifted down
__device__ uint32_t extract_high_nibbles(uint32_t packed) {
return (packed >> 4) & 0x0F0F0F0F;
}
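A quick Python check shows these masks operating on all four bytes of a word at once:

```python
packed = 0xABCD1234

low_nibbles = packed & 0x0F0F0F0F          # low nibble of each of the 4 bytes
high_nibbles = (packed >> 4) & 0x0F0F0F0F  # high nibble of each byte

assert low_nibbles == 0x0B0D0204
assert high_nibbles == 0x0A0C0103
```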
On NVIDIA GPUs from Volta onward, half2 operations process two FP16 values in a single instruction. The dequantization kernel must use half2 throughout: __hadd2, __hmul2, __hsub2. Using scalar half operations halves throughput. Every optimized INT4 kernel (GPTQ, AWQ, Marlin) processes pairs of values using half2.
Dequantization Kernel Implementations
GPTQ-Style Dequantization
GPTQ uses per-group asymmetric quantization with uint4 values, FP16 scales, and FP16 zero-points. Group size is typically 128.
// GPTQ dequantization during GEMV
// Weight matrix: packed uint32 [K/8, N]
// Scales: fp16 [K/group_size, N]
// Zeros: packed uint32 [K/group_size/8, N] (zero-points also in INT4)
__global__ void gptq_gemv_kernel(
const uint32_t* __restrict__ weight, // [K/8, N]
const half* __restrict__ scales, // [num_groups, N]
const uint32_t* __restrict__ zeros, // [num_groups/8, N]
const half* __restrict__ input, // [K]
half* __restrict__ output, // [N]
int K, int N, int group_size
) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (col >= N) return;
float acc = 0.0f;
for (int k_packed = 0; k_packed < K / 8; k_packed++) {
uint32_t packed_w = weight[k_packed * N + col];
int group_idx = (k_packed * 8) / group_size;
half scale = scales[group_idx * N + col];
// Unpack zero-point for this group
int zp_packed_idx = group_idx / 8;
int zp_nibble_idx = group_idx % 8;
uint32_t packed_zp = zeros[zp_packed_idx * N + col];
uint8_t zero_point = (packed_zp >> (zp_nibble_idx * 4)) & 0xF;
// Unpack 8 INT4 weights and accumulate
#pragma unroll
for (int i = 0; i < 8; i++) {
uint8_t w_int4 = (packed_w >> (i * 4)) & 0xF;
float w_dequant = __half2float(scale) *
(float(w_int4) - float(zero_point));
int k = k_packed * 8 + i;
acc += w_dequant * __half2float(input[k]);
}
}
output[col] = __float2half(acc);
}
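A NumPy reference that mirrors this kernel's layouts and indexing (the helper name is illustrative) is handy for unit-testing the CUDA version on small inputs:

```python
import numpy as np

def gptq_gemv_ref(weight_packed, scales, zeros_packed, x, K, N, group_size=128):
    """Reference GEMV matching the kernel's layouts:
    weight_packed [K//8, N] uint32, scales [num_groups, N] float,
    zeros_packed [ceil(num_groups/8), N] uint32, x [K] float."""
    out = np.zeros(N, dtype=np.float32)
    for k_packed in range(K // 8):
        g = (k_packed * 8) // group_size
        # Per-column zero-point for this group, unpacked from its nibble
        zp = (zeros_packed[g // 8] >> ((g % 8) * 4)) & 0xF
        for i in range(8):
            w = (weight_packed[k_packed] >> (i * 4)) & 0xF
            out += scales[g] * (w.astype(np.float32) - zp) * x[k_packed * 8 + i]
    return out

# Tiny smoke test: 8 weights of value 5, zero-point 3, scale 2, x = ones
wp = np.array([[0x55555555]], dtype=np.uint32)
sc = np.array([[2.0]], dtype=np.float32)
zq = np.array([[3]], dtype=np.uint32)
out = gptq_gemv_ref(wp, sc, zq, np.ones(8, dtype=np.float32),
                    K=8, N=1, group_size=8)
assert np.allclose(out, [2.0 * (5 - 3) * 8])   # = 32.0
```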
AWQ-Style Dequantization
AWQ uses per-group symmetric quantization and reorders weights to enable vectorized memory loads:
// AWQ processes 8 columns at once using uint32 packed along N
__global__ void awq_gemv_kernel(
const uint32_t* __restrict__ weight, // [K, N/8]
const half* __restrict__ scales, // [K/group_size, N]
const half* __restrict__ input, // [K]
half* __restrict__ output, // [N]
int K, int N, int group_size
) {
// Each thread processes 8 output columns
int col_group = blockIdx.x * blockDim.x + threadIdx.x;
if (col_group >= N / 8) return;
float acc[8] = {0.0f};
for (int k = 0; k < K; k++) {
uint32_t packed_w = weight[k * (N / 8) + col_group];
int group_idx = k / group_size;
float x = __half2float(input[k]);
#pragma unroll
for (int i = 0; i < 8; i++) {
uint8_t w_int4 = (packed_w >> (i * 4)) & 0xF;
int col = col_group * 8 + i;
float scale = __half2float(scales[group_idx * N + col]);
float w_dequant = scale * (float(w_int4) - 8.0f); // Symmetric
acc[i] += w_dequant * x;
}
}
for (int i = 0; i < 8; i++) {
output[col_group * 8 + i] = __float2half(acc[i]);
}
}
Marlin Kernel (State-of-the-Art)
The Marlin kernel achieves near-theoretical throughput by fusing dequantization with Tensor Core matrix multiply. Key techniques:
- Asynchronous memory loads: uses cp.async to overlap global memory loads with computation.
- Shared memory staging: loads packed INT4 data into shared memory, unpacks to registers, then feeds to Tensor Cores.
- Warp-level shuffles: uses __shfl_sync to redistribute unpacked values across threads in a warp.
// Marlin kernel structure (simplified)
// Full implementation: ~800 lines of CUDA with heavy use of PTX inline assembly
__global__ void marlin_gemm_kernel(
const uint32_t* __restrict__ A_packed, // INT4 weights, interleaved layout
const half* __restrict__ B, // FP16 activations
half* __restrict__ C, // Output
const half* __restrict__ scales,
int M, int N, int K
) {
// Tile sizes tuned for H100 Tensor Cores
constexpr int TILE_M = 16;
constexpr int TILE_N = 256;
constexpr int TILE_K = 64;
extern __shared__ char smem[];
half* smem_A = reinterpret_cast<half*>(smem); // Unpacked weights
half* smem_B = smem_A + TILE_M * TILE_K; // Activations
// Main loop over K dimension
for (int k_tile = 0; k_tile < K; k_tile += TILE_K) {
// Stage 1: Async load packed INT4 weights to shared memory
// Stage 2: Unpack INT4 -> FP16 in shared memory
// Stage 3: Load activation tile to shared memory
// Stage 4: Execute Tensor Core mma.sync
// (Stages are pipelined across iterations)
__syncthreads();
// Tensor Core matrix multiply on unpacked FP16 data
// Uses wmma::mma_sync or inline PTX mma instructions
}
// Write results to global memory
}
[Chart: INT4 GEMV throughput (tokens/sec), Llama 70B decode, batch=1, H100]

Group Quantization Memory Overhead
INT4 weights are not purely 4 bits per weight because each group requires a scale (FP16, 2 bytes) and optionally a zero-point (INT4, 0.5 bytes). The effective bits per weight is:
bits_eff = 4 + 16/g + z/g

where g is the group size and z is the zero-point storage in bits (0 for symmetric, 4 for asymmetric).
Effective Bits per Weight by Group Size
| Group Size | Scale Overhead (bits/weight) | Zero-Point Overhead | Effective Bits/Weight | Compression vs FP16 |
|---|---|---|---|---|
| 32 | 0.5 | 0.125 (asymmetric) | 4.625 | 3.46x |
| 64 | 0.25 | 0.0625 | 4.3125 | 3.71x |
| 128 | 0.125 | 0.03125 | 4.156 | 3.85x |
| 256 | 0.0625 | 0.015625 | 4.078 | 3.92x |
| Channel-wise | ~0.004 | ~0.001 | 4.005 | 3.99x |
def model_size_int4(num_params, group_size=128, asymmetric=True):
"""Calculate INT4 model size in bytes including scale/zero overhead."""
# Weight storage: 4 bits per param = 0.5 bytes per param
weight_bytes = num_params * 0.5
# Scale storage: 1 FP16 per group = 2 bytes per group
num_groups = num_params / group_size
scale_bytes = num_groups * 2
# Zero-point storage: 1 INT4 per group = 0.5 bytes per group
zero_bytes = num_groups * 0.5 if asymmetric else 0
total = weight_bytes + scale_bytes + zero_bytes
effective_bits = (total * 8) / num_params
compression = 16 / effective_bits # vs FP16
return {
'total_bytes': total,
'total_gb': total / (1024**3),
'effective_bits': effective_bits,
'compression_ratio': compression
}
# Llama 70B: ~70 billion parameters
result = model_size_int4(70e9, group_size=128)
print(f"Llama 70B INT4 (g=128): {result['total_gb']:.1f} GB, "
f"{result['effective_bits']:.2f} bits/param, "
f"{result['compression_ratio']:.2f}x compression")
# Llama 70B INT4 (g=128): 33.9 GB, 4.16 bits/param, 3.85x compression
Memory Access Patterns and Coalescing
INT4 packed data has unique coalescing requirements. A warp of 32 threads performing a 128-byte coalesced load fetches 32 uint32 values = 256 INT4 values. That is 4x more weights per load than FP16 (the same 128 bytes hold only 64 FP16 values), which is the fundamental source of INT4's memory bandwidth advantage.
// Coalesced load: 32 threads each load 4 bytes (uint32) = 128 bytes
// This fetches 256 INT4 weights in a single memory transaction
// Non-coalesced access pattern (bad):
// Thread i loads weight[i * stride] where stride > 1
// This generates multiple memory transactions
// Coalesced access pattern (good):
// Thread i loads packed_weight[base + i]
// All 32 threads access consecutive uint32 addresses
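The arithmetic behind that claim, spelled out:

```python
# One coalesced transaction: 32 threads x 4 bytes (uint32) = 128 bytes
bytes_per_txn = 32 * 4

int4_per_txn = bytes_per_txn * 2    # two INT4 values per byte
fp16_per_txn = bytes_per_txn // 2   # two bytes per FP16 value

assert int4_per_txn == 256
assert fp16_per_txn == 64
assert int4_per_txn // fp16_per_txn == 4   # the 4x bandwidth advantage
```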
The memory bandwidth utilization for INT4 vs FP16:

[Chart: effective memory bandwidth utilization (GB/s), GEMV, H100]

Marlin achieves higher bandwidth utilization because its interleaved layout eliminates shared memory bank conflicts during the unpack stage, and its async pipeline keeps the memory subsystem continuously fed.
Dequantization Fusion Strategies
Standalone dequantization (unpack, scale, convert) followed by a separate GEMM is slow because it requires writing the full FP16 matrix to memory and reading it back. All production INT4 kernels fuse dequantization with the matrix multiply.
Fusion Approaches
Register-level fusion: Dequantize in registers immediately before the Tensor Core instruction. The FP16 values never touch shared or global memory.
// Register-level dequant + Tensor Core fusion (pseudocode)
// Load packed INT4 from shared memory to registers
uint32_t packed = smem_weights[lane_id];
// Dequantize to FP16 in registers
half2 w01 = dequant_pair(packed, 0, scale, zero);
half2 w23 = dequant_pair(packed, 1, scale, zero);
half2 w45 = dequant_pair(packed, 2, scale, zero);
half2 w67 = dequant_pair(packed, 3, scale, zero);
// Feed directly to Tensor Core mma (inline PTX)
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
"{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};"
: "=f"(c0), "=f"(c1), "=f"(c2), "=f"(c3)
: "r"(*(uint32_t*)&w01), "r"(*(uint32_t*)&w23),
"r"(*(uint32_t*)&w45), "r"(*(uint32_t*)&w67),
"r"(act_frag0), "r"(act_frag1),
"f"(c0), "f"(c1), "f"(c2), "f"(c3)
);
Shared memory fusion: Dequantize while loading from global to shared memory. The shared memory contains FP16 values ready for Tensor Core consumption. This uses 2x more shared memory than keeping values packed.
Mixed approach: Load packed INT4 to shared memory (saving bandwidth and shared memory), dequantize to registers before Tensor Core execution.
Dequantization Fusion Impact (4096x4096 INT4 GEMM, H100)
| Approach | Time (us) | TFLOPS (effective) | Memory Traffic | Limitation |
|---|---|---|---|---|
| Separate dequant + cuBLAS FP16 GEMM | 284 | 47 | 2x (dequant output + GEMM input) | Memory bound on dequant write |
| Shared memory fusion | 142 | 94 | 1x (packed load only) | 2x shared memory usage |
| Register fusion (Marlin) | 98 | 136 | 1x (packed load only) | Complex register management |
| FP16 cuBLAS (reference) | 88 | 152 | 1x | No quantization overhead |
INT4 on Tensor Cores
Native 4-bit Tensor Core support has come and gone. Turing and Ampere exposed INT4 mma.sync instructions with .s4/.u4 type qualifiers, but Hopper (H100) dropped them in favor of FP8: on Hopper, INT4 weights must be upcast to FP16 (or FP8) before Tensor Core execution. Blackwell (B200) reintroduces 4-bit Tensor Cores in floating-point form (FP4) rather than INT4, so dequantization remains part of the INT4 pipeline on current hardware.
// Ampere/Ada: native INT4 Tensor Core multiply is available
// Hopper and newer: INT4 mma was removed; dequantize first
#if __CUDA_ARCH__ >= 800 && __CUDA_ARCH__ < 900
// Direct INT4 x INT4 Tensor Core multiply, INT32 accumulation
asm volatile(
"mma.sync.aligned.m16n8k64.row.col.s32.s4.s4.s32 "
"{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};"
: ...
);
#else // Hopper and newer (or pre-Ampere): dequantize to FP16 first
// Unpack INT4 -> FP16, then use FP16 Tensor Core
half2 weights_fp16 = dequant_int4_to_fp16(packed_int4);
// ... FP16 mma.sync
#endif
[Chart: INT4 Tensor Core throughput (TOPS) by architecture]

Practical Packing Implementation
A complete packing pipeline for deploying a quantized model:
import torch
def pack_model_weights_int4(
state_dict,
scales_dict,
zeros_dict,
group_size=128
):
"""Pack quantized INT4 weights for GPU inference.
Args:
state_dict: dict of {layer_name: int4_weights} (values 0-15, shape [K, N])
scales_dict: dict of {layer_name: fp16_scales} (shape [K//group_size, N])
zeros_dict: dict of {layer_name: int4_zeros} (values 0-15, shape [K//group_size, N])
Returns:
packed_dict: dict ready for GPU kernel consumption
"""
packed_dict = {}
for name, weights in state_dict.items():
K, N = weights.shape
assert K % 8 == 0
# Pack 8 INT4 values into each uint32
weights_uint8 = weights.to(torch.uint8)
packed = torch.zeros(K // 8, N, dtype=torch.int32, device=weights.device)
for i in range(8):
packed |= (weights_uint8[i::8, :].to(torch.int32) & 0xF) << (i * 4)
packed_dict[f"{name}.qweight"] = packed
packed_dict[f"{name}.scales"] = scales_dict[name].to(torch.float16)
# Pack zero-points similarly
zp = zeros_dict[name]
num_groups = zp.shape[0]
if num_groups % 8 != 0:
# Pad to multiple of 8
pad_rows = 8 - (num_groups % 8)
zp = torch.cat([zp, torch.zeros(pad_rows, N, dtype=zp.dtype,
device=zp.device)])
num_groups = zp.shape[0]
packed_zp = torch.zeros(num_groups // 8, N, dtype=torch.int32,
device=zp.device)
for i in range(8):
packed_zp |= (zp[i::8, :].to(torch.int32) & 0xF) << (i * 4)
packed_dict[f"{name}.qzeros"] = packed_zp
return packed_dict
# Verification: unpack and check round-trip
def verify_packing(original, packed, K, N):
"""Verify that packing is lossless."""
for k_packed in range(K // 8):
for n in range(N):
val = packed[k_packed, n].item()
for i in range(8):
unpacked = (val >> (i * 4)) & 0xF
expected = original[k_packed * 8 + i, n].item()
assert unpacked == expected, (
f"Mismatch at [{k_packed*8+i}, {n}]: "
f"got {unpacked}, expected {expected}"
)
print("Packing verification passed!")
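The triple loop above works but is slow for large tensors; a vectorized NumPy check of the same packing scheme (illustrative) validates the whole matrix at once:

```python
import numpy as np

def unpack_uint32_to_int4(packed):
    """Inverse of the packing loop: [K//8, N] uint32 -> [K, N] uint8."""
    Kp, N = packed.shape
    out = np.zeros((Kp * 8, N), dtype=np.uint8)
    for i in range(8):
        # Nibble i of each uint32 holds row k_packed*8 + i
        out[i::8, :] = (packed >> np.uint32(i * 4)) & np.uint32(0xF)
    return out

# Round trip against the same nibble layout used in this post
w = np.random.randint(0, 16, size=(128, 64), dtype=np.uint8)
packed = np.zeros((16, 64), dtype=np.uint32)
for i in range(8):
    packed |= (w[i::8, :].astype(np.uint32) & 0xF) << (i * 4)

assert np.array_equal(unpack_uint32_to_int4(packed), w)
```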
INT4 packing is sensitive to byte order. NVIDIA GPUs are little-endian, so the least significant nibble of a uint32 corresponds to the first INT4 value. If you pack on a big-endian CPU (rare but possible in some ARM configurations), you must byte-swap before uploading to the GPU. All major frameworks (PyTorch, TensorFlow) handle this automatically, but custom packing code must be careful.
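A small NumPy demonstration of the nibble order on a little-endian host:

```python
import numpy as np

# Pack values 0..7 into one uint32: value i occupies bits [4*i+3 : 4*i]
word = np.uint32(0)
for i in range(8):
    word = word | (np.uint32(i) << np.uint32(i * 4))

assert int(word) == 0x76543210
raw = word.tobytes()   # native byte order of the host
# On a little-endian host (x86, most ARM), value 0 sits in the low
# nibble of the first byte: raw == b'\x10\x32\x54\x76'
```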
Performance Analysis
The end-to-end performance of INT4 inference depends on the balance between reduced memory traffic (4x less data) and dequantization overhead.
For decode (batch=1, GEMV): INT4 is strictly better because GEMV is entirely memory-bandwidth bound. The 4x reduction in weight reads directly translates to ~3.5x speedup (accounting for scale/zero overhead).
For prefill (large batch, GEMM): INT4 may be slower than FP16 for large batch sizes where the GEMM becomes compute-bound. The dequantization overhead reduces effective FLOPS.
INT4 vs FP16 Performance Across Batch Sizes (Llama 70B, H100)
| Batch Size | FP16 (ms) | INT4 Marlin (ms) | INT4 Speedup | Bottleneck |
|---|---|---|---|---|
| 1 | 38.2 | 11.1 | 3.44x | Memory BW (both) |
| 4 | 39.5 | 12.8 | 3.09x | Memory BW (both) |
| 16 | 45.1 | 18.2 | 2.48x | Mixed |
| 64 | 82.3 | 52.1 | 1.58x | Compute (INT4 dequant overhead) |
| 128 | 148.6 | 108.4 | 1.37x | Compute (both) |
| 256 | 285.2 | 242.8 | 1.17x | Compute (both, INT4 overhead visible) |
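The trend in this table is consistent with a simple roofline sketch. The H100 figures below are rough published peaks, not measurements, and dequantization overhead is ignored:

```python
# Approximate H100 SXM peaks (assumptions, not measurements)
BW = 3.35e12       # HBM3 bandwidth, bytes/s
FLOPS = 990e12     # dense FP16 Tensor Core peak, FLOP/s

def bound_ms(batch, bytes_per_weight, params=70e9):
    """Lower bound per forward pass: max of weight-read and compute time."""
    mem_s = params * bytes_per_weight / BW      # weight reads dominate traffic
    compute_s = 2 * params * batch / FLOPS      # 2 FLOPs per param per token
    return max(mem_s, compute_s) * 1e3

# INT4 at g=128 asymmetric is ~4.16 bits ~= 0.52 bytes/weight
speedup_b1 = bound_ms(1, 2.0) / bound_ms(1, 0.52)      # memory-bound regime
speedup_b256 = bound_ms(256, 2.0) / bound_ms(256, 0.52) # compute-bound regime
print(f"batch=1: {speedup_b1:.2f}x, batch=256: {speedup_b256:.2f}x")
```

The estimate reproduces the shape of the measured numbers: a near-4x advantage when decode is memory-bound, shrinking toward parity as the GEMM becomes compute-bound at large batch.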
Summary
INT4 weight packing is the foundation of quantized LLM serving. Two INT4 values share a single byte, requiring bit manipulation for packing (offline) and unpacking (every inference step). The memory layout β column-major (GPTQ), row-major (AWQ), or interleaved (Marlin) β determines how efficiently the GPU loads packed data. Dequantization must be fused with matrix multiplication to avoid writing intermediate FP16 values to memory. The Marlin kernel achieves 87-89% of FP16 cuBLAS throughput while loading 4x less data, making INT4 the dominant format for single-GPU LLM serving. Blackwellβs native INT4 Tensor Core support will eliminate the dequantization overhead entirely, closing the remaining gap to theoretical peak throughput.