Part of series: Quantization Masterclass (26 of 30)
INT4 Weight Packing: Bit Manipulation, Dequantization Kernels, and Memory Layout

INT4 quantization stores each weight in 4 bits instead of 16 (FP16) or 32 (FP32). This 4x or 8x reduction in model size is why INT4 models can run Llama 70B on a single GPU. But 4-bit values do not align to byte boundaries, and no hardware natively loads 4-bit integers from memory. Every INT4 inference engine must solve the packing problem: how to store two 4-bit values in a single byte, load them efficiently, unpack them into a compute-friendly format, apply dequantization (scale and zero-point), and feed the result to matrix multiplication — all without the unpacking overhead becoming the bottleneck.

This post covers the bit-level details of INT4 weight packing, the memory layouts used by GPTQ, AWQ, and Marlin, dequantization kernel implementations, and the performance characteristics of each approach.

INT4 Representation

A signed 4-bit integer represents values from -8 to +7. An unsigned 4-bit integer represents 0 to 15. Most INT4 quantization schemes use unsigned INT4 with a zero-point offset:

$$w_{\text{dequant}} = (w_{\text{int4}} - z) \times s$$

where $w_{\text{int4}} \in \{0, 1, \ldots, 15\}$, $z$ is the zero-point, and $s$ is the scale factor. The scale and zero-point are stored per group of weights (typically 128 weights share one scale/zero-point pair).
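As a concrete reference, the formula can be sketched in a few lines of NumPy (a minimal sketch, not any particular library's API; `dequantize_int4` is an illustrative name):

```python
import numpy as np

def dequantize_int4(w_int4, scales, zeros, group_size=128):
    """Reference dequantization: w = (w_int4 - z) * s, grouped along K.

    w_int4: [K, N] uint8 with values in [0, 15]
    scales: [K // group_size, N] float
    zeros:  [K // group_size, N] with values in [0, 15]
    """
    # Broadcast each group's scale/zero-point over its group_size rows
    s = np.repeat(scales, group_size, axis=0)
    z = np.repeat(zeros, group_size, axis=0)
    return (w_int4.astype(np.float32) - z) * s

# One group of 4 weights sharing scale=0.5, zero-point=8
w = np.array([[0], [8], [15], [7]], dtype=np.uint8)
out = dequantize_int4(w, np.array([[0.5]]), np.array([[8]]), group_size=4)
# (0-8)*0.5 = -4.0, (8-8)*0.5 = 0.0, (15-8)*0.5 = 3.5, (7-8)*0.5 = -0.5
```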

// INT4 value ranges
// Unsigned INT4: [0, 15], stored as upper or lower nibble of a uint8
// Signed INT4:   [-8, 7], stored as 2's complement in 4 bits

// Packing two INT4 values into one byte
uint8_t pack_uint4(uint8_t low, uint8_t high) {
    // low goes in bits [3:0], high goes in bits [7:4]
    return (high << 4) | (low & 0x0F);
}

// Unpacking
uint8_t unpack_low(uint8_t packed) {
    return packed & 0x0F;
}

uint8_t unpack_high(uint8_t packed) {
    return (packed >> 4) & 0x0F;
}

Memory Layout Strategies

The choice of memory layout determines how efficiently the GPU can load and unpack INT4 values during matrix multiplication. There are three primary layouts.

Column-Major Packed (GPTQ Default)

GPTQ packs weights along the input dimension (K). For a weight matrix $W$ of shape $[K, N]$ in INT4, the packed matrix has shape $[K/2, N]$ in uint8, or equivalently $[K/8, N]$ in uint32 (packing 8 INT4 values per 32-bit word).

Original FP16 weight matrix W[K=128, N=64]:
  Row 0: w[0,0] w[0,1] ... w[0,63]   (64 FP16 values = 128 bytes)
  Row 1: w[1,0] w[1,1] ... w[1,63]
  ...

Packed INT4 matrix (GPTQ, column-major packing along K):
  uint32 packed[K/8][N]:
  packed[0][j] = pack(w[0,j], w[1,j], w[2,j], w[3,j],
                       w[4,j], w[5,j], w[6,j], w[7,j])
  // 8 INT4 values from 8 consecutive rows in column j

This layout is natural for the GEMV (matrix-vector multiply) used in autoregressive decoding, where the activation vector is multiplied by each column of the weight matrix.

Row-Major Packed (AWQ)

AWQ packs along the output dimension (N). The packed matrix has shape $[K, N/8]$ in uint32:

Packed INT4 matrix (AWQ, row-major packing along N):
  uint32 packed[K][N/8]:
  packed[i][0] = pack(w[i,0], w[i,1], w[i,2], w[i,3],
                       w[i,4], w[i,5], w[i,6], w[i,7])
  // 8 INT4 values from 8 consecutive columns in row i
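A NumPy sketch of this N-direction packing (the nibble order here is the simple sequential one from the diagram; production AWQ kernels additionally apply a column interleave on top of this):

```python
import numpy as np

def pack_int4_along_n(weights_int4):
    """Pack INT4 values (0-15) along N: [K, N] -> [K, N//8] uint32."""
    K, N = weights_int4.shape
    assert N % 8 == 0, "N must be divisible by 8"
    packed = np.zeros((K, N // 8), dtype=np.uint32)
    for i in range(8):
        # Column j*8 + i lands in nibble i of packed[:, j]
        packed |= (weights_int4[:, i::8].astype(np.uint32) & 0xF) << (i * 4)
    return packed

w = np.arange(8, dtype=np.uint8).reshape(1, 8)  # values 0..7 in one row
p = pack_int4_along_n(w)
# nibble i holds value i -> p[0, 0] == 0x76543210
```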

Interleaved (Marlin)

The Marlin kernel uses a custom interleaved layout designed for optimal Tensor Core utilization. Weights are reordered to match the access pattern of the mma.sync instruction:

// Marlin interleaved layout for Tensor Core GEMM
// Weights are packed into 128-bit (16-byte) fragments that align
// with the Tensor Core's 16x8x16 matrix multiply shape.
//
// Each 128-bit fragment contains 32 INT4 values arranged as:
// Fragment[i] = { w[r0,c0..c3], w[r1,c0..c3], ..., w[r7,c0..c3] }
// where r0..r7 are 8 rows and c0..c3 are 4 columns,
// matching the Tensor Core's expected operand layout.

INT4 Weight Layout Comparison

| Layout | Packing Direction | Packed Shape (K=4096, N=4096) | Bytes | Best For | Used By |
|---|---|---|---|---|---|
| Column-major | Along K (rows) | [512, 4096] uint32 | 8 MB | GEMV (decode) | GPTQ, ExLlama |
| Row-major | Along N (cols) | [4096, 512] uint32 | 8 MB | GEMM (prefill) | AWQ |
| Interleaved | TC-aligned | [512, 512] uint128 | 8 MB | Tensor Core GEMM | Marlin, CUTLASS |
| FP16 (reference) | N/A | [4096, 4096] fp16 | 32 MB | N/A | Baseline |

Note: All INT4 layouts store the same 4096x4096 matrix in 8 MB (4x compression from FP16). The difference is access pattern efficiency.
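The byte counts in the table follow directly from two weights per byte; a quick arithmetic check:

```python
K, N = 4096, 4096
int4_bytes = K * N // 2    # two INT4 weights per byte, identical for all layouts
fp16_bytes = K * N * 2     # FP16 reference

assert int4_bytes == 8 * 1024 * 1024    # 8 MB
assert fp16_bytes == 32 * 1024 * 1024   # 32 MB
print(fp16_bytes // int4_bytes)  # prints 4
```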

Bit Manipulation for Packing and Unpacking

The core operation in INT4 inference is unpacking: extracting two 4-bit values from a byte and converting them to a compute-friendly type (FP16 or BF16) for matrix multiplication.

Packing (Offline, During Quantization)

import numpy as np

def pack_int4_to_uint32(weights_int4):
    """Pack INT4 weights (0-15) into uint32, 8 values per uint32.

    Args:
        weights_int4: np.array of shape [K, N] with values in [0, 15]
    Returns:
        packed: np.array of shape [K//8, N] with dtype uint32
    """
    K, N = weights_int4.shape
    assert K % 8 == 0, "K must be divisible by 8"
    packed = np.zeros((K // 8, N), dtype=np.uint32)

    for i in range(8):
        # Each INT4 value occupies 4 bits at position i*4
        packed |= (weights_int4[i::8, :].astype(np.uint32) & 0xF) << (i * 4)

    return packed

# Example: pack a 128x64 INT4 matrix
K, N = 128, 64
weights = np.random.randint(0, 16, size=(K, N), dtype=np.uint8)
packed = pack_int4_to_uint32(weights)
print(f"Original: {weights.shape} ({weights.nbytes} bytes)")
print(f"Packed:   {packed.shape} ({packed.nbytes} bytes)")
# Original: (128, 64) (8192 bytes) -- one uint8 per INT4 value, wasteful
# Packed:   (16, 64) (4096 bytes)  -- 8 INT4 values per uint32
# True INT4 storage: 128 * 64 * 0.5 bytes = 4096 bytes
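For completeness, the inverse operation with a round-trip check (the block re-packs inline so it runs standalone; `unpack_uint32_to_int4` is an illustrative name, not a library function):

```python
import numpy as np

def unpack_uint32_to_int4(packed):
    """Inverse of the packer above: [K//8, N] uint32 -> [K, N] uint8 in [0, 15]."""
    K8, N = packed.shape
    out = np.zeros((K8 * 8, N), dtype=np.uint8)
    for i in range(8):
        # Nibble i of each uint32 came from row k_packed*8 + i
        out[i::8, :] = ((packed >> (i * 4)) & 0xF).astype(np.uint8)
    return out

# Round-trip check against a freshly packed random matrix
rng = np.random.default_rng(0)
w = rng.integers(0, 16, size=(128, 64), dtype=np.uint8)
packed = np.zeros((128 // 8, 64), dtype=np.uint32)
for i in range(8):
    packed |= (w[i::8, :].astype(np.uint32) & 0xF) << (i * 4)
assert np.array_equal(unpack_uint32_to_int4(packed), w)  # lossless
```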

Unpacking (Online, During Inference)

The GPU must unpack INT4 values at full speed during matrix multiplication. The critical path is: load uint32, extract 8 INT4 values, convert to FP16, subtract the zero-point, and multiply by the scale.

// CUDA kernel: unpack uint32 to 8 half-precision values
__device__ void unpack_uint4x8_to_half8(
    uint32_t packed,
    half* output,  // Must have space for 8 half values
    half scale,
    half zero_point
) {
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        uint8_t val = (packed >> (i * 4)) & 0xF;
        output[i] = __hmul(scale, __hsub(__uint2half_rn(val), zero_point));
    }
}

// Optimized sketch: avoid per-element shifts by processing nibbles in pairs
__device__ void unpack_uint4x8_to_half8_fast(
    uint32_t packed,
    half2* output4,  // 4 half2 values = 8 halfs
    half scale,
    half zero_point
) {
    // Masks that pick out two nibbles at a time
    uint32_t lo_mask = 0x000F000F;  // bits [3:0] and [19:16] -> values 0 and 4
    uint32_t hi_mask = 0x00F000F0;  // bits [7:4] and [23:20] -> values 1 and 5

    // Values 0 and 4, isolated in the low halves of two 16-bit lanes
    uint32_t v04 = packed & lo_mask;

    // Use __byte_perm (PRMT) to spread bytes into 16-bit lanes:
    // selector 0x4140 -> {byte0, 0x00, byte1, 0x00}
    uint32_t b01 = __byte_perm(packed, 0, 0x4140);  // bytes 0-1 widened
    uint32_t b23 = __byte_perm(packed, 0, 0x4342);  // bytes 2-3 widened

    // Convert and apply scale/zero-point
    half2 scale2 = __half2half2(scale);
    half2 zp2 = __half2half2(zero_point);

    // ... (full implementation uses half2 arithmetic for 2x throughput)
}

The __byte_perm Trick

NVIDIA GPUs have a hardware instruction PRMT (permute bytes) exposed via __byte_perm() that selects arbitrary bytes from a pair of 32-bit registers treated as one 8-byte array. This is heavily used in INT4 unpacking:

// __byte_perm(a, b, selector)
// Treats {b, a} as an 8-byte array and selects 4 bytes based on selector
// Each nibble in selector is an index (0-7) into the 8-byte array

// Example: extract the low nibble of each byte in a uint32
__device__ uint32_t extract_low_nibbles(uint32_t packed) {
    // Mask to get low nibbles: 0x0F0F0F0F
    return packed & 0x0F0F0F0F;
}

// Example: extract the high nibble of each byte, shifted down
__device__ uint32_t extract_high_nibbles(uint32_t packed) {
    return (packed >> 4) & 0x0F0F0F0F;
}
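The same mask-and-shift logic can be checked with plain integer arithmetic on the host (no GPU intrinsics involved):

```python
packed = 0xA3B4C5D6  # four bytes: D6, C5, B4, A3 (least significant first)

low_nibbles = packed & 0x0F0F0F0F
high_nibbles = (packed >> 4) & 0x0F0F0F0F

assert low_nibbles == 0x03040506   # low nibble of each byte
assert high_nibbles == 0x0A0B0C0D  # high nibble of each byte, shifted down
```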
⚡ Half2 Arithmetic Is Mandatory

On NVIDIA GPUs from Volta onward, half2 operations process two FP16 values in a single instruction. The dequantization kernel must use half2 throughout: __hadd2, __hmul2, __hsub2. Using scalar half operations halves throughput. Every optimized INT4 kernel (GPTQ, AWQ, Marlin) processes pairs of values using half2.

Dequantization Kernel Implementations

GPTQ-Style Dequantization

GPTQ uses per-group asymmetric quantization with uint4 values, FP16 scales, and FP16 zero-points. Group size is typically 128.

// GPTQ dequantization during GEMV
// Weight matrix: packed uint32 [K/8, N]
// Scales: fp16 [K/group_size, N]
// Zeros: packed uint32 [K/group_size/8, N] (zero-points also in INT4)
__global__ void gptq_gemv_kernel(
    const uint32_t* __restrict__ weight,    // [K/8, N]
    const half* __restrict__ scales,         // [num_groups, N]
    const uint32_t* __restrict__ zeros,      // [num_groups/8, N]
    const half* __restrict__ input,          // [K]
    half* __restrict__ output,               // [N]
    int K, int N, int group_size
) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= N) return;

    float acc = 0.0f;

    for (int k_packed = 0; k_packed < K / 8; k_packed++) {
        uint32_t packed_w = weight[k_packed * N + col];
        int group_idx = (k_packed * 8) / group_size;
        half scale = scales[group_idx * N + col];

        // Unpack zero-point for this group
        int zp_packed_idx = group_idx / 8;
        int zp_nibble_idx = group_idx % 8;
        uint32_t packed_zp = zeros[zp_packed_idx * N + col];
        uint8_t zero_point = (packed_zp >> (zp_nibble_idx * 4)) & 0xF;

        // Unpack 8 INT4 weights and accumulate
        #pragma unroll
        for (int i = 0; i < 8; i++) {
            uint8_t w_int4 = (packed_w >> (i * 4)) & 0xF;
            float w_dequant = __half2float(scale) *
                              (float(w_int4) - float(zero_point));
            int k = k_packed * 8 + i;
            acc += w_dequant * __half2float(input[k]);
        }
    }

    output[col] = __float2half(acc);
}
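When debugging a kernel like this, a NumPy reference that mirrors the same indexing is invaluable. A sketch under the layout assumptions above (`gptq_gemv_reference` is an illustrative name):

```python
import numpy as np

def gptq_gemv_reference(qweight, scales, qzeros, x, group_size=128):
    """Dequantize-then-multiply reference matching the kernel's indexing.

    qweight: [K//8, N] uint32, 8 INT4 weights per word along K
    scales:  [num_groups, N] float
    qzeros:  [ceil(num_groups/8), N] uint32, 8 INT4 zero-points per word
    x:       [K] float activation vector
    """
    K = qweight.shape[0] * 8
    N = qweight.shape[1]
    out = np.zeros(N, dtype=np.float32)
    for k in range(K):
        w_int4 = (qweight[k // 8] >> ((k % 8) * 4)) & 0xF   # [N]
        g = k // group_size
        zp = (qzeros[g // 8] >> ((g % 8) * 4)) & 0xF        # [N]
        w = scales[g] * (w_int4.astype(np.float32) - zp)
        out += w * x[k]
    return out

# Tiny sanity check: all weights 9, zero-point 8, scale 2.0, x = ones
# -> each output = 8 * 2.0 * (9 - 8) = 16
qw = np.full((1, 2), 0x99999999, dtype=np.uint32)
out = gptq_gemv_reference(qw, np.array([[2.0, 2.0]]),
                          np.array([[8, 8]], dtype=np.uint32),
                          np.ones(8), group_size=8)
```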

AWQ-Style Dequantization

AWQ uses per-group symmetric quantization and reorders weights to enable vectorized memory loads:

// AWQ processes 8 columns at once using uint32 packed along N
__global__ void awq_gemv_kernel(
    const uint32_t* __restrict__ weight,    // [K, N/8]
    const half* __restrict__ scales,         // [K/group_size, N]
    const half* __restrict__ input,          // [K]
    half* __restrict__ output,               // [N]
    int K, int N, int group_size
) {
    // Each thread processes 8 output columns
    int col_group = blockIdx.x * blockDim.x + threadIdx.x;
    if (col_group >= N / 8) return;

    float acc[8] = {0.0f};

    for (int k = 0; k < K; k++) {
        uint32_t packed_w = weight[k * (N / 8) + col_group];
        int group_idx = k / group_size;
        float x = __half2float(input[k]);

        #pragma unroll
        for (int i = 0; i < 8; i++) {
            uint8_t w_int4 = (packed_w >> (i * 4)) & 0xF;
            int col = col_group * 8 + i;
            float scale = __half2float(scales[group_idx * N + col]);
            float w_dequant = scale * (float(w_int4) - 8.0f);  // Symmetric
            acc[i] += w_dequant * x;
        }
    }

    for (int i = 0; i < 8; i++) {
        output[col_group * 8 + i] = __float2half(acc[i]);
    }
}
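The same cross-check idea works for the symmetric scheme (implicit zero-point of 8). This sketch assumes the simple sequential nibble order shown earlier; real AWQ kernels additionally interleave columns:

```python
import numpy as np

def awq_gemv_reference(qweight, scales, x, group_size=128):
    """Symmetric dequantize-then-multiply reference, N-direction packing.

    qweight: [K, N//8] uint32, 8 INT4 weights per word along N
    scales:  [K // group_size, N] float
    x:       [K] float activation vector
    """
    K, N8 = qweight.shape
    out = np.zeros(N8 * 8, dtype=np.float32)
    for k in range(K):
        for i in range(8):
            w_int4 = (qweight[k] >> (i * 4)) & 0xF     # [N//8]
            cols = np.arange(N8) * 8 + i                # columns fed by nibble i
            w = scales[k // group_size, cols] * (w_int4.astype(np.float32) - 8.0)
            out[cols] += w * x[k]
    return out

# Sanity check: nibble i holds value i, unit scales, x = [1.0]
# -> output column i = (i - 8)
qw = np.array([[0x76543210]], dtype=np.uint32)
out = awq_gemv_reference(qw, np.ones((1, 8)), np.array([1.0]), group_size=1)
```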

Marlin Kernel (State-of-the-Art)

The Marlin kernel achieves near-theoretical throughput by fusing dequantization with Tensor Core matrix multiply. Key techniques:

  1. Asynchronous memory loads: Uses cp.async to overlap global memory loads with computation.
  2. Shared memory staging: Loads packed INT4 data into shared memory, unpacks to registers, then feeds to Tensor Cores.
  3. Warp-level shuffles: Uses __shfl_sync to redistribute unpacked values across threads in a warp.
// Marlin kernel structure (simplified)
// Full implementation: ~800 lines of CUDA with heavy use of PTX inline assembly
__global__ void marlin_gemm_kernel(
    const uint32_t* __restrict__ A_packed,  // INT4 weights, interleaved layout
    const half* __restrict__ B,              // FP16 activations
    half* __restrict__ C,                    // Output
    const half* __restrict__ scales,
    int M, int N, int K
) {
    // Tile sizes tuned for H100 Tensor Cores
    constexpr int TILE_M = 16;
    constexpr int TILE_N = 256;
    constexpr int TILE_K = 64;

    extern __shared__ char smem[];
    half* smem_A = reinterpret_cast<half*>(smem);           // Unpacked weights
    half* smem_B = smem_A + TILE_M * TILE_K;                // Activations

    // Main loop over K dimension
    for (int k_tile = 0; k_tile < K; k_tile += TILE_K) {
        // Stage 1: Async load packed INT4 weights to shared memory
        // Stage 2: Unpack INT4 -> FP16 in shared memory
        // Stage 3: Load activation tile to shared memory
        // Stage 4: Execute Tensor Core mma.sync
        // (Stages are pipelined across iterations)
        __syncthreads();

        // Tensor Core matrix multiply on unpacked FP16 data
        // Uses wmma::mma_sync or inline PTX mma instructions
    }

    // Write results to global memory
}

INT4 GEMV Throughput (Llama 70B Decode, Batch=1, H100)

| Kernel | Tokens/sec |
|---|---|
| FP16 baseline (needs 2 GPUs) | 24 |
| GPTQ (ExLlama v2) | 42 |
| AWQ (AutoAWQ) | 45 |
| Marlin (FP16 act x INT4 wt) | 68 (1.5-1.6x faster) |

Group Quantization Memory Overhead

INT4 weights are not purely 4 bits per weight because each group requires a scale (FP16, 2 bytes) and optionally a zero-point (INT4, 0.5 bytes). The effective bits per weight is:

$$\text{Effective bits} = 4 + \frac{16 + z_{\text{bits}}}{g}$$

where $g$ is the group size and $z_{\text{bits}}$ is the zero-point storage cost (0 bits for symmetric, 4 for asymmetric).
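A direct transcription of the formula reproduces the table values below:

```python
def effective_bits(group_size, asymmetric=True):
    """Effective bits per weight: 4 + (16 + z_bits) / g."""
    z_bits = 4 if asymmetric else 0
    return 4 + (16 + z_bits) / group_size

for g in (32, 64, 128, 256):
    print(g, effective_bits(g), 16 / effective_bits(g))
# g=128 gives 4.15625 bits/weight -> 16 / 4.15625 = 3.85x compression
```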


Effective Bits per Weight by Group Size

| Group Size | Scale Overhead (bits/weight) | Zero-Point Overhead | Effective Bits/Weight | Compression vs FP16 |
|---|---|---|---|---|
| 32 | 0.5 | 0.125 (asymmetric) | 4.625 | 3.46x |
| 64 | 0.25 | 0.0625 | 4.3125 | 3.71x |
| 128 | 0.125 | 0.03125 | 4.156 | 3.85x |
| 256 | 0.0625 | 0.015625 | 4.078 | 3.92x |
| Channel-wise | ~0.004 | ~0.001 | 4.005 | 3.99x |

Note: Group size 128 is the most common default. The overhead from scales/zeros is small (3.9% at g=128) but not zero.
def model_size_int4(num_params, group_size=128, asymmetric=True):
    """Calculate INT4 model size in bytes including scale/zero overhead."""
    # Weight storage: 4 bits per param = 0.5 bytes per param
    weight_bytes = num_params * 0.5

    # Scale storage: 1 FP16 per group = 2 bytes per group
    num_groups = num_params / group_size
    scale_bytes = num_groups * 2

    # Zero-point storage: 1 INT4 per group = 0.5 bytes per group
    zero_bytes = num_groups * 0.5 if asymmetric else 0

    total = weight_bytes + scale_bytes + zero_bytes
    effective_bits = (total * 8) / num_params
    compression = 16 / effective_bits  # vs FP16

    return {
        'total_bytes': total,
        'total_gb': total / 1e9,  # decimal gigabytes
        'effective_bits': effective_bits,
        'compression_ratio': compression
    }

# Llama 70B: ~70 billion parameters
result = model_size_int4(70e9, group_size=128)
print(f"Llama 70B INT4 (g=128): {result['total_gb']:.1f} GB, "
      f"{result['effective_bits']:.2f} bits/param, "
      f"{result['compression_ratio']:.2f}x compression")
# Llama 70B INT4 (g=128): 36.4 GB, 4.16 bits/param, 3.85x compression

Memory Access Patterns and Coalescing

INT4 packed data has unique coalescing requirements. A warp of 32 threads performing a 128-byte coalesced load fetches 32 uint32 values = 256 INT4 values. This is 4x more weights per load than FP16, where the same 128-byte transaction fetches only 64 values, and it is the fundamental source of INT4's memory bandwidth advantage.

// Coalesced load: 32 threads each load 4 bytes (uint32) = 128 bytes
// This fetches 256 INT4 weights in a single memory transaction

// Non-coalesced access pattern (bad):
// Thread i loads weight[i * stride] where stride > 1
// This generates multiple memory transactions

// Coalesced access pattern (good):
// Thread i loads packed_weight[base + i]
// All 32 threads access consecutive uint32 addresses

The memory bandwidth utilization for INT4 vs FP16:

Effective Memory Bandwidth Utilization (GEMV, H100)

| Kernel | Bandwidth | % of HBM3 peak (3,350 GB/s) |
|---|---|---|
| FP16 GEMV | 2,100 GB/s | 63% |
| INT4 GEMV (GPTQ) | 2,400 GB/s | 72% |
| INT4 GEMV (Marlin) | 2,900 GB/s | 87% |

Marlin achieves higher bandwidth utilization because its interleaved layout eliminates bank conflicts in shared memory during the unpack stage, and its async pipeline keeps the memory subsystem continuously fed.

Dequantization Fusion Strategies

Standalone dequantization (unpack, scale, convert) followed by a separate GEMM is slow because it requires writing the full FP16 matrix to memory and reading it back. All production INT4 kernels fuse dequantization with the matrix multiply.

Fusion Approaches

Register-level fusion: Dequantize in registers immediately before the Tensor Core instruction. The FP16 values never touch shared or global memory.

// Register-level dequant + Tensor Core fusion (pseudocode)
// Load packed INT4 from shared memory to registers
uint32_t packed = smem_weights[lane_id];

// Dequantize to FP16 in registers
half2 w01 = dequant_pair(packed, 0, scale, zero);
half2 w23 = dequant_pair(packed, 1, scale, zero);
half2 w45 = dequant_pair(packed, 2, scale, zero);
half2 w67 = dequant_pair(packed, 3, scale, zero);

// Feed directly to Tensor Core mma (inline PTX)
asm volatile(
    "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};"
    : "=f"(c0), "=f"(c1), "=f"(c2), "=f"(c3)
    : "r"(*(uint32_t*)&w01), "r"(*(uint32_t*)&w23),
      "r"(*(uint32_t*)&w45), "r"(*(uint32_t*)&w67),
      "r"(act_frag0), "r"(act_frag1),
      "f"(c0), "f"(c1), "f"(c2), "f"(c3)
);

Shared memory fusion: Dequantize while loading from global to shared memory. The shared memory contains FP16 values ready for Tensor Core consumption. This uses 2x more shared memory than keeping values packed.

Mixed approach: Load packed INT4 to shared memory (saving bandwidth and shared memory), dequantize to registers before Tensor Core execution.


Dequantization Fusion Impact (4096x4096 INT4 GEMM, H100)

| Approach | Time (us) | TFLOPS (effective) | Memory Traffic | Limitation |
|---|---|---|---|---|
| Separate dequant + cuBLAS FP16 GEMM | 284 | 47 | 2x (dequant output + GEMM input) | Memory bound on dequant write |
| Shared memory fusion | 142 | 94 | 1x (packed load only) | 2x shared memory usage |
| Register fusion (Marlin) | 98 | 136 | 1x (packed load only) | Complex register management |
| FP16 cuBLAS (reference) | 88 | 152 | 1x | No quantization overhead |

Note: Register fusion achieves 89% of FP16 cuBLAS throughput while processing 4x less data. The 11% gap is the dequantization overhead.

INT4 on Tensor Cores

Starting with Hopper (H100), NVIDIA Tensor Cores natively support FP8 but not INT4. INT4 values must be upcast to FP16 or FP8 before Tensor Core execution. However, Blackwell (B200) introduces native INT4 Tensor Core support via the mma.sync instruction with .s4 or .u4 type qualifiers.

// Hopper: INT4 must be dequantized to FP16 before Tensor Core
// Blackwell: native INT4 Tensor Core support
#if __CUDA_ARCH__ >= 1000  // Blackwell
    // Direct INT4 x INT4 Tensor Core multiply
    asm volatile(
        "mma.sync.aligned.m16n8k64.row.col.s32.s4.s4.s32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};"
        : ...
    );
#else  // Hopper and earlier: dequantize to FP16 first
    // Unpack INT4 -> FP16, then use FP16 Tensor Core
    half2 weights_fp16 = dequant_int4_to_fp16(packed_int4);
    // ... FP16 mma.sync
#endif

INT4 Tensor Core Throughput by Architecture

| Architecture (path) | TOPS (INT4) |
|---|---|
| A100 (FP16 TC + dequant) | 156 |
| H100 (FP16 TC + dequant) | 495 |
| H100 (FP8 TC + dequant, faster path) | 990 |
| B200 (native INT4 TC, 4.5x vs H100 FP8 path) | 4,500 |

Practical Packing Implementation

A complete packing pipeline for deploying a quantized model:

import torch
import struct

def pack_model_weights_int4(
    state_dict,
    scales_dict,
    zeros_dict,
    group_size=128
):
    """Pack quantized INT4 weights for GPU inference.

    Args:
        state_dict: dict of {layer_name: int4_weights} (values 0-15, shape [K, N])
        scales_dict: dict of {layer_name: fp16_scales} (shape [K//group_size, N])
        zeros_dict: dict of {layer_name: int4_zeros} (values 0-15, shape [K//group_size, N])
    Returns:
        packed_dict: dict ready for GPU kernel consumption
    """
    packed_dict = {}

    for name, weights in state_dict.items():
        K, N = weights.shape
        assert K % 8 == 0

        # Pack 8 INT4 values into each uint32
        weights_uint8 = weights.to(torch.uint8)
        packed = torch.zeros(K // 8, N, dtype=torch.int32, device=weights.device)

        for i in range(8):
            packed |= (weights_uint8[i::8, :].to(torch.int32) & 0xF) << (i * 4)

        packed_dict[f"{name}.qweight"] = packed
        packed_dict[f"{name}.scales"] = scales_dict[name].to(torch.float16)

        # Pack zero-points similarly
        zp = zeros_dict[name]
        num_groups = zp.shape[0]
        if num_groups % 8 != 0:
            # Pad to multiple of 8
            pad_rows = 8 - (num_groups % 8)
            zp = torch.cat([zp, torch.zeros(pad_rows, N, dtype=zp.dtype,
                                            device=zp.device)])
            num_groups = zp.shape[0]

        packed_zp = torch.zeros(num_groups // 8, N, dtype=torch.int32,
                                device=zp.device)
        for i in range(8):
            packed_zp |= (zp[i::8, :].to(torch.int32) & 0xF) << (i * 4)

        packed_dict[f"{name}.qzeros"] = packed_zp

    return packed_dict

# Verification: unpack and check round-trip
def verify_packing(original, packed, K, N):
    """Verify that packing is lossless."""
    for k_packed in range(K // 8):
        for n in range(N):
            val = packed[k_packed, n].item()
            for i in range(8):
                unpacked = (val >> (i * 4)) & 0xF
                expected = original[k_packed * 8 + i, n].item()
                assert unpacked == expected, (
                    f"Mismatch at [{k_packed*8+i}, {n}]: "
                    f"got {unpacked}, expected {expected}"
                )
    print("Packing verification passed!")
⚠️ Endianness Matters

INT4 packing is sensitive to byte order. NVIDIA GPUs are little-endian, so the least significant nibble of a uint32 corresponds to the first INT4 value. If you pack on a big-endian CPU (rare but possible in some ARM configurations), you must byte-swap before uploading to the GPU. All major frameworks (PyTorch, TensorFlow) handle this automatically, but custom packing code must be careful.
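The little-endian nibble order is easy to verify on the host with NumPy (on a big-endian machine the byte view below would come out swapped):

```python
import numpy as np

# Pack values 0..7 into nibbles 0..7 of a single uint32 word
packed = np.array([sum(i << (4 * i) for i in range(8))], dtype=np.uint32)
assert packed[0] == 0x76543210

# On a little-endian host, byte 0 holds the two lowest nibbles
b = packed.view(np.uint8)
assert list(b) == [0x10, 0x32, 0x54, 0x76]
assert b[0] & 0xF == 0          # first INT4 value
assert (b[0] >> 4) & 0xF == 1   # second INT4 value
```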

Performance Analysis

The end-to-end performance of INT4 inference depends on the balance between reduced memory traffic (4x less data) and dequantization overhead.

For decode (batch=1, GEMV): INT4 is strictly better because GEMV is entirely memory-bandwidth bound. The 4x reduction in weight reads directly translates to ~3.5x speedup (accounting for scale/zero overhead).

For prefill (large batch, GEMM): INT4 may be slower than FP16 for large batch sizes where the GEMM becomes compute-bound. The dequantization overhead reduces effective FLOPS.

πŸ“Š

INT4 vs FP16 Performance Across Batch Sizes (Llama 70B, H100)

| Batch Size | FP16 (ms) | INT4 Marlin (ms) | INT4 Speedup | Bottleneck |
|---|---|---|---|---|
| 1 | 38.2 | 11.1 | 3.44x | Memory BW (both) |
| 4 | 39.5 | 12.8 | 3.09x | Memory BW (both) |
| 16 | 45.1 | 18.2 | 2.48x | Mixed |
| 64 | 82.3 | 52.1 | 1.58x | Compute (INT4 dequant overhead) |
| 128 | 148.6 | 108.4 | 1.37x | Compute (both) |
| 256 | 285.2 | 242.8 | 1.17x | Compute (both, INT4 overhead visible) |

Note: At batch=256, INT4's dequantization overhead consumes most of the memory bandwidth savings. The crossover where FP16 becomes faster on H100 is around batch=512 for compute-bound layers.

Summary

INT4 weight packing is the foundation of quantized LLM serving. Two INT4 values share a single byte, requiring bit manipulation for packing (offline) and unpacking (every inference step). The memory layout β€” column-major (GPTQ), row-major (AWQ), or interleaved (Marlin) β€” determines how efficiently the GPU loads packed data. Dequantization must be fused with matrix multiplication to avoid writing intermediate FP16 values to memory. The Marlin kernel achieves 87-89% of FP16 cuBLAS throughput while loading 4x less data, making INT4 the dominant format for single-GPU LLM serving. Blackwell’s native INT4 Tensor Core support will eliminate the dequantization overhead entirely, closing the remaining gap to theoretical peak throughput.