Part 1 of 30 in the Quantization Masterclass series
  1. Number Formats for AI: FP32, BF16, FP16, FP8 E4M3, FP8 E5M2, NVFP4, MXFP4, INT8, INT4
  2. Weight Quantization: GPTQ, AWQ, and Round-To-Nearest — Algorithms and Implementation
  3. Activation Quantization: SmoothQuant, Per-Tensor Scaling, and W8A8 Inference
  4. FP8 for Training and Inference: E4M3, E5M2, Transformer Engine, and Delayed Scaling
  5. FP4 and MXFP4: The Blackwell Frontier — Sub-Byte Quantization for Next-Gen Inference
  6. KV Cache Quantization: FP8, INT8, INT4, Per-Token Scaling, and the Quality-Memory Tradeoff
  7. Quantization-Aware Training: Fake Quantization, Straight-Through Estimator, and QAT vs PTQ
  8. Mixed Precision Inference: Which Ops Use Which Precision and Why
  9. Calibration for Post-Training Quantization: MinMax, Percentile, MSE-Optimal, and Cross-Layer
  10. Quantization Hardware Support: Tensor Core Precision Matrix, cuBLAS INT8, and Marlin Kernels
  11. Per-Channel vs Per-Group vs Per-Tensor Scaling: Granularity Tradeoffs in Weight Quantization
  12. The Outlier Channel Problem: Why LLM Activations Break Simple Quantization
  13. W4A16 Inference: 4-Bit Weights with FP16 Activations and the Marlin Kernel
  14. W8A8 INT8 Inference: cuBLAS INT8 GEMM, Per-Tensor Scaling, and When INT8 Beats FP8
  15. GGUF Quantization Types: Q4_K_M, Q5_K_M, Q8_0 — How llama.cpp Quantizes for CPU
  16. AWQ Deep Dive: Activation-Aware Weight Quantization — The Algorithm Step by Step
  17. GPTQ Deep Dive: Hessian-Based One-Shot Quantization — OBS, Column-Wise Updates, and Lazy Batch
  18. SqueezeLLM and Non-Uniform Quantization: Lookup Tables, Sparse Outliers, and Mixed Strategies
  19. Quantization for Training: FP8 GEMM, Loss Scaling, and Why BF16 Remains the Default
  20. Quantization Production Guide: Choosing the Right Method for Your Model, Hardware, and Latency SLO
  21. Combining Sparsity and Quantization: 2:4 Structured Sparsity with INT8 for Maximum Throughput
  22. Dynamic vs Static Quantization: Online Calibration, Offline Calibration, and When Each Wins
  23. AQLM and Extreme Compression: 2-Bit Quantization with Additive Codebooks
  24. Quantized Draft Models for Speculative Decoding: INT4 Drafters with FP16 Verification
  25. Quantization Benchmarking: How to Properly Measure Quality Loss, Throughput, and Cost Impact
  26. INT4 Weight Packing: Bit Manipulation, Dequantization Kernels, and Memory Layout
  27. Serving Quantized Models: vLLM, TRT-LLM, and llama.cpp Integration
  28. Debugging Quantization: Layer Sensitivity, Outlier Detection, and Quality Recovery
  29. Future of Quantization: Sub-4-Bit, Ternary, and Binary Neural Networks
  30. End-to-End Quantization Pipeline: From FP16 Checkpoint to Production INT4 Deployment
ℹ️ Before You Start

This series assumes you understand: (1) GEMM (General Matrix Multiplication) — the core operation in LLM inference, where the model multiplies input tensors by weight matrices. Every transformer layer runs multiple GEMMs (QKV projection, attention output, FFN up/down). Inference throughput is dominated by GEMM speed. (2) KV Cache — during autoregressive generation, the model stores Key and Value tensors from previous tokens to avoid recomputation. KV cache grows with sequence length and batch size, often consuming 30-60% of GPU memory. (3) Calibration data — a small representative dataset (128-512 samples) passed through the model to collect activation statistics (ranges, distributions). These statistics guide quantization scale/zero-point selection. Without calibration, quantization must use worst-case ranges, wasting precision. For deeper coverage of these concepts, see the Transformer Anatomy and Inference Optimization Timeline series.

FP16 has 5 exponent bits and 10 mantissa bits. BF16 has 8 exponent bits and 7 mantissa bits. That 3-bit shift from mantissa to exponent makes BF16 coarser at distinguishing nearby values (precision loss) but able to reach far larger and far smaller magnitudes (range gain). For neural network training, gradients span six orders of magnitude; BF16's extended range prevents underflow during backprop, which is why it displaced FP16 as the default training precision despite being 8x coarser at any given magnitude. FP8 E4M3 allocates 4 bits to exponent and 3 to mantissa, sacrificing range for precision. E5M2 does the opposite. INT8 has no exponent at all: uniform spacing across its range. These bit allocation choices determine whether your quantized model achieves a 2x speedup or crashes with NaN outputs.

This post is a complete reference for every number format used in modern AI: FP32, BF16, FP16, FP8 E4M3/E5M2, NVFP4, MXFP4, INT8, and INT4. We examine each format at the bit level, document which GPU architectures provide hardware support, and implement format conversion in Python so you can see exactly how bits map to representable values.

The Anatomy of a Floating-Point Number

Every floating-point format follows the same structure:

\text{value} = (-1)^{s} \times 2^{e - \text{bias}} \times (1 + m)

where s is the sign bit (0 or 1), e is the stored exponent (an unsigned integer), bias is a format-specific constant that centers the exponent range around zero, and m is the fractional part of the mantissa (the “1 +” is implicit for normal numbers).

The key design trade-off is always the same: given a fixed total bit budget, how do you split bits between exponent and mantissa?

  • More exponent bits = wider dynamic range (larger and smaller numbers representable)
  • More mantissa bits = finer precision (more distinct values between any two powers of two)

This trade-off is the single most important concept in this entire series. Every format we examine is a specific answer to this question.
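To make the trade-off concrete, here is a small helper that derives range and precision directly from an exponent/mantissa split. It is an illustrative sketch (the name `format_stats` is ours) assuming IEEE-style conventions: a reserved all-ones exponent for inf/NaN and an implicit leading 1 for normal numbers.

```python
def format_stats(exp_bits, man_bits, bias=None):
    """Derive max value, smallest normal, and relative error from a bit split.

    Assumes IEEE-style encoding: top exponent reserved for inf/NaN,
    implicit leading 1 for normals.
    """
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1        # standard IEEE bias
    max_exp = (2 ** exp_bits - 2) - bias       # largest non-reserved exponent
    max_val = 2.0 ** max_exp * (2 - 2.0 ** -man_bits)
    min_normal = 2.0 ** (1 - bias)
    rel_err = 2.0 ** -(man_bits + 1)           # half a ULP at any magnitude
    return max_val, min_normal, rel_err

for name, e, m in [("FP32", 8, 23), ("BF16", 8, 7), ("FP16", 5, 10)]:
    mx, mn, err = format_stats(e, m)
    print(f"{name}: max={mx:.4e}  min_normal={mn:.4e}  rel_err={err:.2e}")
```

Moving a bit from mantissa to exponent roughly squares the dynamic range while doubling the relative error — exactly the FP16-to-BF16 move.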

FP32: IEEE 754 Single Precision

FP32 is the baseline format. It uses 32 bits: 1 sign, 8 exponent, 23 mantissa.

Bit layout (32 bits total):
[S | EEEEEEEE | MMMMMMMMMMMMMMMMMMMMMMM]
 1      8                  23
  • Exponent bias: 127
  • Range: \pm 3.4028 \times 10^{38}
  • Smallest normal: 1.175 \times 10^{-38}
  • Precision: ~7.2 decimal digits (2^{-23} \approx 1.19 \times 10^{-7} relative error)

FP32 is the “ground truth” format for AI. All other formats are measured against it. Model training historically used FP32 for all parameters, gradients, and optimizer states. The shift away from FP32 began with mixed-precision training (Micikevicius et al., 2018), which demonstrated that FP16 could be used for forward/backward passes while maintaining an FP32 master copy of weights.

Hardware support: Every GPU ever made supports FP32 arithmetic. On modern NVIDIA GPUs, however, FP32 runs on CUDA cores at only a small fraction of the throughput that tensor cores deliver at lower precisions.

import struct
import numpy as np

def fp32_to_bits(value):
    """Convert FP32 float to its 32-bit representation."""
    packed = struct.pack('>f', value)
    bits = int.from_bytes(packed, 'big')
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

def bits_to_fp32(sign, exponent, mantissa):
    """Reconstruct FP32 from sign, exponent, mantissa."""
    bits = (sign << 31) | (exponent << 23) | mantissa
    packed = struct.pack('>I', bits)
    return struct.unpack('>f', packed)[0]

# Example: inspect 3.14
s, e, m = fp32_to_bits(3.14)
print(f"3.14 in FP32: sign={s}, exp={e} (unbiased={e-127}), mantissa=0x{m:06X}")
# sign=0, exp=128 (unbiased=1), mantissa=0x48F5C3
# value = (-1)^0 * 2^1 * (1 + 0x48F5C3/0x800000) = 2 * 1.57 = 3.14

BF16: Brain Floating Point

BF16 was designed by Google Brain specifically for deep learning. It uses 16 bits: 1 sign, 8 exponent, 7 mantissa.

Bit layout (16 bits total):
[S | EEEEEEEE | MMMMMMM]
 1      8          7
  • Exponent bias: 127 (same as FP32)
  • Range: \pm 3.39 \times 10^{38} (same as FP32)
  • Smallest normal: 1.175 \times 10^{-38} (same as FP32)
  • Precision: ~2.15 decimal digits (2^{-7} \approx 0.0078 relative error)

BF16 is simply the upper 16 bits of FP32. Truncating an FP32 value to BF16 requires no rescaling — you just drop the lower 16 mantissa bits. This is why conversion between FP32 and BF16 is trivial.

Why BF16 Won Over FP16 for Training

The critical advantage of BF16 over FP16 is range. BF16 has 8 exponent bits, matching FP32’s range exactly. FP16 has only 5 exponent bits, giving a maximum value of 65504. In deep learning training, gradient values, loss values, and intermediate activations can easily exceed 65504, causing overflow to infinity. With FP16, you must use loss scaling to keep gradients in range. With BF16, you do not — the range matches FP32, so nothing overflows.

The precision loss (7 mantissa bits vs 10 for FP16) rarely matters in practice. Neural network training is inherently noisy due to stochastic gradient descent, and the noise from BF16 rounding is negligible compared to the noise from minibatch sampling.

💡 BF16 = Truncated FP32

Converting FP32 to BF16 is a single right-shift: drop the lower 16 bits of the mantissa. No exponent adjustment, no rescaling, no loss scaling needed. This is why BF16 has become the default training format — it is the simplest possible 16-bit format that preserves FP32 range.

def fp32_to_bf16(value):
    """Convert FP32 to BF16 by truncating lower 16 mantissa bits."""
    fp32_bits = struct.pack('>f', value)
    fp32_int = int.from_bytes(fp32_bits, 'big')

    # Round to nearest even (banker's rounding)
    rounding_bias = 0x7FFF + ((fp32_int >> 16) & 1)
    bf16_int = (fp32_int + rounding_bias) >> 16

    return bf16_int & 0xFFFF

def bf16_to_fp32(bf16_int):
    """Convert BF16 back to FP32 by zero-padding lower 16 bits."""
    fp32_int = bf16_int << 16
    packed = struct.pack('>I', fp32_int)
    return struct.unpack('>f', packed)[0]

# Demonstrate range preservation
val = 1e35
bf16 = fp32_to_bf16(val)
recovered = bf16_to_fp32(bf16)
print(f"Original: {val:.4e}, BF16 recovered: {recovered:.4e}")
# Both ~1e35 -- range preserved, precision lost in lower digits

Hardware support: NVIDIA A100 (Ampere) and later, Google TPUv2 and later, AMD MI250 and later, Intel Gaudi.

FP16: IEEE 754 Half Precision

FP16 is the original IEEE half-precision format. It uses 16 bits: 1 sign, 5 exponent, 10 mantissa.

Bit layout (16 bits total):
[S | EEEEE | MMMMMMMMMM]
 1     5         10
  • Exponent bias: 15
  • Range: \pm 65504 (max value)
  • Smallest normal: 6.1 \times 10^{-5}
  • Precision: ~3.31 decimal digits (2^{-10} \approx 9.77 \times 10^{-4} relative error)

FP16 has better precision than BF16 (10 vs 7 mantissa bits) but dramatically worse range (65504 vs 3.4 \times 10^{38}). The limited range is FP16’s fatal flaw for training. Any value exceeding 65504 becomes infinity. Gradient norms during training frequently exceed this threshold, requiring loss scaling to keep computations in range.
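Loss scaling is the standard workaround: multiply the loss (and therefore every gradient) by a constant before the backward pass, then divide it back out in FP32 before the optimizer step. A minimal NumPy sketch, standing in for a real training loop (the static scale of 2^14 is an illustrative choice):

```python
import numpy as np

# A gradient of 1e-8 is below FP16's smallest subnormal (~6e-8): it flushes to zero.
grad = np.float32(1e-8)
print(np.float16(grad))                  # 0.0 -- underflow

# Scale up before casting to FP16, divide back out in FP32.
scale = np.float32(2 ** 14)
scaled_grad = np.float16(grad * scale)   # 1.6384e-4: comfortably representable
recovered = np.float32(scaled_grad) / scale
print(f"recovered: {recovered:.3e}")     # ~1e-8, small rounding error only
```

The same gradient that vanished entirely in plain FP16 survives the round trip with only a mantissa-level rounding error. BF16 makes this machinery unnecessary.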

📊

FP16 vs BF16: The Precision-Range Trade-off

| Property | FP16 | BF16 | Winner for Training |
|---|---|---|---|
| Total bits | 16 | 16 | Tie |
| Exponent bits | 5 | 8 | BF16 |
| Mantissa bits | 10 | 7 | FP16 |
| Max value | 65,504 | 3.39e38 | BF16 |
| Min normal | 6.1e-5 | 1.18e-38 | BF16 |
| Precision (decimal digits) | 3.31 | 2.15 | FP16 |
| Needs loss scaling? | Yes | No | BF16 |
| FP32 conversion cost | Rescale | Bit truncation | BF16 |
Note: BF16 wins on 5 of 7 criteria relevant to training. FP16 is preferred only when precision matters more than range (some inference scenarios).

For inference, FP16 remains useful because the range limitation matters less — weights and activations are fixed, and you can calibrate scale factors to keep everything in range. The extra precision (10 vs 7 mantissa bits) can improve quality in precision-sensitive inference workloads.

Hardware support: NVIDIA Pascal (P100) and later for fast FP16 arithmetic; Volta (V100) and later for FP16 tensor cores; all GPUs via CUDA cores. AMD MI-series. Apple M-series. Qualcomm Adreno.

def fp32_to_fp16(value):
    """Convert FP32 to FP16 with proper exponent rebias."""
    fp32_bits = struct.pack('>f', value)
    fp32_int = int.from_bytes(fp32_bits, 'big')

    sign = (fp32_int >> 31) & 1
    exp = (fp32_int >> 23) & 0xFF
    frac = fp32_int & 0x7FFFFF

    # Handle special cases
    if exp == 0xFF:  # inf or NaN
        fp16_exp = 0x1F
        fp16_frac = 0 if frac == 0 else 0x200
    elif exp == 0:  # zero or subnormal
        fp16_exp = 0
        fp16_frac = 0
    else:
        new_exp = exp - 127 + 15  # Rebias from 127 to 15
        if new_exp >= 0x1F:  # Overflow to infinity
            fp16_exp = 0x1F
            fp16_frac = 0
        elif new_exp <= 0:  # Underflow: flush to zero (subnormals omitted for simplicity)
            fp16_exp = 0
            fp16_frac = 0
        else:
            fp16_exp = new_exp
            fp16_frac = frac >> 13  # Keep top 10 mantissa bits

    fp16_int = (sign << 15) | (fp16_exp << 10) | fp16_frac
    return fp16_int

# Demonstrate the range limitation: 100000.0 exceeds the FP16 max of 65504
fp16 = fp32_to_fp16(100000.0)
exp = (fp16 >> 10) & 0x1F
print(f"100000.0 -> FP16 exp={exp:#x}")  # exp=0x1f = infinity

FP8 E4M3: Precision-Optimized 8-Bit Float

FP8 E4M3 uses 8 bits: 1 sign, 4 exponent, 3 mantissa.

Bit layout (8 bits total):
[S | EEEE | MMM]
 1    4      3
  • Exponent bias: 7
  • Range: \pm 448 (max value; the all-ones pattern S.1111.111 encodes NaN)
  • Smallest normal: 2^{-6} = 0.015625
  • Precision: ~1.2 decimal digits (2^{-3} = 0.125 relative error)
  • Unique representable values: 253 (256 bit patterns minus 2 NaN encodings, with ±0 counted once)

E4M3 sacrifices range for precision compared to E5M2. The maximum value of 448 is modest, but 3 mantissa bits provide 8 distinct values between any two consecutive powers of two. This makes E4M3 the preferred format for forward pass computations where weight and activation values have been scaled to fit within the representable range, and you want the best possible precision for accumulation accuracy.

ℹ️ E4M3 Special Encoding

Unlike IEEE floats, FP8 E4M3 (as defined in the OCP spec) has no infinity. The top exponent row (exponent all 1s), which a strict IEEE interpretation would reserve entirely for inf/NaN, instead encodes ordinary values from 256 up to 448; only the single pattern S.1111.111 (exponent=15, mantissa=7) encodes NaN. Reclaiming these patterns extends the maximum value from 240 (the strict-IEEE limit) to 448.

def fp32_to_e4m3(value):
    """Convert FP32 to FP8 E4M3 format."""
    # Clamp to representable range
    max_val = 448.0
    value = max(min(value, max_val), -max_val)

    if value == 0.0:
        return 0x00

    sign = 0
    if value < 0:
        sign = 1
        value = -value

    # Find exponent: value = 2^exp * mantissa, 1.0 <= mantissa < 2.0
    import math
    exp = math.floor(math.log2(value))
    exp_biased = exp + 7  # bias = 7

    if exp_biased <= 0:
        # Subnormal: stored exponent 0, value = (mantissa/8) * 2^-6
        exp_biased = 0
        mantissa_bits = int(round(value / (2 ** -6) * 8))
        if mantissa_bits == 8:  # Rounded up into the smallest normal
            exp_biased = 1
            mantissa_bits = 0
    else:
        mantissa_frac = value / (2 ** exp) - 1.0  # Remove implicit 1
        mantissa_bits = int(round(mantissa_frac * 8))
        if mantissa_bits == 8:  # Rounding overflow: carry into the exponent
            mantissa_bits = 0
            exp_biased += 1
        if exp_biased == 15 and mantissa_bits == 7:
            mantissa_bits = 6  # S.1111.111 is NaN; clamp to max normal (448)

    result = (sign << 7) | (exp_biased << 3) | mantissa_bits
    return result

def e4m3_to_fp32(byte_val):
    """Convert FP8 E4M3 back to FP32."""
    sign = (byte_val >> 7) & 1
    exp = (byte_val >> 3) & 0x0F
    mantissa = byte_val & 0x07

    if exp == 0x0F and mantissa == 0x07:
        return float('nan')

    if exp == 0:
        # Subnormal
        value = (mantissa / 8.0) * (2 ** -6)
    else:
        value = (1.0 + mantissa / 8.0) * (2 ** (exp - 7))

    return -value if sign else value

# Enumerate some values
print("FP8 E4M3 sample values:")
for bits in [0x00, 0x01, 0x38, 0x3F, 0x7E, 0x7F]:
    val = e4m3_to_fp32(bits)
    print(f"  0x{bits:02X} -> {val}")

FP8 E5M2: Range-Optimized 8-Bit Float

FP8 E5M2 uses 8 bits: 1 sign, 5 exponent, 2 mantissa.

Bit layout (8 bits total):
[S | EEEEE | MM]
 1     5      2
  • Exponent bias: 15
  • Range: \pm 57344 (max finite value)
  • Smallest normal: 2^{-14} \approx 6.1 \times 10^{-5}
  • Precision: ~0.9 decimal digits (2^{-2} = 0.25 relative error)
  • Unique representable values: 247 (256 bit patterns minus 2 infinities and 6 NaN encodings, with ±0 counted once)

E5M2 follows IEEE conventions more closely: it has infinity and NaN encodings. The 5 exponent bits give the same dynamic range as FP16. The 2 mantissa bits mean only 4 distinct values exist between consecutive powers of two — coarse precision, but the wide range is critical for backward pass computations where gradient magnitudes vary enormously.

E4M3 vs E5M2: The Precision-Range Trade-off at 8 Bits

📊

FP8 E4M3 vs E5M2 Comparison

| Property | E4M3 | E5M2 |
|---|---|---|
| Exponent bits | 4 | 5 |
| Mantissa bits | 3 | 2 |
| Max value | 448 | 57,344 |
| Min normal | 0.015625 | 6.1e-5 |
| Values per power-of-2 | 8 | 4 |
| Has infinity? | No | Yes |
| Total unique values | 253 | 247 |
| Primary use | Forward pass | Backward pass |
Note: E4M3 for forward (precision matters), E5M2 for backward (range matters). This split was introduced by NVIDIA Transformer Engine.

The recommended practice (established by Micikevicius et al., 2022) is to use E4M3 for the forward pass and E5M2 for the backward pass. Forward pass values (weights and activations) are relatively well-bounded and benefit from precision. Backward pass values (gradients) have extreme dynamic range and benefit from the wider exponent.

def fp32_to_e5m2(value):
    """Convert FP32 to FP8 E5M2 format."""
    max_val = 57344.0
    if abs(value) > max_val:
        # Return infinity
        sign = 1 if value < 0 else 0
        return (sign << 7) | 0x7C  # exp=31, mantissa=0

    if value == 0.0:
        return 0x00

    sign = 0
    if value < 0:
        sign = 1
        value = -value

    import math
    exp = math.floor(math.log2(value))
    exp_biased = exp + 15

    if exp_biased <= 0:
        # Subnormal: stored exponent 0, value = (mantissa/4) * 2^-14
        exp_biased = 0
        mantissa_bits = int(round(value / (2 ** -14) * 4))
        if mantissa_bits == 4:  # Rounded up into the smallest normal
            exp_biased = 1
            mantissa_bits = 0
    elif exp_biased >= 31:
        return (sign << 7) | 0x7C  # infinity
    else:
        mantissa_frac = value / (2 ** exp) - 1.0
        mantissa_bits = int(round(mantissa_frac * 4))
        if mantissa_bits == 4:  # Rounding overflow: carry into the exponent
            mantissa_bits = 0
            exp_biased += 1

    return (sign << 7) | (exp_biased << 2) | mantissa_bits

def e5m2_to_fp32(byte_val):
    """Convert FP8 E5M2 back to FP32."""
    sign = (byte_val >> 7) & 1
    exp = (byte_val >> 2) & 0x1F
    mantissa = byte_val & 0x03

    if exp == 0x1F:
        if mantissa == 0:
            return float('-inf') if sign else float('inf')
        return float('nan')

    if exp == 0:
        value = (mantissa / 4.0) * (2 ** -14)
    else:
        value = (1.0 + mantissa / 4.0) * (2 ** (exp - 15))

    return -value if sign else value

# Compare representable values around 1.0
print("Values near 1.0:")
print(f"  E4M3: {[e4m3_to_fp32(b) for b in range(0x38, 0x40)]}")
print(f"  E5M2: {[e5m2_to_fp32(b) for b in range(0x3C, 0x40)]}")

NVFP4: NVIDIA’s 4-Bit Floating Point

NVFP4 is NVIDIA’s proprietary 4-bit floating point format, introduced with the Blackwell architecture (B100/B200). It uses 4 bits: 1 sign, 2 exponent, 1 mantissa.

Bit layout (4 bits total):
[S | EE | M]
 1   2    1
  • Exponent bias: 1
  • Range: \pm 6.0 (max value)
  • Smallest positive value: 0.5 (a subnormal; the smallest normal is 1.0)
  • Precision: 1 mantissa bit means only 2 values per power-of-two interval
  • Unique representable values: 14 (7 magnitudes per sign, excluding zeros)

With only 4 bits, NVFP4 can represent very few distinct values. The entire set of non-negative NVFP4 values is:

0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0

This is strikingly coarse. To make NVFP4 usable, every block of values shares a scaling factor stored in a higher-precision format: one FP8 E4M3 scale factor per block of 16 NVFP4 elements:

x_{\text{dequant}} = \text{scale}_{\text{block}} \times x_{\text{NVFP4}}

This block scaling is what makes 4-bit formats viable — the scale factor shifts the representable range to wherever the actual data lives, and the 4-bit values provide relative precision within that range.

def encode_nvfp4(value):
    """Encode a single value to NVFP4 (4 bits)."""
    # NVFP4 representable positive values: 0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0
    nvfp4_values = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

    sign = 0
    if value < 0:
        sign = 1
        value = -value

    # Find nearest representable value
    best_idx = 0
    best_dist = abs(value - nvfp4_values[0])
    for i, v in enumerate(nvfp4_values):
        dist = abs(value - v)
        if dist < best_dist:
            best_dist = dist
            best_idx = i

    return (sign << 3) | best_idx

def quantize_block_nvfp4(block):
    """Quantize a block of FP32 values to NVFP4 with a shared scale."""
    import numpy as np
    block = np.array(block, dtype=np.float32)

    # Compute per-block scale: max absolute value / max NVFP4 value
    amax = np.max(np.abs(block))
    max_nvfp4 = 6.0
    scale = amax / max_nvfp4 if amax > 0 else 1.0

    # Quantize each element
    scaled = block / scale
    codes = []
    for val in scaled:
        codes.append(encode_nvfp4(float(val)))

    return codes, scale

# Example
import numpy as np
block = np.random.randn(16).astype(np.float32) * 0.1
codes, scale = quantize_block_nvfp4(block)
print(f"Scale: {scale:.6f}")
print(f"Codes: {[f'0x{c:X}' for c in codes]}")

MXFP4: Microscaling FP4

MXFP4 is defined by the Open Compute Project (OCP) Microscaling Formats specification. It is similar in spirit to NVFP4 but uses a standardized shared exponent (microscaling) approach.

Bit layout per element (4 bits):
[S | EE | M]
 1   2    1

Shared exponent per block (8 bits):
[EEEEEEEE]
    8

Each block of 32 elements shares a single 8-bit exponent. Each 4-bit element provides a sign bit plus an E2M1 magnitude (2 exponent bits and 1 mantissa bit). The shared exponent provides the coarse scale, and the per-element bits provide fine positioning within that scale.

The key difference from NVFP4 with a separate scale factor: in MXFP4, the shared exponent is part of the format specification itself. The hardware knows the block size and exponent format, enabling direct tensor core consumption without a separate dequantization step.

MXFP4 vs NVFP4

MXFP4 and NVFP4 both achieve 4-bit inference, but MXFP4 is an open standard (OCP) while NVFP4 is NVIDIA-proprietary. On Blackwell, both formats run at the same tensor core throughput. The practical difference is in the scaling: MXFP4 mandates 32-element blocks with a power-of-two (E8M0) shared exponent, while NVFP4 uses smaller 16-element blocks with an FP8 E4M3 scale, allowing finer-grained, non-power-of-two scaling.

def quantize_block_mxfp4(block):
    """Quantize a 32-element block to MXFP4 format.

    Returns (shared_exponent, element_codes) where each code is 4 bits.
    """
    import numpy as np
    import math
    block = np.array(block, dtype=np.float32)

    # Compute shared exponent from max absolute value
    amax = np.max(np.abs(block))
    if amax == 0:
        return 0, [0] * len(block)

    # Shared exponent: floor(log2(amax))
    shared_exp = int(math.floor(math.log2(amax)))
    # Clamp to 8-bit unsigned range
    shared_exp = max(0, min(255, shared_exp + 127))  # Bias = 127

    # Scale block by shared exponent
    scale = 2.0 ** (shared_exp - 127)
    scaled = block / scale

    # Quantize each element to 4-bit: sign(1) + exp(2) + mantissa(1)
    mxfp4_table = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    codes = []
    for val in scaled:
        sign = 0
        if val < 0:
            sign = 1
            val = -val
        # Round to nearest MXFP4 value
        best_idx = min(range(len(mxfp4_table)),
                       key=lambda i: abs(val - mxfp4_table[i]))
        codes.append((sign << 3) | best_idx)

    return shared_exp, codes

def dequantize_block_mxfp4(shared_exp, codes):
    """Dequantize MXFP4 block back to FP32."""
    mxfp4_table = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    scale = 2.0 ** (shared_exp - 127)
    values = []
    for code in codes:
        sign = (code >> 3) & 1
        idx = code & 0x07
        val = mxfp4_table[idx] * scale
        values.append(-val if sign else val)
    return values

# Round-trip test
block = np.random.randn(32).astype(np.float32)
exp, codes = quantize_block_mxfp4(block)
recovered = dequantize_block_mxfp4(exp, codes)
mse = np.mean((block - np.array(recovered)) ** 2)
print(f"Shared exponent: {exp}, MSE: {mse:.6f}")

INT8: 8-Bit Integer

INT8 is the simplest quantized format. It uses 8 bits as a signed integer with no exponent or mantissa — just a linear mapping from integers to real values via a scale factor and zero point.

Bit layout (8 bits):
[SSSSSSSS]  (two's complement signed integer)
Range: -128 to 127

The mapping from INT8 to real values is:

x_{\text{real}} = \text{scale} \times (x_{\text{int8}} - \text{zero\_point})

INT8 quantization is uniform: the spacing between consecutive representable values is constant (equal to the scale factor). This contrasts with floating-point formats where spacing increases with magnitude.

⚠️ INT8 and Outliers

Uniform quantization is poorly suited to distributions with outliers. If your data has a few values at 100x the typical magnitude, the scale factor must accommodate them, leaving only ~1% of the integer range for the typical values. This is the fundamental motivation behind SmoothQuant and other outlier-handling techniques covered in Part 3.
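A quick experiment makes this failure mode visible. The snippet below is a toy demo of our own: it compares the signal-to-quantization-noise ratio (SQNR) of a well-behaved tensor before and after planting a single outlier at roughly 100x the typical magnitude.

```python
import numpy as np

def int8_sqnr(x):
    """SQNR in dB after a symmetric per-tensor INT8 round trip."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127) * scale
    noise = np.mean((x - q) ** 2)
    return 10 * np.log10(np.mean(x ** 2) / noise)

rng = np.random.default_rng(0)
clean = rng.normal(0, 0.01, 4096).astype(np.float32)
outliered = clean.copy()
outliered[0] = 1.0  # one value at ~100x the typical magnitude

print(f"clean:   {int8_sqnr(clean):.1f} dB")
print(f"outlier: {int8_sqnr(outliered):.1f} dB")  # substantially worse
```

One outlier forces the scale factor 100x larger, so the quantization step swallows most of the ordinary values; the SQNR drops by tens of dB.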

def quantize_symmetric_int8(tensor, per_channel=True):
    """Symmetric INT8 quantization (zero_point = 0)."""
    import numpy as np
    tensor = np.array(tensor, dtype=np.float32)

    if per_channel:
        # Scale per output channel (axis 0)
        amax = np.max(np.abs(tensor), axis=1, keepdims=True)
    else:
        # Single scale for entire tensor
        amax = np.max(np.abs(tensor))

    scale = amax / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # Avoid division by zero

    quantized = np.clip(np.round(tensor / scale), -128, 127).astype(np.int8)
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Dequantize INT8 back to FP32."""
    return quantized.astype(np.float32) * scale

# Example: quantize a weight matrix
weight = np.random.randn(64, 256).astype(np.float32) * 0.02
q_weight, scale = quantize_symmetric_int8(weight, per_channel=True)
recon = dequantize_int8(q_weight, scale)
mse = np.mean((weight - recon) ** 2)
print(f"INT8 quantization MSE: {mse:.8f}")
print(f"Scale shape: {scale.shape}")  # (64, 1) for per-channel

Hardware support: NVIDIA Turing (T4) and later for INT8 tensor cores. All GPUs support INT8 via CUDA cores. AMD MI-series. Intel Gaudi. Qualcomm Hexagon DSP.

INT4: 4-Bit Integer

INT4 uses 4 bits as either a signed integer (-8 to 7) or unsigned integer (0 to 15). In practice, signed INT4 with symmetric quantization is most common for weight quantization.

Bit layout (4 bits):
[SSSS]  (signed: -8 to 7, unsigned: 0 to 15)

With only 16 distinct values, INT4 is extremely coarse. Per-group quantization (one scale factor per group of 32-128 weights) is essential to maintain quality.

x_{\text{real}} = \text{scale}_g \times x_{\text{int4}}

where g indexes the group that the element belongs to.

def quantize_int4_grouped(tensor, group_size=128):
    """INT4 quantization with per-group scaling."""
    import numpy as np
    tensor = np.array(tensor, dtype=np.float32)
    rows, cols = tensor.shape

    # Pad columns to multiple of group_size
    pad = (group_size - cols % group_size) % group_size
    if pad > 0:
        tensor = np.pad(tensor, ((0, 0), (0, pad)))

    cols_padded = tensor.shape[1]
    num_groups = cols_padded // group_size

    # Reshape to (rows, num_groups, group_size)
    grouped = tensor.reshape(rows, num_groups, group_size)

    # Per-group scale
    amax = np.max(np.abs(grouped), axis=2, keepdims=True)
    scale = amax / 7.0  # Symmetric: map to [-7, 7]
    scale = np.where(scale == 0, 1.0, scale)

    quantized = np.clip(np.round(grouped / scale), -8, 7).astype(np.int8)
    return quantized, scale, cols

def dequantize_int4_grouped(quantized, scale, original_cols):
    """Dequantize INT4 groups back to FP32."""
    recon = quantized.astype(np.float32) * scale
    rows = recon.shape[0]
    recon = recon.reshape(rows, -1)[:, :original_cols]
    return recon

# Example: group_size=128
weight = np.random.randn(256, 1024).astype(np.float32) * 0.01
q, s, orig_cols = quantize_int4_grouped(weight, group_size=128)
recon = dequantize_int4_grouped(q, s, orig_cols)
mse = np.mean((weight - recon) ** 2)
print(f"INT4 g128 MSE: {mse:.8f}")

GPU Hardware Support Matrix

📊

Number Format Hardware Support by GPU Architecture

| Format | V100 (Volta) | T4 (Turing) | A100 (Ampere) | H100 (Hopper) | B200 (Blackwell) |
|---|---|---|---|---|---|
| FP32 | Yes | Yes | Yes | Yes | Yes |
| FP16 Tensor | Yes | Yes | Yes | Yes | Yes |
| BF16 Tensor | No | No | Yes | Yes | Yes |
| INT8 Tensor | No | Yes | Yes | Yes | Yes |
| FP8 E4M3 | No | No | No | Yes | Yes |
| FP8 E5M2 | No | No | No | Yes | Yes |
| INT4 Tensor | No | No | No | No | Yes |
| NVFP4 | No | No | No | No | Yes |
| MXFP4 | No | No | No | No | Yes |
Note: 'Tensor' means supported by tensor cores with native throughput. INT4 and FP4 formats on Blackwell achieve 2x TOPS vs FP8.

Peak Tensor Core TOPS by Format (Single GPU)

| Format (GPU) | Peak TOPS |
|---|---|
| FP32 (V100) | 15 |
| FP16 (A100) | 312 |
| BF16 (A100) | 312 |
| FP8 (H100) | 1,979 |
| FP4 (B200) | 4,500 (~2.3x vs FP8) |

Comprehensive Format Comparison

📊

All AI Number Formats at a Glance

| Format | Bits | Sign | Exponent | Mantissa | Max value | Precision (decimal digits) | Values per 2x interval |
|---|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | 3.4e38 | 7.2 | 8,388,608 |
| BF16 | 16 | 1 | 8 | 7 | 3.4e38 | 2.2 | 128 |
| FP16 | 16 | 1 | 5 | 10 | 65,504 | 3.3 | 1,024 |
| FP8 E4M3 | 8 | 1 | 4 | 3 | 448 | 1.2 | 8 |
| FP8 E5M2 | 8 | 1 | 5 | 2 | 57,344 | 0.9 | 4 |
| NVFP4 | 4 | 1 | 2 | 1 | 6.0 | N/A | 2 |
| MXFP4 | 4 + shared | 1 | 2 | 1 | block-scaled | N/A | 2 |
| INT8 | 8 | N/A | N/A | N/A | 127 (uniform) | N/A | N/A (uniform) |
| INT4 | 4 | N/A | N/A | N/A | 7 (uniform) | N/A | N/A (uniform) |
Note: Floating-point formats have non-uniform spacing (denser near zero). Integer formats have uniform spacing determined by scale factor. MXFP4 effective bits include amortized shared exponent cost.
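The spacing note is worth internalizing. A float format's gap between consecutive values (its ULP) grows with magnitude, while an integer format's gap is fixed by the scale factor. A minimal illustration for E4M3 versus INT8 stretched over the same \pm 448 range (the helper name `e4m3_ulp` is ours):

```python
# E4M3 has 3 mantissa bits, so consecutive normals near 2^e are 2^e / 8 apart.
def e4m3_ulp(exp_unbiased):
    return 2.0 ** exp_unbiased / 8

for e in [-6, 0, 8]:
    print(f"E4M3 step near 2^{e:+d}: {e4m3_ulp(e)}")
# Steps grow from ~0.002 at the bottom of the range to 32 at the top.

# INT8 covering the same [-448, 448] range has one fixed step everywhere.
int8_step = 448.0 / 127.0
print(f"INT8 step everywhere: {int8_step:.3f}")
```

Floats spend their resolution near zero, where most weights and activations live; integers spread it evenly, which is why integer formats lean so heavily on well-chosen scale factors.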

Putting It All Together: Universal Format Converter

Here is a complete Python module that converts between all formats:

import numpy as np
import struct

class NumberFormatConverter:
    """Convert between AI number formats."""

    @staticmethod
    def fp32_to_bf16_array(arr):
        """Vectorized FP32 to BF16."""
        arr = np.asarray(arr, dtype=np.float32)
        int32 = arr.view(np.uint32)
        # Round to nearest even
        rounding = np.uint32(0x7FFF) + ((int32 >> 16) & np.uint32(1))
        bf16 = ((int32 + rounding) >> 16).astype(np.uint16)
        return bf16

    @staticmethod
    def bf16_to_fp32_array(bf16_arr):
        """Vectorized BF16 to FP32."""
        bf16_arr = np.asarray(bf16_arr, dtype=np.uint16)
        fp32_int = bf16_arr.astype(np.uint32) << 16
        return fp32_int.view(np.float32)

    @staticmethod
    def quantize_e4m3(arr, block_size=32):
        """Quantize FP32 array to E4M3 with per-block scaling."""
        arr = np.asarray(arr, dtype=np.float32).flatten()
        n = len(arr)
        pad = (block_size - n % block_size) % block_size
        if pad:
            arr = np.concatenate([arr, np.zeros(pad, dtype=np.float32)])

        num_blocks = len(arr) // block_size
        blocks = arr.reshape(num_blocks, block_size)
        amax = np.max(np.abs(blocks), axis=1, keepdims=True)
        scales = amax / 448.0
        scales = np.where(scales == 0, 1.0, scales)
        scaled = blocks / scales

        # Table of all non-negative E4M3 values: zero, 7 subnormals, then
        # normals for exponents -6..8. The top exponent stops at mantissa 6
        # because S.1111.111 is NaN, so the max value is 448. 127 entries.
        e4m3_pos = np.array(
            [0.0]
            + [m / 8 * 2.0 ** -6 for m in range(1, 8)]
            + [(1 + m / 8) * 2.0 ** e for e in range(-6, 8) for m in range(8)]
            + [(1 + m / 8) * 2.0 ** 8 for m in range(7)],
            dtype=np.float32)

        # Nearest-value lookup. Note: each code stores a sign bit plus a
        # table index, not the actual E4M3 bit pattern.
        signs = (scaled < 0).astype(np.uint8)
        idx = np.argmin(np.abs(np.abs(scaled)[..., None] - e4m3_pos), axis=-1)
        codes = (signs << 7) | idx.astype(np.uint8)

        return codes, scales, n

    @staticmethod
    def measure_error(original, reconstructed):
        """Compute quantization error metrics."""
        mse = np.mean((original - reconstructed) ** 2)
        rmse = np.sqrt(mse)
        max_err = np.max(np.abs(original - reconstructed))
        snr = 10 * np.log10(np.mean(original ** 2) / (mse + 1e-10))
        return {
            'mse': float(mse),
            'rmse': float(rmse),
            'max_error': float(max_err),
            'snr_db': float(snr),
        }

# Compare quantization error across formats
data = np.random.randn(1024).astype(np.float32) * 0.1

# BF16
bf16 = NumberFormatConverter.fp32_to_bf16_array(data)
bf16_recon = NumberFormatConverter.bf16_to_fp32_array(bf16)
print("BF16:", NumberFormatConverter.measure_error(data, bf16_recon))

# INT8
scale = np.max(np.abs(data)) / 127.0
int8_q = np.clip(np.round(data / scale), -128, 127).astype(np.int8)
int8_recon = int8_q.astype(np.float32) * scale
print("INT8:", NumberFormatConverter.measure_error(data, int8_recon))

# INT4
scale4 = np.max(np.abs(data)) / 7.0
int4_q = np.clip(np.round(data / scale4), -8, 7).astype(np.int8)
int4_recon = int4_q.astype(np.float32) * scale4
print("INT4:", NumberFormatConverter.measure_error(data, int4_recon))

Choosing the Right Format

The decision tree for format selection in practice:

  1. Training master weights and optimizer states: FP32. No compromise.
  2. Training forward/backward compute: BF16 by default. FP8 E4M3/E5M2 on Hopper and later if training is stable at this precision (see Part 4).
  3. Inference weights, quality-sensitive: FP8 E4M3 or INT8. Minimal quality loss, 2x memory savings over FP16.
  4. Inference weights, throughput-optimized: INT4 with per-group scaling (GPTQ/AWQ, see Part 2). 4x memory savings.
  5. Inference weights, maximum throughput on Blackwell: NVFP4 or MXFP4. 2x throughput over FP8 tensor cores.
  6. KV cache: FP8 for minimal quality loss, INT4 for maximum memory savings (see Part 6).
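The decision tree above can be sketched as a function. This is an illustrative summary only — the name `pick_format`, the argument values, and the string labels are ours, and real deployments weigh more factors than two parameters can capture:

```python
def pick_format(stage, gpu_arch="hopper", priority="quality"):
    """Toy encoding of the decision tree above; not exhaustive."""
    if stage == "master_weights":
        return "FP32"
    if stage == "training_compute":
        # FP8 only on Hopper/Blackwell, and only if training stays stable
        if gpu_arch in ("hopper", "blackwell"):
            return "FP8 E4M3 (fwd) / E5M2 (bwd)"
        return "BF16"
    if stage == "inference_weights":
        if priority == "quality":
            return "FP8 E4M3 or INT8"
        if gpu_arch == "blackwell":
            return "NVFP4 or MXFP4"
        return "INT4 with per-group scaling"
    if stage == "kv_cache":
        return "FP8" if priority == "quality" else "INT4"
    raise ValueError(f"unknown stage: {stage}")

print(pick_format("inference_weights", gpu_arch="ampere", priority="throughput"))
```

Treat it as a starting point: latency SLOs, batch sizes, and kernel availability (Marlin, cuBLAS, Transformer Engine) all shift the answer, as later parts of the series show.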

The rest of this series will show you exactly how to implement each of these, starting with weight quantization algorithms in the next post.