This series assumes you understand:

- GEMM (General Matrix Multiplication) — the core operation in LLM inference, where the model multiplies input tensors by weight matrices. Every transformer layer runs multiple GEMMs (QKV projection, attention output, FFN up/down). Inference throughput is dominated by GEMM speed.
- KV cache — during autoregressive generation, the model stores Key and Value tensors from previous tokens to avoid recomputation. The KV cache grows with sequence length and batch size, often consuming 30-60% of GPU memory.
- Calibration data — a small representative dataset (128-512 samples) passed through the model to collect activation statistics (ranges, distributions). These statistics guide quantization scale/zero-point selection. Without calibration, quantization must use worst-case ranges, wasting precision.

For deeper coverage of these concepts, see the Transformer Anatomy and Inference Optimization Timeline series.
FP16 has 5 exponent bits and 10 mantissa bits. BF16 has 8 exponent bits and 7 mantissa bits. That 3-bit shift from mantissa to exponent makes BF16 worse at representing small numbers (precision loss) but better at representing large numbers (range gain). For neural network training, gradients span six orders of magnitude—BF16’s extended range prevents underflow during backprop, which is why it displaced FP16 as the default training precision despite having 8x coarser precision near zero. FP8 E4M3 allocates 4 bits to exponent and 3 to mantissa, sacrificing range for precision. E5M2 does the opposite. INT8 has no exponent at all—uniform precision across its range. These bit allocation choices determine whether your quantized model achieves 2x speedup or crashes with NaN outputs.
This post is a complete reference for every number format used in modern AI: FP32, BF16, FP16, FP8 E4M3/E5M2, NVFP4, MXFP4, INT8, and INT4. We examine each format at the bit level, document which GPU architectures provide hardware support, and implement format conversion in Python so you can see exactly how bits map to representable values.
The Anatomy of a Floating-Point Number
Every floating-point format follows the same structure:

value = (-1)^s × 2^(e - bias) × (1 + f)

where s is the sign bit (0 or 1), e is the stored exponent (an unsigned integer), bias is a format-specific constant that centers the exponent range around zero, and f is the fractional part of the mantissa (the "1 +" is implicit for normal numbers).
The key design trade-off is always the same: given a fixed total bit budget, how do you split bits between exponent and mantissa?
- More exponent bits = wider dynamic range (larger and smaller numbers representable)
- More mantissa bits = finer precision (more distinct values between any two powers of two)
This trade-off is the single most important concept in this entire series. Every format we examine is a specific answer to this question.
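To make the trade-off concrete, here is a small sketch (not from the original) that derives the maximum value and relative step size from an (exponent, mantissa) bit split, assuming an IEEE-style implicit leading 1 and an all-ones exponent reserved for inf/NaN. Formats like E4M3 that reclaim the all-ones exponent deviate slightly from this rule.

```python
def format_stats(exp_bits, man_bits, bias=None):
    """Max normal value and relative step size for an IEEE-style format
    (implicit leading 1; all-ones exponent reserved for inf/NaN)."""
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1          # standard IEEE bias
    max_exp = (2 ** exp_bits - 2) - bias        # largest usable exponent
    max_val = 2.0 ** max_exp * (2.0 - 2.0 ** -man_bits)
    rel_step = 2.0 ** -man_bits                 # spacing between values near 1.0
    return max_val, rel_step

for name, e, m in [("FP16", 5, 10), ("BF16", 8, 7), ("FP8 E5M2", 5, 2)]:
    mx, step = format_stats(e, m)
    print(f"{name}: max={mx:.6g}, relative step={step:.6g}")
```

Running this reproduces the headline numbers quoted throughout this post: 65,504 for FP16, ~3.39e38 for BF16, and 57,344 for E5M2.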
FP32: IEEE 754 Single Precision
FP32 is the baseline format. It uses 32 bits: 1 sign, 8 exponent, 23 mantissa.
Bit layout (32 bits total):
[S | EEEEEEEE | MMMMMMMMMMMMMMMMMMMMMMM]
1 8 23
- Exponent bias: 127
- Range: ±3.4 × 10^38 (max value)
- Smallest normal: 1.18 × 10^-38
- Precision: ~7.2 decimal digits (2^-24 ≈ 6 × 10^-8 relative error)
FP32 is the “ground truth” format for AI. All other formats are measured against it. Model training historically used FP32 for all parameters, gradients, and optimizer states. The shift away from FP32 began with mixed-precision training (Micikevicius et al., 2018), which demonstrated that FP16 could be used for forward/backward passes while maintaining an FP32 master copy of weights.
Hardware support: Every GPU ever made supports FP32 arithmetic. On modern NVIDIA GPUs, FP32 on CUDA cores serves as the 1x baseline; tensor cores at lower precisions deliver their speedups as multiples of this rate.
```python
import struct
import numpy as np

def fp32_to_bits(value):
    """Convert FP32 float to its 32-bit representation."""
    packed = struct.pack('>f', value)
    bits = int.from_bytes(packed, 'big')
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

def bits_to_fp32(sign, exponent, mantissa):
    """Reconstruct FP32 from sign, exponent, mantissa."""
    bits = (sign << 31) | (exponent << 23) | mantissa
    packed = struct.pack('>I', bits)
    return struct.unpack('>f', packed)[0]

# Example: inspect 3.14
s, e, m = fp32_to_bits(3.14)
print(f"3.14 in FP32: sign={s}, exp={e} (unbiased={e-127}), mantissa=0x{m:06X}")
# sign=0, exp=128 (unbiased=1), mantissa=0x48F5C3
# value = (-1)^0 * 2^1 * (1 + 0x48F5C3/0x800000) = 2 * 1.57 = 3.14
```
BF16: Brain Floating Point
BF16 was designed by Google Brain specifically for deep learning. It uses 16 bits: 1 sign, 8 exponent, 7 mantissa.
Bit layout (16 bits total):
[S | EEEEEEEE | MMMMMMM]
1 8 7
- Exponent bias: 127 (same as FP32)
- Range: ±3.39 × 10^38 (same as FP32)
- Smallest normal: 1.18 × 10^-38 (same as FP32)
- Precision: ~2.15 decimal digits (2^-8 ≈ 3.9 × 10^-3 relative error)
BF16 is simply the upper 16 bits of FP32. Truncating an FP32 value to BF16 requires no rescaling — you just drop the lower 16 mantissa bits. This is why conversion between FP32 and BF16 is trivial.
Why BF16 Won Over FP16 for Training
The critical advantage of BF16 over FP16 is range. BF16 has 8 exponent bits, matching FP32’s range exactly. FP16 has only 5 exponent bits, giving a maximum value of 65504. In deep learning training, gradient values, loss values, and intermediate activations can easily exceed 65504, causing overflow to infinity. With FP16, you must use loss scaling to keep gradients in range. With BF16, you do not — the range matches FP32, so nothing overflows.
The precision loss (7 mantissa bits vs 10 for FP16) rarely matters in practice. Neural network training is inherently noisy due to stochastic gradient descent, and the noise from BF16 rounding is negligible compared to the noise from minibatch sampling.
Converting FP32 to BF16 is a single right-shift: drop the lower 16 bits of the mantissa. No exponent adjustment, no rescaling, no loss scaling needed. This is why BF16 has become the default training format — it is the simplest possible 16-bit format that preserves FP32 range.
```python
import struct

def fp32_to_bf16(value):
    """Convert FP32 to BF16 by truncating lower 16 mantissa bits."""
    fp32_bits = struct.pack('>f', value)
    fp32_int = int.from_bytes(fp32_bits, 'big')
    # Round to nearest even (banker's rounding)
    rounding_bias = 0x7FFF + ((fp32_int >> 16) & 1)
    bf16_int = (fp32_int + rounding_bias) >> 16
    return bf16_int & 0xFFFF

def bf16_to_fp32(bf16_int):
    """Convert BF16 back to FP32 by zero-padding lower 16 bits."""
    fp32_int = bf16_int << 16
    packed = struct.pack('>I', fp32_int)
    return struct.unpack('>f', packed)[0]

# Demonstrate range preservation
val = 1e35
bf16 = fp32_to_bf16(val)
recovered = bf16_to_fp32(bf16)
print(f"Original: {val:.4e}, BF16 recovered: {recovered:.4e}")
# Both ~1e35 -- range preserved, precision lost in lower digits
```
Hardware support: NVIDIA A100 (Ampere) and later, Google TPUv2 and later, AMD MI250 and later, Intel Gaudi.
FP16: IEEE 754 Half Precision
FP16 is the original IEEE half-precision format. It uses 16 bits: 1 sign, 5 exponent, 10 mantissa.
Bit layout (16 bits total):
[S | EEEEE | MMMMMMMMMM]
1 5 10
- Exponent bias: 15
- Range: ±65,504 (max value)
- Smallest normal: 6.10 × 10^-5
- Precision: ~3.31 decimal digits (2^-11 ≈ 4.9 × 10^-4 relative error)
FP16 has better precision than BF16 (10 vs 7 mantissa bits) but dramatically worse range (65504 vs ). The limited range is FP16’s fatal flaw for training. Any value exceeding 65504 becomes infinity. Gradient norms during training frequently exceed this threshold, requiring loss scaling to keep computations in range.
FP16 vs BF16: The Precision-Range Trade-off
| Property | FP16 | BF16 | Winner for Training |
|---|---|---|---|
| Total bits | 16 | 16 | Tie |
| Exponent bits | 5 | 8 | BF16 |
| Mantissa bits | 10 | 7 | FP16 |
| Max value | 65,504 | 3.39e38 | BF16 |
| Min normal | 6.1e-5 | 1.18e-38 | BF16 |
| Precision (decimal digits) | 3.31 | 2.15 | FP16 |
| Needs loss scaling? | Yes | No | BF16 |
| FP32 conversion cost | Rescale | Bit truncation | BF16 |
For inference, FP16 remains useful because the range limitation matters less — weights and activations are fixed, and you can calibrate scale factors to keep everything in range. The extra precision (10 vs 7 mantissa bits) can improve quality in precision-sensitive inference workloads.
Hardware support: FP16 arithmetic on NVIDIA GPUs since Pascal (double-rate FP16 on P100 CUDA cores), with tensor core support from Volta (V100) onward. AMD MI-series. Apple M-series. Qualcomm Adreno.
```python
import struct

def fp32_to_fp16(value):
    """Convert FP32 to FP16 with proper exponent rebias."""
    fp32_bits = struct.pack('>f', value)
    fp32_int = int.from_bytes(fp32_bits, 'big')
    sign = (fp32_int >> 31) & 1
    exp = (fp32_int >> 23) & 0xFF
    frac = fp32_int & 0x7FFFFF
    # Handle special cases
    if exp == 0xFF:  # inf or NaN
        fp16_exp = 0x1F
        fp16_frac = 0 if frac == 0 else 0x200
    elif exp == 0:  # zero or subnormal
        fp16_exp = 0
        fp16_frac = 0
    else:
        new_exp = exp - 127 + 15  # Rebias from 127 to 15
        if new_exp >= 0x1F:  # Overflow to infinity
            fp16_exp = 0x1F
            fp16_frac = 0
        elif new_exp <= 0:  # Underflow: flush to zero (subnormals omitted for brevity)
            fp16_exp = 0
            fp16_frac = 0
        else:
            fp16_exp = new_exp
            fp16_frac = frac >> 13  # Keep top 10 mantissa bits (truncation)
    return (sign << 15) | (fp16_exp << 10) | fp16_frac

# Demonstrate the range limitation: overflow is silent, not an exception
fp16 = fp32_to_fp16(100000.0)
exp = (fp16 >> 10) & 0x1F
print(f"100000.0 -> FP16 exp={exp:#x}")  # exp=0x1f = infinity
```
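The overflow above can be avoided at inference time by calibrating a scale factor before casting. A hedged sketch of the idea (the helper name `to_fp16_safe` and the 0.9 safety margin are illustrative, not a standard API):

```python
import numpy as np

def to_fp16_safe(x, margin=0.9):
    """Cast to FP16 after rescaling so the max |value| fits in range.
    Returns (fp16 tensor, scale) -- multiply back by scale to recover."""
    fp16_max = float(np.finfo(np.float16).max)   # 65504
    amax = float(np.max(np.abs(x)))
    scale = amax / (margin * fp16_max) if amax > margin * fp16_max else 1.0
    return (x / scale).astype(np.float16), scale

x = np.array([1e5, -2e5, 3.0], dtype=np.float32)
naive = x.astype(np.float16)       # first two entries overflow to +/-inf
safe, scale = to_fp16_safe(x)      # all entries stay finite
print(naive)
print(safe * scale)                # values recovered approximately
```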
FP8 E4M3: Precision-Optimized 8-Bit Float
FP8 E4M3 uses 8 bits: 1 sign, 4 exponent, 3 mantissa.
Bit layout (8 bits total):
[S | EEEE | MMM]
1 4 3
- Exponent bias: 7
- Range: ±448 (max value; the all-ones exponent-and-mantissa pattern encodes NaN)
- Smallest normal: 2^-6 = 0.015625
- Precision: ~1.2 decimal digits (2^-4 ≈ 6.3 × 10^-2 relative error)
- Unique representable values: 253 (excluding NaN; ±0 counted once)
E4M3 sacrifices range for precision compared to E5M2. The maximum value of 448 is modest, but 3 mantissa bits provide 8 distinct values between any two consecutive powers of two. This makes E4M3 the preferred format for forward pass computations where weight and activation values have been scaled to fit within the representable range, and you want the best possible precision for accumulation accuracy.
Unlike IEEE floats, FP8 E4M3 (as defined in the OCP spec) has no infinity. The bit patterns with exponent all 1s, which IEEE reserves for infinity and NaN, instead encode ordinary values from 256 up to the maximum of 448. Only the single pattern S1111_111 (exponent=15, mantissa=7) encodes NaN. Reclaiming these patterns gives E4M3 extra usable values at the top of its range compared to a strict IEEE interpretation.
```python
import math

def fp32_to_e4m3(value):
    """Convert FP32 to FP8 E4M3 format."""
    # Clamp to representable range
    max_val = 448.0
    value = max(min(value, max_val), -max_val)
    if value == 0.0:
        return 0x00
    sign = 0
    if value < 0:
        sign = 1
        value = -value
    # Find exponent: value = 2^exp * mantissa, 1.0 <= mantissa < 2.0
    exp = math.floor(math.log2(value))
    exp_biased = exp + 7  # bias = 7
    if exp_biased <= 0:
        # Subnormal: exponent stored as 0, mantissa = value / 2^(-6)
        exp_biased = 0
        mantissa_bits = int(round(value / (2 ** -6) * (1 << 3)))
        if mantissa_bits == 8:  # Rounded up into the smallest normal
            return (sign << 7) | (1 << 3)
    elif exp_biased >= 15:
        # Clamp to max normal (not infinity in E4M3)
        exp_biased = 15
        mantissa_bits = 0x06  # 448 = 2^8 * 1.75
    else:
        mantissa_frac = value / (2 ** exp) - 1.0  # Remove implicit 1
        mantissa_bits = int(round(mantissa_frac * (1 << 3)))
        if mantissa_bits == 8:  # Rounding overflow: carry into exponent
            mantissa_bits = 0
            exp_biased += 1
    return (sign << 7) | (exp_biased << 3) | mantissa_bits

def e4m3_to_fp32(byte_val):
    """Convert FP8 E4M3 back to FP32."""
    sign = (byte_val >> 7) & 1
    exp = (byte_val >> 3) & 0x0F
    mantissa = byte_val & 0x07
    if exp == 0x0F and mantissa == 0x07:
        return float('nan')
    if exp == 0:
        # Subnormal
        value = (mantissa / 8.0) * (2 ** -6)
    else:
        value = (1.0 + mantissa / 8.0) * (2 ** (exp - 7))
    return -value if sign else value

# Enumerate some values
print("FP8 E4M3 sample values:")
for bits in [0x00, 0x01, 0x38, 0x3F, 0x7E, 0x7F]:
    val = e4m3_to_fp32(bits)
    print(f"  0x{bits:02X} -> {val}")
```
FP8 E5M2: Range-Optimized 8-Bit Float
FP8 E5M2 uses 8 bits: 1 sign, 5 exponent, 2 mantissa.
Bit layout (8 bits total):
[S | EEEEE | MM]
1 5 2
- Exponent bias: 15
- Range: ±57,344 (max finite value)
- Smallest normal: 2^-14 ≈ 6.1 × 10^-5
- Precision: ~0.9 decimal digits (2^-3 = 12.5% relative error)
- Unique representable values: 247 (excluding inf/NaN; ±0 counted once)
E5M2 follows IEEE conventions more closely: it has infinity and NaN encodings. The 5 exponent bits give the same dynamic range as FP16. The 2 mantissa bits mean only 4 distinct values exist between consecutive powers of two — coarse precision, but the wide range is critical for backward pass computations where gradient magnitudes vary enormously.
E4M3 vs E5M2: The Precision-Range Trade-off at 8 Bits
FP8 E4M3 vs E5M2 Comparison
| Property | E4M3 | E5M2 |
|---|---|---|
| Exponent bits | 4 | 5 |
| Mantissa bits | 3 | 2 |
| Max value | 448 | 57,344 |
| Min normal | 0.015625 | 6.1e-5 |
| Values per power-of-2 | 8 | 4 |
| Has infinity? | No | Yes |
| Total unique values | 253 | 247 |
| Primary use | Forward pass | Backward pass |
The recommended practice (established by Micikevicius et al., 2022) is to use E4M3 for the forward pass and E5M2 for the backward pass. Forward pass values (weights and activations) are relatively well-bounded and benefit from precision. Backward pass values (gradients) have extreme dynamic range and benefit from the wider exponent.
```python
import math

def fp32_to_e5m2(value):
    """Convert FP32 to FP8 E5M2 format."""
    max_val = 57344.0
    if abs(value) > max_val:
        # Overflow: return infinity
        sign = 1 if value < 0 else 0
        return (sign << 7) | 0x7C  # exp=31, mantissa=0
    if value == 0.0:
        return 0x00
    sign = 0
    if value < 0:
        sign = 1
        value = -value
    exp = math.floor(math.log2(value))
    exp_biased = exp + 15
    if exp_biased <= 0:
        exp_biased = 0
        mantissa_bits = int(round(value / (2 ** -14) * 4))
        if mantissa_bits == 4:  # Rounded up into the smallest normal
            return (sign << 7) | (1 << 2)
    elif exp_biased >= 31:
        return (sign << 7) | 0x7C  # infinity
    else:
        mantissa_frac = value / (2 ** exp) - 1.0
        mantissa_bits = int(round(mantissa_frac * 4))
        if mantissa_bits == 4:  # Rounding overflow: carry into exponent
            mantissa_bits = 0
            exp_biased += 1
    return (sign << 7) | (exp_biased << 2) | mantissa_bits

def e5m2_to_fp32(byte_val):
    """Convert FP8 E5M2 back to FP32."""
    sign = (byte_val >> 7) & 1
    exp = (byte_val >> 2) & 0x1F
    mantissa = byte_val & 0x03
    if exp == 0x1F:
        if mantissa == 0:
            return float('-inf') if sign else float('inf')
        return float('nan')
    if exp == 0:
        value = (mantissa / 4.0) * (2 ** -14)
    else:
        value = (1.0 + mantissa / 4.0) * (2 ** (exp - 15))
    return -value if sign else value

# Compare representable values around 1.0
print("Values near 1.0:")
print(f"  E4M3: {[e4m3_to_fp32(b) for b in range(0x38, 0x40)]}")
print(f"  E5M2: {[e5m2_to_fp32(b) for b in range(0x3C, 0x40)]}")
```
NVFP4: NVIDIA’s 4-Bit Floating Point
NVFP4 is NVIDIA’s proprietary 4-bit floating point format, introduced with the Blackwell architecture (B100/B200). It uses 4 bits: 1 sign, 2 exponent, 1 mantissa.
Bit layout (4 bits total):
[S | EE | M]
1 2 1
- Exponent bias: 1
- Range: ±6.0 (max value)
- Smallest normal: 1.0 (single subnormal magnitude: 0.5)
- Precision: 1 mantissa bit means only 2 values per power-of-two interval
- Unique representable values: 14 (excluding zero)
With only 4 bits, NVFP4 can represent very few distinct values. The entire set of non-negative NVFP4 values is:
0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0
This is strikingly coarse. To make NVFP4 usable, every block of values shares a scaling factor stored in a higher-precision format. A typical configuration uses one FP8 E4M3 scale factor per 16 or 32 NVFP4 elements:

value ≈ scale_block × v_fp4

where scale_block is the block's shared scale and v_fp4 is the 4-bit element value.
This block scaling is what makes 4-bit formats viable — the scale factor shifts the representable range to wherever the actual data lives, and the 4-bit values provide relative precision within that range.
```python
import numpy as np

def encode_nvfp4(value):
    """Encode a single value to NVFP4 (4 bits)."""
    # NVFP4 representable positive values (index = E2M1 bit encoding)
    nvfp4_values = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    sign = 0
    if value < 0:
        sign = 1
        value = -value
    # Find nearest representable value
    best_idx = min(range(len(nvfp4_values)),
                   key=lambda i: abs(value - nvfp4_values[i]))
    return (sign << 3) | best_idx

def quantize_block_nvfp4(block, block_size=16):
    """Quantize a block of FP32 values to NVFP4 with a shared scale."""
    block = np.array(block, dtype=np.float32)
    # Per-block scale: max absolute value / max NVFP4 value
    amax = np.max(np.abs(block))
    max_nvfp4 = 6.0
    scale = amax / max_nvfp4 if amax > 0 else 1.0
    # Quantize each element against the shared scale
    codes = [encode_nvfp4(float(val)) for val in block / scale]
    return codes, scale

# Example
block = np.random.randn(16).astype(np.float32) * 0.1
codes, scale = quantize_block_nvfp4(block)
print(f"Scale: {scale:.6f}")
print(f"Codes: {[f'0x{c:X}' for c in codes]}")
```
MXFP4: Microscaling FP4
MXFP4 is defined by the Open Compute Project (OCP) Microscaling Formats specification. It is similar in spirit to NVFP4 but uses a standardized shared exponent (microscaling) approach.
Bit layout per element (4 bits):
[S | EE | M]
1 2 1
Shared exponent per block (8 bits):
[EEEEEEEE]
8
Each block of 32 elements shares a single 8-bit exponent. Each 4-bit element encodes a sign bit plus a 3-bit E2M1 magnitude (2 exponent bits + 1 mantissa bit). The shared exponent provides the coarse scale, and the per-element bits provide fine positioning within that scale.
The key difference from NVFP4 with a separate scale factor: in MXFP4, the shared exponent is part of the format specification itself. The hardware knows the block size and exponent format, enabling direct tensor core consumption without a separate dequantization step.
MXFP4 and NVFP4 both achieve 4-bit inference, but MXFP4 is an open standard (OCP) while NVFP4 is NVIDIA-proprietary. On Blackwell, both formats run at the same tensor core throughput. The practical difference is in the scaling granularity: MXFP4 mandates a specific block structure (32 elements per shared exponent), while NVFP4 allows flexible block sizes.
```python
import math
import numpy as np

def quantize_block_mxfp4(block):
    """Quantize a 32-element block to MXFP4 format.

    Returns (shared_exponent, element_codes) where each code is 4 bits.
    """
    block = np.array(block, dtype=np.float32)
    # Compute shared exponent from max absolute value
    amax = np.max(np.abs(block))
    if amax == 0:
        return 0, [0] * len(block)
    # Shared exponent: floor(log2(amax)), stored with bias 127
    shared_exp = int(math.floor(math.log2(amax)))
    shared_exp = max(0, min(255, shared_exp + 127))  # Clamp to 8-bit unsigned
    # Scale block by shared exponent
    scale = 2.0 ** (shared_exp - 127)
    scaled = block / scale
    # Quantize each element to 4-bit: sign(1) + exp(2) + mantissa(1)
    mxfp4_table = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    codes = []
    for val in scaled:
        sign = 0
        if val < 0:
            sign = 1
            val = -val
        # Round to nearest MXFP4 value
        best_idx = min(range(len(mxfp4_table)),
                       key=lambda i: abs(val - mxfp4_table[i]))
        codes.append((sign << 3) | best_idx)
    return shared_exp, codes

def dequantize_block_mxfp4(shared_exp, codes):
    """Dequantize MXFP4 block back to FP32."""
    mxfp4_table = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    scale = 2.0 ** (shared_exp - 127)
    values = []
    for code in codes:
        sign = (code >> 3) & 1
        idx = code & 0x07
        val = mxfp4_table[idx] * scale
        values.append(-val if sign else val)
    return values

# Round-trip test
block = np.random.randn(32).astype(np.float32)
exp, codes = quantize_block_mxfp4(block)
recovered = dequantize_block_mxfp4(exp, codes)
mse = np.mean((block - np.array(recovered)) ** 2)
print(f"Shared exponent: {exp}, MSE: {mse:.6f}")
```
INT8: 8-Bit Integer
INT8 is the simplest quantized format. It uses 8 bits as a signed integer with no exponent or mantissa — just a linear mapping from integers to real values via a scale factor and zero point.
Bit layout (8 bits):
[SSSSSSSS] (two's complement signed integer)
Range: -128 to 127
The mapping from INT8 to real values is:

real_value = scale × (int8_value - zero_point)
INT8 quantization is uniform: the spacing between consecutive representable values is constant (equal to the scale factor). This contrasts with floating-point formats where spacing increases with magnitude.
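The contrast is easy to see with NumPy's `np.spacing`, which returns the gap to the next representable float (a quick illustrative sketch; the 0.05 INT8 scale is a hypothetical example value):

```python
import numpy as np

# FP32 spacing grows with magnitude...
for v in [0.001, 1.0, 1000.0]:
    print(f"FP32 spacing near {v}: {np.spacing(np.float32(v)):.3e}")

# ...while INT8 spacing is one constant step everywhere
scale = 0.05  # hypothetical INT8 scale factor
print(f"INT8 spacing everywhere: {scale}")
```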
Uniform quantization is poorly suited to distributions with outliers. If your data has a few values at 100x the typical magnitude, the scale factor must accommodate them, leaving only ~1% of the integer range for the typical values. This is the fundamental motivation behind SmoothQuant and other outlier-handling techniques covered in Part 3.
```python
import numpy as np

def quantize_symmetric_int8(tensor, per_channel=True):
    """Symmetric INT8 quantization (zero_point = 0)."""
    tensor = np.array(tensor, dtype=np.float32)
    if per_channel:
        # Scale per output channel (axis 0)
        amax = np.max(np.abs(tensor), axis=1, keepdims=True)
    else:
        # Single scale for entire tensor
        amax = np.max(np.abs(tensor))
    scale = amax / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # Avoid division by zero
    quantized = np.clip(np.round(tensor / scale), -128, 127).astype(np.int8)
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Dequantize INT8 back to FP32."""
    return quantized.astype(np.float32) * scale

# Example: quantize a weight matrix
weight = np.random.randn(64, 256).astype(np.float32) * 0.02
q_weight, scale = quantize_symmetric_int8(weight, per_channel=True)
recon = dequantize_int8(q_weight, scale)
mse = np.mean((weight - recon) ** 2)
print(f"INT8 quantization MSE: {mse:.8f}")
print(f"Scale shape: {scale.shape}")  # (64, 1) for per-channel
```
Hardware support: NVIDIA Turing (T4) and later for INT8 tensor cores. All GPUs support INT8 via CUDA cores. AMD MI-series. Intel Gaudi. Qualcomm Hexagon DSP.
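The outlier sensitivity described above is easy to demonstrate: injecting a single large value into an otherwise well-behaved tensor inflates the per-tensor scale, and the round-trip error grows with it. A small sketch (illustrative values, not from the original):

```python
import numpy as np

def int8_mse(x):
    """Round-trip MSE of symmetric per-tensor INT8 quantization."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127)
    return float(np.mean((x - q * scale) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
x_outlier = x.copy()
x_outlier[0] = 100.0  # one value far above the typical magnitude

print(f"MSE without outlier: {int8_mse(x):.2e}")
print(f"MSE with outlier:    {int8_mse(x_outlier):.2e}")
# One outlier forces a much larger scale, so error on the other 4095
# values grows by orders of magnitude
```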
INT4: 4-Bit Integer
INT4 uses 4 bits as either a signed integer (-8 to 7) or unsigned integer (0 to 15). In practice, signed INT4 with symmetric quantization is most common for weight quantization.
Bit layout (4 bits):
[SSSS] (signed: -8 to 7, unsigned: 0 to 15)
With only 16 distinct values, INT4 is extremely coarse. Per-group quantization (one scale factor per group of 32-128 weights) is essential to maintain quality.
w ≈ scale[g] × q, where g indexes the group that the element belongs to.
```python
import numpy as np

def quantize_int4_grouped(tensor, group_size=128):
    """INT4 quantization with per-group scaling."""
    tensor = np.array(tensor, dtype=np.float32)
    rows, cols = tensor.shape
    # Pad columns to multiple of group_size
    pad = (group_size - cols % group_size) % group_size
    if pad > 0:
        tensor = np.pad(tensor, ((0, 0), (0, pad)))
    cols_padded = tensor.shape[1]
    num_groups = cols_padded // group_size
    # Reshape to (rows, num_groups, group_size)
    grouped = tensor.reshape(rows, num_groups, group_size)
    # Per-group scale
    amax = np.max(np.abs(grouped), axis=2, keepdims=True)
    scale = amax / 7.0  # Symmetric: map to [-7, 7]
    scale = np.where(scale == 0, 1.0, scale)
    quantized = np.clip(np.round(grouped / scale), -8, 7).astype(np.int8)
    return quantized, scale, cols

def dequantize_int4_grouped(quantized, scale, original_cols):
    """Dequantize INT4 groups back to FP32."""
    recon = quantized.astype(np.float32) * scale
    rows = recon.shape[0]
    recon = recon.reshape(rows, -1)[:, :original_cols]
    return recon

# Example: group_size=128
weight = np.random.randn(256, 1024).astype(np.float32) * 0.01
q, s, orig_cols = quantize_int4_grouped(weight, group_size=128)
recon = dequantize_int4_grouped(q, s, orig_cols)
mse = np.mean((weight - recon) ** 2)
print(f"INT4 g128 MSE: {mse:.8f}")
```
GPU Hardware Support Matrix
Number Format Hardware Support by GPU Architecture
| Format | V100 (Volta) | T4 (Turing) | A100 (Ampere) | H100 (Hopper) | B200 (Blackwell) |
|---|---|---|---|---|---|
| FP32 | Yes | Yes | Yes | Yes | Yes |
| FP16 Tensor | Yes | Yes | Yes | Yes | Yes |
| BF16 Tensor | No | No | Yes | Yes | Yes |
| INT8 Tensor | No | Yes | Yes | Yes | Yes |
| FP8 E4M3 | No | No | No | Yes | Yes |
| FP8 E5M2 | No | No | No | Yes | Yes |
| INT4 Tensor | No | Yes | Yes | No | No |
| NVFP4 | No | No | No | No | Yes |
| MXFP4 | No | No | No | No | Yes |
Peak Tensor Core TOPS by Format (Single GPU)

(Chart: peak tensor core TOPS per format.)

Comprehensive Format Comparison
All AI Number Formats at a Glance
| Format | Bits | Sign | Exponent | Mantissa | Max Value | Precision (decimal) | Values per 2x interval |
|---|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | 3.4e38 | 7.2 | 8,388,608 |
| BF16 | 16 | 1 | 8 | 7 | 3.4e38 | 2.2 | 128 |
| FP16 | 16 | 1 | 5 | 10 | 65,504 | 3.3 | 1,024 |
| FP8 E4M3 | 8 | 1 | 4 | 3 | 448 | 1.2 | 8 |
| FP8 E5M2 | 8 | 1 | 5 | 2 | 57,344 | 0.9 | 4 |
| NVFP4 | 4 | 1 | 2 | 1 | 6.0 | N/A | 2 |
| MXFP4 | 4+shared | 1 | 2 | 1 | block-scaled | N/A | 2 |
| INT8 | 8 | N/A | N/A | N/A | 127 (uniform) | N/A | N/A (uniform) |
| INT4 | 4 | N/A | N/A | N/A | 7 (uniform) | N/A | N/A (uniform) |
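One practical way to read the table: weight memory scales linearly with bit width. A quick back-of-the-envelope sketch for a hypothetical 70B-parameter model (weights only; ignores block-scale overhead, activations, and KV cache):

```python
# Weights-only memory footprint of 70e9 parameters at each weight width
formats = {"FP32": 32, "BF16/FP16": 16, "FP8/INT8": 8, "INT4/FP4": 4}
n_params = 70e9
for name, bits in formats.items():
    gib = n_params * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name:>10}: {gib:7.1f} GiB")
```

At FP32 the weights alone exceed the memory of any single current GPU; at 4 bits they fit comfortably on one high-end accelerator.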
Putting It All Together: Universal Format Converter
Here is a complete Python module that converts between all formats:
```python
import numpy as np

class NumberFormatConverter:
    """Convert between AI number formats."""

    @staticmethod
    def fp32_to_bf16_array(arr):
        """Vectorized FP32 to BF16."""
        arr = np.asarray(arr, dtype=np.float32)
        int32 = arr.view(np.uint32)
        # Round to nearest even
        rounding = np.uint32(0x7FFF) + ((int32 >> 16) & np.uint32(1))
        bf16 = ((int32 + rounding) >> 16).astype(np.uint16)
        return bf16

    @staticmethod
    def bf16_to_fp32_array(bf16_arr):
        """Vectorized BF16 to FP32."""
        bf16_arr = np.asarray(bf16_arr, dtype=np.uint16)
        fp32_int = bf16_arr.astype(np.uint32) << 16
        return fp32_int.view(np.float32)

    @staticmethod
    def quantize_e4m3(arr, block_size=32):
        """Quantize FP32 array to E4M3 with per-block scaling."""
        arr = np.asarray(arr, dtype=np.float32).flatten()
        n = len(arr)
        pad = (block_size - n % block_size) % block_size
        if pad:
            arr = np.concatenate([arr, np.zeros(pad, dtype=np.float32)])
        num_blocks = len(arr) // block_size
        blocks = arr.reshape(num_blocks, block_size)
        amax = np.max(np.abs(blocks), axis=1, keepdims=True)
        scales = amax / 448.0
        scales = np.where(scales == 0, 1.0, scales)
        scaled = blocks / scales
        # Build the full table of non-negative E4M3 magnitudes so that the
        # table index equals the actual bit encoding (0x00-0x7E; 0x7F is NaN)
        def decode(code):
            exp, mant = (code >> 3) & 0x0F, code & 0x07
            if exp == 0:
                return (mant / 8.0) * 2.0 ** -6           # subnormal
            return (1.0 + mant / 8.0) * 2.0 ** (exp - 7)  # normal
        e4m3_pos = np.array([decode(c) for c in range(0x7F)], dtype=np.float32)
        codes = np.zeros_like(scaled, dtype=np.uint8)
        for i in range(num_blocks):
            for j in range(block_size):
                val = scaled[i, j]
                sign = 0
                if val < 0:
                    sign = 1
                    val = -val
                idx = int(np.argmin(np.abs(e4m3_pos - val)))
                codes[i, j] = (sign << 7) | idx
        return codes, scales, n

    @staticmethod
    def measure_error(original, reconstructed):
        """Compute quantization error metrics."""
        mse = np.mean((original - reconstructed) ** 2)
        rmse = np.sqrt(mse)
        max_err = np.max(np.abs(original - reconstructed))
        snr = 10 * np.log10(np.mean(original ** 2) / (mse + 1e-10))
        return {
            'mse': float(mse),
            'rmse': float(rmse),
            'max_error': float(max_err),
            'snr_db': float(snr),
        }

# Compare quantization error across formats
data = np.random.randn(1024).astype(np.float32) * 0.1

# BF16
bf16 = NumberFormatConverter.fp32_to_bf16_array(data)
bf16_recon = NumberFormatConverter.bf16_to_fp32_array(bf16)
print("BF16:", NumberFormatConverter.measure_error(data, bf16_recon))

# INT8
scale = np.max(np.abs(data)) / 127.0
int8_q = np.clip(np.round(data / scale), -128, 127).astype(np.int8)
int8_recon = int8_q.astype(np.float32) * scale
print("INT8:", NumberFormatConverter.measure_error(data, int8_recon))

# INT4
scale4 = np.max(np.abs(data)) / 7.0
int4_q = np.clip(np.round(data / scale4), -8, 7).astype(np.int8)
int4_recon = int4_q.astype(np.float32) * scale4
print("INT4:", NumberFormatConverter.measure_error(data, int4_recon))
```
Choosing the Right Format
The decision tree for format selection in practice:
- Training master weights and optimizer states: FP32. No compromise.
- Training forward/backward compute: BF16 by default. FP8 E4M3/E5M2 on Hopper and later if training is stable at this precision (see Part 4).
- Inference weights, quality-sensitive: FP8 E4M3 or INT8. Minimal quality loss, 2x memory savings over FP16.
- Inference weights, throughput-optimized: INT4 with per-group scaling (GPTQ/AWQ, see Part 2). 4x memory savings.
- Inference weights, maximum throughput on Blackwell: NVFP4 or MXFP4. 2x throughput over FP8 tensor cores.
- KV cache: FP8 for minimal quality loss, INT4 for maximum memory savings (see Part 6).
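The decision tree above can be condensed into a small lookup helper; the function and its argument names below are illustrative, not from any library:

```python
def pick_format(role, arch="hopper", quality_sensitive=True):
    """Illustrative format picker following the decision tree above.
    role: 'master_weights' | 'train_compute' | 'inference_weights' | 'kv_cache'
    arch: 'ampere' | 'hopper' | 'blackwell' (hypothetical labels)"""
    if role == "master_weights":
        return "FP32"
    if role == "train_compute":
        # FP8 only where hardware supports it and training is stable
        return "FP8 E4M3/E5M2" if arch in ("hopper", "blackwell") else "BF16"
    if role == "inference_weights":
        if arch == "blackwell" and not quality_sensitive:
            return "NVFP4/MXFP4"
        return "FP8 E4M3 or INT8" if quality_sensitive else "INT4 (per-group)"
    if role == "kv_cache":
        return "FP8" if quality_sensitive else "INT4"
    raise ValueError(f"unknown role: {role}")

print(pick_format("inference_weights", arch="blackwell", quality_sensitive=False))
```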
The rest of this series will show you exactly how to implement each of these, starting with weight quantization algorithms in the next post.