Tensor cores are the specialized matrix multiply-accumulate units on NVIDIA GPUs. Each one performs a small matrix multiply-accumulate (e.g., a 4x4x4 FP16 tile) in a single cycle, achieving 4-16x higher throughput than standard CUDA cores for the same silicon area and power. Every generation since Volta (2017) has expanded tensor core capabilities: new data types, larger matrix tiles, higher throughput, and new programming interfaces.
This post traces the precise changes at each generation — what the tensor core physically computes, how the programming model evolved from WMMA to MMA to WGMMA, and what each change means for kernel performance. The throughput progression from Volta (125 TFLOPS FP16) to Blackwell (9000 TFLOPS FP4) represents a 72x increase in 8 years, driven almost entirely by tensor core improvements.
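The headline 72x figure can be sanity-checked directly from the peak dense throughputs quoted in the tables later in this post:

```python
# Peak dense tensor core throughput at each generation's lowest-precision
# floating-point format (TFLOPS), as quoted in the spec tables below.
peak = {
    "Volta (FP16)": 125,
    "Ampere (FP16)": 312,
    "Hopper (FP8)": 1979,
    "Blackwell (FP4)": 9000,
}
baseline = peak["Volta (FP16)"]
for gen, tflops in peak.items():
    print(f"{gen}: {tflops} TFLOPS ({tflops / baseline:.1f}x Volta)")
# Blackwell FP4 vs Volta FP16: 9000 / 125 = 72x
```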
Volta (2017): The First Tensor Cores
Architecture
Volta (V100, SM version 7.0) introduced the first tensor cores. Each V100 SM contains 8 tensor cores, and each tensor core performs one 4x4x4 FP16 matrix multiply-accumulate per cycle.
At the programming level, the smallest addressable operation is a 16x16x16 matrix multiply across a warp (32 threads), using the WMMA (Warp Matrix Multiply Accumulate) API.
#include <mma.h>
using namespace nvcuda;
// Volta WMMA: 16x16x16 FP16 matrix multiply
__global__ void volta_tensor_core_gemm(
const half* A, const half* B, float* C,
int M, int N, int K
) {
// Declare fragments
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> frag_a;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> frag_b;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag_c;
// Initialize accumulator
wmma::fill_fragment(frag_c, 0.0f);
// Load A and B fragments
wmma::load_matrix_sync(frag_a, A, K); // A: [16, K], stride=K
wmma::load_matrix_sync(frag_b, B, K); // B: [K, 16], stride=K
// Tensor core multiply-accumulate
wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);
// Store result
wmma::store_matrix_sync(C, frag_c, N, wmma::mem_row_major);
}
Volta Tensor Core Specifications
Volta (V100 SXM2) Tensor Core Specifications
| Specification | Value |
|---|---|
| Tensor cores per SM | 8 |
| SMs per GPU | 80 |
| Total tensor cores | 640 |
| Operation per core per cycle | 4x4x4 FP16 MMA |
| FP16 throughput (with FP32 accumulate) | 125 TFLOPS |
| Supported formats | FP16 only |
| Accumulator format | FP16 or FP32 |
| API | WMMA (warp-level, 32 threads) |
| Minimum tile size | 16x16x16 |
| Clock (boost) | 1530 MHz |
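The 125 TFLOPS figure follows directly from the table above: a 4x4x4 MMA is 64 fused multiply-adds (128 FLOPs) per tensor core per cycle.

```python
# Derive Volta's peak FP16 tensor throughput from the spec table.
tensor_cores = 640           # 8 per SM x 80 SMs
fmas_per_cycle = 4 * 4 * 4   # one 4x4x4 MMA = 64 fused multiply-adds
flops_per_cycle = fmas_per_cycle * 2  # each FMA counts as 1 mul + 1 add
boost_clock_hz = 1530e6
peak_tflops = tensor_cores * flops_per_cycle * boost_clock_hz / 1e12
print(f"Volta peak FP16 tensor throughput: {peak_tflops:.1f} TFLOPS")
# 640 * 128 * 1.53e9 = ~125.3 TFLOPS, matching the quoted 125 TFLOPS
```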
Turing (2018): Integer Tensor Cores
What Changed
Turing (T4, RTX 2080, SM 7.5) added INT8 and INT4 support to tensor cores. This was the first hardware support for quantized inference on tensor cores.
// Turing INT8 WMMA: 8x32x16
// Different tile sizes for integer types
wmma::fragment<wmma::matrix_a, 8, 32, 16, signed char, wmma::row_major> frag_a;
wmma::fragment<wmma::matrix_b, 8, 32, 16, signed char, wmma::col_major> frag_b;
wmma::fragment<wmma::accumulator, 8, 32, 16, int> frag_c;
// Accumulator is INT32 (to avoid overflow)
wmma::fill_fragment(frag_c, 0);
wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);
New Capabilities
Turing (T4) vs Volta (V100) Tensor Cores
| Capability | Volta (V100) | Turing (T4) |
|---|---|---|
| FP16 TFLOPS | 125 | 65 |
| INT8 TOPS | --- | 130 |
| INT4 TOPS | --- | 260 |
| INT1 (Binary) TOPS | --- | 520 |
| Tensor cores per SM | 8 | 8 |
| SMs | 80 | 40 |
| TDP | 300W | 70W |
| INT8 TOPS/watt | --- | 1.86 |
Before Turing, INT8 inference required dequantizing to FP16 and running through FP16 tensor cores. Turing’s INT8 tensor cores directly multiply INT8 inputs with INT32 accumulation, eliminating the dequantization overhead and providing 2x throughput over FP16 for the same tensor core count.
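The need for a wide accumulator is easy to verify numerically. A minimal sketch of the worst-case INT8 dot product (K = 4096 is an illustrative reduction dimension):

```python
# Worst-case INT8 dot product: every product is 127 * 127 = 16129,
# which already exceeds the INT8 range [-128, 127]. Accumulating over
# a typical K dimension pushes the sum far beyond even INT16.
K = 4096  # illustrative reduction dimension of one GEMM
worst_case_sum = 127 * 127 * K
print(f"Worst-case INT8 dot product over K={K}: {worst_case_sum:,}")
print(f"INT16 max: {2**15 - 1:,}  -> overflows")
print(f"INT32 max: {2**31 - 1:,}  -> safe")
```

This is why the Turing WMMA fragments above pair `signed char` inputs with an `int` (INT32) accumulator.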
Ampere (2020): TF32, BF16, and Structured Sparsity
Three Major Additions
Ampere (A100, SM 8.0) introduced three capabilities:
- TF32 (TensorFloat-32): A 19-bit format (1 sign bit, 8-bit exponent, 10-bit mantissa) that provides FP32 range with reduced precision. TF32 tensor cores accept FP32 inputs and internally truncate them to TF32; no code change is required. This gives existing FP32 code an 8x speedup on tensor cores (156 vs 19.5 TFLOPS on A100).
- BF16 (bfloat16): 16-bit floating point with an 8-bit exponent (same range as FP32) and a 7-bit mantissa. Preferred for training because its range matches FP32, avoiding the overflow issues that plague FP16.
- Structured sparsity (2:4): Hardware support for 50%-sparse matrices in which exactly 2 of every 4 consecutive elements are zero. The tensor core skips the zero elements, doubling throughput.
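The three formats differ mainly in how many mantissa bits survive. A minimal pure-Python sketch that emulates the truncation (masking off low mantissa bits of an FP32 value) shows why BF16 keeps FP32's range while FP16 does not:

```python
import struct

def truncate_fp32(x: float, mantissa_bits: int) -> float:
    """Zero out the low (23 - mantissa_bits) mantissa bits of an FP32 value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = 0xFFFFFFFF << (23 - mantissa_bits)
    (out,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return out

x = 1.0 + 1 / 1024 + 1 / 4096  # needs >10 mantissa bits to hold exactly
print(f"FP32: {x!r}")
print(f"TF32 (10 mantissa bits): {truncate_fp32(x, 10)!r}")
print(f"BF16 ( 7 mantissa bits): {truncate_fp32(x, 7)!r}")

# Range: FP16's max finite value is 65504; BF16 and TF32 keep FP32's
# 8-bit exponent, so 1e5 is representable where FP16 overflows to inf.
print(f"1e5 in BF16: {truncate_fp32(1e5, 7)!r}  (FP16 would overflow)")
```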
// Ampere MMA (lower-level warp-level API; replaces WMMA for power users)
// m16n8k16 FP16 MMA instruction. Each of the 32 threads holds a fragment:
// a0-a3 and b0-b1 are 32-bit registers packing two FP16 values each;
// c0-c3 and d0-d3 are FP32 accumulator elements.
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
"{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
: "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
: "r"(a0), "r"(a1), "r"(a2), "r"(a3),
"r"(b0), "r"(b1),
"f"(c0), "f"(c1), "f"(c2), "f"(c3)
);
// TF32: same code as FP32, but with TF32 enabled on the tensor cores
// torch.backends.cuda.matmul.allow_tf32 = True
// (on by default in early PyTorch releases; off by default for matmul
// since PyTorch 1.12)
// No code change needed -- the tensor core silently truncates FP32 to TF32
Structured Sparsity
def structured_sparsity_explanation():
"""Explain Ampere's 2:4 structured sparsity."""
# 2:4 pattern: in every group of 4 consecutive elements,
# exactly 2 are zero.
# Example:
# Dense: [1.2, 0.0, 3.4, 0.0, 5.6, 0.0, 7.8, 0.0]
# This is 2:4 sparse: pairs (1.2, 3.4), (5.6, 7.8)
#
# The hardware stores:
# - Values: [1.2, 3.4, 5.6, 7.8] (50% of original)
# - Index: [0, 2, 0, 2] (2-bit index per value)
#
# The tensor core uses the index to reconstruct the
# sparse multiplication without materializing zeros.
dense_params = 4096 * 4096
sparse_params = dense_params // 2 # 50% pruned
metadata_bits = sparse_params * 2 # 2-bit index per value
print(f"Dense parameters: {dense_params:,}")
print(f"Sparse parameters: {sparse_params:,}")
print(f"Metadata: {metadata_bits / 8 / 1e6:.1f} MB "
f"(negligible overhead)")
print(f"Throughput: 2x dense tensor core TFLOPS")
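To use the sparse tensor cores, weights must first be pruned to the 2:4 pattern. A minimal sketch of that pruning step, assuming simple magnitude-based selection (keep the two largest-magnitude values per group of four):

```python
def prune_2_4(weights):
    """Prune a flat weight list to 2:4 sparsity: in every group of 4
    consecutive elements, zero the two with the smallest magnitude."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude elements in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

dense = [1.2, -0.1, 3.4, 0.05, 5.6, 0.2, -0.3, 7.8]
print(prune_2_4(dense))  # [1.2, 0.0, 3.4, 0.0, 5.6, 0.0, 0.0, 7.8]
```

In practice, frameworks re-train or fine-tune after pruning (e.g., NVIDIA's ASP workflow) to recover the accuracy lost to the zeroed weights.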
Ampere (A100 SXM) Tensor Core Throughput
| Format | TFLOPS/TOPS | With 2:4 Sparsity |
|---|---|---|
| FP64 (new for A100) | 19.5 | --- |
| TF32 (new) | 156 | 312 |
| BF16 (new) | 312 | 624 |
| FP16 | 312 | 624 |
| INT8 | 624 | 1248 |
| INT4 | 1248 | 2496 |
Hopper (2022): FP8, WGMMA, and TMA
The Biggest Architectural Leap
Hopper (H100, SM 9.0) introduced three major changes that fundamentally altered how tensor cores are programmed:
- FP8 tensor cores: Native support for 8-bit floating point (E4M3 and E5M2 formats), doubling throughput over FP16.
- WGMMA (Warp Group Matrix Multiply Accumulate): A new instruction that operates on a warp group of 128 threads (4 warps) instead of a single warp (32 threads). This provides larger tile sizes and better utilization.
- TMA (Tensor Memory Accelerator): A hardware DMA engine that asynchronously loads tensor tiles from global memory to shared memory, decoupling data movement from compute.
// Hopper WGMMA: operates on 128-thread warp groups
// This is a simplified illustration -- actual WGMMA uses
// the CUDA 12.x+ MMA API or inline PTX
// PTX for WGMMA (Hopper, sm_90)
// wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16
asm volatile(
"wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16 "
"{%0,%1,%2,...}, " // 128 output registers (across 128 threads)
"desc_a, desc_b, " // Tensor descriptors (not raw pointers)
"1, 1, 1;\n" // scale_d, imm_scale_a, imm_scale_b
: "=f"(d0), "=f"(d1), ...
:
);
WGMMA vs MMA: 128 Threads Instead of 32
def wgmma_vs_mma_analysis():
"""Compare Hopper WGMMA with Ampere MMA."""
comparison = {
"Ampere MMA (m16n8k16)": {
"threads": 32,
"tile_m": 16,
"tile_n": 8,
"tile_k": 16,
"output_elements": 16 * 8, # 128
"output_per_thread": 128 / 32, # 4
},
"Hopper WGMMA (m64n256k16)": {
"threads": 128,
"tile_m": 64,
"tile_n": 256,
"tile_k": 16,
"output_elements": 64 * 256, # 16384
"output_per_thread": 16384 / 128, # 128
},
}
for name, spec in comparison.items():
print(f"\n{name}:")
for key, val in spec.items():
print(f" {key}: {val}")
# WGMMA produces 128x more output per instruction
# This reduces instruction overhead and improves scheduling
ratio = 16384 / 128
print(f"\nOutput per instruction: WGMMA produces {ratio:.0f}x more "
f"elements than MMA")
TMA: Hardware Tensor Memory Accelerator
// TMA loads a tensor tile from global to shared memory
// without using any threads -- it is a hardware DMA engine
// Create TMA descriptor (host-side, once)
CUtensorMap tma_desc;
cuTensorMapEncodeTiled(
&tma_desc,
CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
2, // num_dims
global_ptr, // global memory base
dims, // global tensor dimensions
strides, // global tensor strides
box_dims, // tile dimensions to load
element_strides,
CU_TENSOR_MAP_INTERLEAVE_NONE,
CU_TENSOR_MAP_SWIZZLE_128B,
CU_TENSOR_MAP_L2_PROMOTION_L2_256B,
CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE
);
// In the kernel: initiate TMA load (single thread issues the load,
// hardware moves the entire tile)
if (threadIdx.x == 0) {
// Asynchronous: returns immediately, tile arrives in shared mem later
cp_async_bulk_tensor_2d_global_to_shared(
shared_ptr, // destination in shared memory
&tma_desc, // TMA descriptor
coord_x, // tile coordinates in the global tensor
coord_y,
barrier // arrival barrier
);
}
// Wait for tile to arrive
barrier.wait(); // All threads wait for the DMA to complete
Without TMA, threads must issue load instructions to move data from global to shared memory. With TMA, a single thread initiates the transfer and the hardware DMA engine handles it. The remaining 127 threads (in a warp group) can continue computing on previously loaded tiles. This enables true overlap of data movement and computation within a single kernel.
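A back-of-the-envelope model makes the benefit concrete. The per-tile timings below are made-up illustrative numbers, not H100 measurements:

```python
def kernel_time_us(n_tiles, load_us, compute_us, overlap):
    """Toy timing model: with overlap (TMA plus double buffering), each
    tile's load hides behind the previous tile's compute."""
    if overlap:
        # Only the first load is exposed; subsequent loads hide under
        # compute (assuming load_us <= compute_us).
        return load_us + n_tiles * max(load_us, compute_us)
    return n_tiles * (load_us + compute_us)

n, load, compute = 64, 3.0, 5.0  # hypothetical per-tile timings (us)
serial = kernel_time_us(n, load, compute, overlap=False)
overlapped = kernel_time_us(n, load, compute, overlap=True)
print(f"Serial: {serial:.0f} us, overlapped: {overlapped:.0f} us "
      f"({serial / overlapped:.2f}x speedup)")
```

When loads are fully hidden, the kernel becomes compute-bound: total time approaches `n_tiles * compute_us`, regardless of memory latency.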
Hopper (H100 SXM) Tensor Core Throughput
| Format | Dense TFLOPS/TOPS | 2:4 Sparse TFLOPS/TOPS |
|---|---|---|
| FP64 | 67 | --- |
| TF32 | 495 | 990 |
| BF16 | 990 | 1979 |
| FP16 | 990 | 1979 |
| FP8 (E4M3) | 1979 | 3958 |
| FP8 (E5M2) | 1979 | 3958 |
| INT8 | 1979 | 3958 |
Blackwell (2024): FP4 and Second-Gen Transformer Engine
FP4: Four-Bit Floating Point
Blackwell (B200/GB200, SM 10.0) introduces native FP4 tensor cores with a micro-scaling format:
- FP4 E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit. Only 16 bit patterns in total (8 magnitudes per sign, including zero).
- Micro-scaling: A shared 8-bit scale factor per small block of elements (E8M0 per 32 elements in the MX formats). The shared scale provides far more dynamic range than the 4-bit format alone.
- NVFP4: NVIDIA's FP4 format, which pairs E2M1 elements with an FP8 (E4M3) scale per 16-element block, designed for LLM inference.
def fp4_format_analysis():
"""Analyze FP4 E2M1 representable values."""
# FP4 E2M1: sign(1) + exponent(2) + mantissa(1)
# Exponent bias: 1
# Normal values: (-1)^s * 2^(e-1) * (1 + m/2)
# Special: e=0 is subnormal, e=3 is max normal (no inf/nan)
values = set()
for sign in [1, -1]:
for e in range(4): # 2-bit exponent: 0-3
for m in range(2): # 1-bit mantissa: 0-1
if e == 0:
# Subnormal
val = sign * (2 ** (1 - 1)) * (m / 2)
else:
val = sign * (2 ** (e - 1)) * (1 + m / 2)
values.add(val)
sorted_vals = sorted(values)
print(f"FP4 E2M1 representable values ({len(sorted_vals)} total):")
print(f" {sorted_vals}")
print(f" Range: [{min(sorted_vals)}, {max(sorted_vals)}]")
print(f" With E8M0 micro-scale: range extends to "
f"+/- {max(sorted_vals) * 2**127:.0e}")
fp4_format_analysis()
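Micro-scaling can be sketched on top of the value set above: pick a power-of-two (E8M0-style) scale per block so the block's max magnitude lands at FP4's max of 6, then round each element to the nearest representable value. This is an illustrative sketch, not NVIDIA's exact NVFP4 recipe:

```python
import math

FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Quantize one block with a shared power-of-two (E8M0-style) scale:
    scale the block so its max magnitude fits under FP4's max (6), then
    round each element to the nearest representable E2M1 value."""
    amax = max(abs(x) for x in block)
    # E8M0 holds only an exponent, so the scale must be a power of two.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0)) if amax > 0 else 1.0
    quantized = []
    for x in block:
        mag = min(FP4_E2M1_MAGNITUDES, key=lambda v: abs(abs(x) / scale - v))
        quantized.append(math.copysign(mag, x) * scale)
    return quantized, scale

block = [0.02, -0.11, 0.37, 0.09, -0.24, 0.05, 0.18, -0.31]
dq, scale = quantize_block_fp4(block)
print(f"Block scale: {scale}")
print([round(v, 5) for v in dq])
```

Each element costs 4 bits plus the amortized share of one 8-bit scale: at 16 elements per block, that is 4.5 bits per weight.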
Blackwell Throughput
Blackwell (B200 SXM) Tensor Core Throughput (Estimated)
| Format | Dense TFLOPS/TOPS | 2:4 Sparse TFLOPS/TOPS | vs H100 Dense |
|---|---|---|---|
| FP64 | 90 | --- | 1.34x |
| TF32 | 1125 | 2250 | 2.27x |
| BF16 | 2250 | 4500 | 2.27x |
| FP16 | 2250 | 4500 | 2.27x |
| FP8 (E4M3) | 4500 | 9000 | 2.27x |
| FP4 (E2M1) | 9000 | 18000 | --- |
| FP6 (E3M2/E2M3) | 4500 | 9000 | --- |
| INT8 | 4500 | 9000 | 2.27x |
Throughput Progression Across Generations
Tensor Core Throughput Evolution (TFLOPS/TOPS, dense)
[Chart omitted: dense tensor core throughput by generation; values as in the tables above.]
Runtime Detection of Tensor Core Support
Querying Device Capabilities
#include <cuda_runtime.h>
#include <cstdio>
struct TensorCoreSupport {
bool fp16;
bool bf16;
bool tf32;
bool fp8;
bool fp4;
bool int8;
bool int4;
bool sparse_2_4;
int sm_version;
void detect(int device_id = 0) {
cudaDeviceProp props;
cudaGetDeviceProperties(&props, device_id);
sm_version = props.major * 10 + props.minor;
// Tensor core availability by SM version
fp16 = (sm_version >= 70); // Volta+
int8 = (sm_version >= 75); // Turing+
int4 = (sm_version >= 75); // Turing+
bf16 = (sm_version >= 80); // Ampere+
tf32 = (sm_version >= 80); // Ampere+
sparse_2_4 = (sm_version >= 80); // Ampere+
fp8 = (sm_version >= 90); // Hopper+
fp4 = (sm_version >= 100); // Blackwell+
printf("GPU: %s (SM %d.%d)\n",
props.name, props.major, props.minor);
printf("Tensor Core Support:\n");
printf(" FP16: %s\n", fp16 ? "Yes" : "No");
printf(" BF16: %s\n", bf16 ? "Yes" : "No");
printf(" TF32: %s\n", tf32 ? "Yes" : "No");
printf(" FP8: %s\n", fp8 ? "Yes" : "No");
printf(" FP4: %s\n", fp4 ? "Yes" : "No");
printf(" INT8: %s\n", int8 ? "Yes" : "No");
printf(" INT4: %s\n", int4 ? "Yes" : "No");
printf(" 2:4 Sparse: %s\n", sparse_2_4 ? "Yes" : "No");
}
};
Python Detection
import torch
def detect_tensor_core_support():
"""Detect tensor core capabilities of the current GPU."""
if not torch.cuda.is_available():
print("No CUDA GPU available")
return
props = torch.cuda.get_device_properties(0)
sm = props.major * 10 + props.minor
capabilities = {
'GPU': props.name,
'SM Version': f'{props.major}.{props.minor}',
'FP16 Tensor Cores (Volta+)': sm >= 70,
'INT8 Tensor Cores (Turing+)': sm >= 75,
'BF16/TF32 Tensor Cores (Ampere+)': sm >= 80,
'2:4 Structured Sparsity (Ampere+)': sm >= 80,
'FP8 Tensor Cores (Hopper+)': sm >= 90,
'WGMMA Instructions (Hopper+)': sm >= 90,
'TMA Hardware (Hopper+)': sm >= 90,
'FP4 Tensor Cores (Blackwell+)': sm >= 100,
}
for capability, supported in capabilities.items():
status = "Supported" if supported is True else (
supported if isinstance(supported, str) else "Not available")
print(f" {capability}: {status}")
# Recommended quantization strategy based on hardware
if sm >= 100:
rec = "FP4 weights, FP8 activations (NVFP4)"
elif sm >= 90:
rec = "FP8 weights and activations (Transformer Engine)"
elif sm >= 80:
rec = "INT8 W8A8 (SmoothQuant) or W4A16 (GPTQ/AWQ)"
elif sm >= 75:
rec = "INT8 W8A8 or W4A16"
elif sm >= 70:
rec = "FP16 (no quantization benefit from tensor cores)"
else:
rec = "FP32 (no tensor cores)"
print(f"\n Recommended quantization: {rec}")
detect_tensor_core_support()
Summary
Tensor core evolution follows a clear pattern: each generation adds lower-precision formats (FP16 to INT8 to FP8 to FP4) and increases the tile size of the basic matrix operation (16x16x16 per warp to 64x256x16 per warp group). The programming model evolved from WMMA (simple, warp-level) to MMA (low-level, warp-level) to WGMMA (warp-group-level with TMA integration). The throughput progression from Volta to Blackwell is 72x for the lowest supported precision at each generation, driven primarily by smaller numeric formats (which pack more operations per silicon area) and larger tile sizes (which improve utilization).
For LLM inference, the practical impact: H100 FP8 provides 2x throughput over FP16 at minimal quality loss. Blackwell FP4 will extend this to 4x over FP8 — if the quantization quality holds. The hardware is always ahead of the software: FP8 hardware shipped in 2022, but production FP8 inference only became mainstream in 2024. FP4 hardware ships in 2024, and production FP4 inference is likely 1-2 years away.