Part of Series: GPU Hardware & AI Accelerators (7 of 30)

Tensor cores are the specialized matrix multiply-accumulate units on NVIDIA GPUs. They perform small matrix multiplications (e.g., 8x8x4 FP16) in a single operation, achieving throughput that is 4-16x higher than standard CUDA cores for the same silicon area and power. Every generation since Volta (2017) has expanded tensor core capabilities: new data types, larger matrix tiles, higher throughput, and new programming interfaces.

This post traces the precise changes at each generation — what the tensor core physically computes, how the programming model evolved from WMMA to MMA to WGMMA, and what each change means for kernel performance. The throughput progression from Volta (125 TFLOPS FP16) to Blackwell (9,000 TFLOPS FP4) represents a 72x increase in seven years, driven almost entirely by tensor core improvements.

Volta (2017): The First Tensor Cores

Architecture

Volta (V100, SM version 7.0) introduced the first tensor cores. Each Volta SM contains 8 tensor cores, each performing one 4x4x4 FP16 matrix multiply-accumulate per cycle:

D_{4×4} = A_{4×4} × B_{4×4} + C_{4×4}

At the programming level, the smallest addressable operation is a 16x16x16 matrix multiply across a warp (32 threads), using the WMMA (Warp Matrix Multiply Accumulate) API.

#include <mma.h>
using namespace nvcuda;

// Volta WMMA: one warp computes one 16x16 output tile of C = A * B
// (simplified: tile selection by block/warp index is omitted)
__global__ void volta_tensor_core_gemm(
    const half* A, const half* B, float* C,
    int M, int N, int K
) {
    // Declare fragments
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> frag_a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> frag_b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag_c;

    // Initialize accumulator
    wmma::fill_fragment(frag_c, 0.0f);

    // Accumulate along the K dimension, one 16x16x16 MMA at a time
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(frag_a, A + k, K);       // A: [16, K] row-major, stride K
        wmma::load_matrix_sync(frag_b, B + k, K);       // B: [K, 16] col-major, stride K
        wmma::mma_sync(frag_c, frag_a, frag_b, frag_c); // frag_c += frag_a * frag_b
    }

    // Store the 16x16 result tile
    wmma::store_matrix_sync(C, frag_c, N, wmma::mem_row_major);
}

Volta Tensor Core Specifications

📊

Volta (V100 SXM2) Tensor Core Specifications

| Specification | Value |
|---|---|
| Tensor cores per SM | 8 |
| SMs per GPU | 80 |
| Total tensor cores | 640 |
| Operation per core per cycle | 4x4x4 FP16 MMA |
| FP16 throughput (with FP32 accumulate) | 125 TFLOPS |
| Supported formats | FP16 only |
| Accumulator format | FP16 or FP32 |
| API | WMMA (warp-level, 32 threads) |
| Minimum tile size | 16x16x16 |
| Clock (boost) | 1530 MHz |
Note: Volta tensor cores only support FP16 input. The 125 TFLOPS is 8x the FP32 throughput of CUDA cores on the same chip (15.7 TFLOPS FP32).
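The 125 TFLOPS figure follows directly from the unit counts in the table. A quick sanity check of the peak-throughput arithmetic (assuming the boost clock is sustained):

```python
# Peak FP16 throughput = cores * MACs/core/cycle * 2 ops/MAC * clock
tensor_cores = 640               # 8 per SM * 80 SMs
macs_per_core_cycle = 4 * 4 * 4  # a 4x4x4 MMA is 64 multiply-accumulates
ops_per_mac = 2                  # one multiply + one add
clock_hz = 1530e6                # boost clock

peak_flops = tensor_cores * macs_per_core_cycle * ops_per_mac * clock_hz
print(f"Peak FP16 tensor throughput: {peak_flops / 1e12:.1f} TFLOPS")
# ~125.3 TFLOPS, matching the datasheet figure
```

The same cores-times-clock arithmetic reproduces the headline numbers of the later generations as well.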

Turing (2018): Integer Tensor Cores

What Changed

Turing (T4, RTX 2080, SM 7.5) added INT8 and INT4 support to tensor cores. This was the first hardware support for quantized inference on tensor cores.

// Turing INT8 WMMA: 8x32x16
// Different tile sizes for integer types
wmma::fragment<wmma::matrix_a, 8, 32, 16, signed char, wmma::row_major> frag_a;
wmma::fragment<wmma::matrix_b, 8, 32, 16, signed char, wmma::col_major> frag_b;
wmma::fragment<wmma::accumulator, 8, 32, 16, int> frag_c;

// Accumulator is INT32 (to avoid overflow)
wmma::fill_fragment(frag_c, 0);
wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);

New Capabilities

📊

Turing (T4) vs Volta (V100) Tensor Cores

| Capability | Volta (V100) | Turing (T4) |
|---|---|---|
| FP16 TFLOPS | 125 | 65 |
| INT8 TOPS | --- | 130 |
| INT4 TOPS | --- | 260 |
| INT1 (binary) TOPS | --- | 520 |
| Tensor cores per SM | 8 | 8 |
| SMs | 80 | 40 |
| TDP | 300W | 70W |
| INT8 TOPS/watt | --- | 1.86 |
Note: T4 has fewer SMs and lower TDP than V100, but adds INT8/INT4 support. The T4 at 70W is designed for inference in data centers with power constraints.
ℹ️ Turing's Integer Tensor Cores Enabled Quantized Inference

Before Turing, INT8 inference required dequantizing to FP16 and running through FP16 tensor cores. Turing’s INT8 tensor cores directly multiply INT8 inputs with INT32 accumulation, eliminating the dequantization overhead and providing 2x throughput over FP16 for the same tensor core count.
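To make the accumulation behavior concrete, here is a small pure-Python model (not real device code) of an INT8 dot product: INT8 inputs, INT32 accumulation, and a single dequantization at the output. The vectors and per-tensor scales are made up for illustration:

```python
def int8_dot(a_q, b_q, scale_a, scale_b):
    """Model of a tensor core INT8 dot product: INT8 inputs,
    INT32 accumulation, dequantize once at the end."""
    assert all(-128 <= v <= 127 for v in a_q + b_q)
    acc = 0                          # models the INT32 accumulator
    for x, y in zip(a_q, b_q):
        acc += x * y                 # INT8 x INT8 products summed in INT32
    return acc * scale_a * scale_b   # single dequantization step

# Toy per-tensor symmetric quantization: scale = max|x| / 127
a_fp = [0.5, -1.0, 0.25]
b_fp = [2.0, 0.5, -0.75]
scale_a = max(abs(v) for v in a_fp) / 127
scale_b = max(abs(v) for v in b_fp) / 127
a_q = [round(v / scale_a) for v in a_fp]
b_q = [round(v / scale_b) for v in b_fp]

approx = int8_dot(a_q, b_q, scale_a, scale_b)
exact = sum(x * y for x, y in zip(a_fp, b_fp))
print(f"INT8 result: {approx:.4f}, exact FP: {exact:.4f}")
```

The key point the model captures: no per-element dequantization happens inside the loop, which is exactly what Turing's hardware eliminated.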

Ampere (2020): TF32, BF16, and Structured Sparsity

Three Major Additions

Ampere (A100, SM 8.0) introduced three capabilities:

  1. TF32 (TensorFloat-32): A 19-bit format (8-bit exponent, 10-bit mantissa) that provides FP32 range with reduced precision. TF32 tensor cores accept FP32 inputs and internally truncate to TF32 — no code change required. This gives FP32 code 8x speedup on tensor cores.

  2. BF16 (bfloat16): 16-bit floating point with 8-bit exponent (same range as FP32) and 7-bit mantissa. Preferred for training because its range matches FP32, avoiding overflow issues that plague FP16.

  3. Structured Sparsity (2:4): Hardware support for 50% sparse matrices where exactly 2 out of every 4 consecutive elements are zero. The tensor core skips the zero elements, doubling throughput.
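The first two additions are essentially different truncations of the FP32 bit pattern. A small sketch that models the conversion by zeroing low mantissa bits (real hardware rounds rather than truncates; this simplification only illustrates the precision loss):

```python
import struct

def truncate_mantissa(x: float, kept_mantissa_bits: int) -> float:
    """Zero out the low (23 - kept) mantissa bits of an FP32 value.
    Simplified model: hardware rounds instead of truncating."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    drop = 23 - kept_mantissa_bits
    bits &= ~((1 << drop) - 1)   # clear the dropped mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

x = 1.0 + 1 / 3                   # mantissa with infinitely many bits
tf32 = truncate_mantissa(x, 10)   # TF32: 8-bit exponent, 10-bit mantissa
bf16 = truncate_mantissa(x, 7)    # BF16: 8-bit exponent, 7-bit mantissa

print(f"FP32: {x!r}")
print(f"TF32: {tf32!r}  (error {abs(x - tf32):.2e})")
print(f"BF16: {bf16!r}  (error {abs(x - bf16):.2e})")
```

Both formats keep FP32's 8-bit exponent, which is why neither changes the representable range -- only the mantissa shrinks.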

// Ampere MMA (lower-level API, replaces WMMA for power users)
// m16n8k16 FP16 MMA instruction
asm volatile(
    "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
    : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
    : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
      "r"(b0), "r"(b1),
      "f"(c0), "f"(c1), "f"(c2), "f"(c3)
);

// TF32: same code as FP32, but with TF32 enabled (default in PyTorch)
// torch.backends.cuda.matmul.allow_tf32 = True
// No code change needed -- the tensor core silently truncates FP32 to TF32

Structured Sparsity

def structured_sparsity_explanation():
    """Explain Ampere's 2:4 structured sparsity."""
    # 2:4 pattern: in every group of 4 consecutive elements,
    # exactly 2 are zero.
    # Example:
    # Dense:  [1.2, 0.0, 3.4, 0.0, 5.6, 0.0, 7.8, 0.0]
    # This is 2:4 sparse: pairs (1.2, 3.4), (5.6, 7.8)
    #
    # The hardware stores:
    # - Values: [1.2, 3.4, 5.6, 7.8] (50% of original)
    # - Index:  [0, 2, 0, 2] (2-bit index per value)
    #
    # The tensor core uses the index to reconstruct the
    # sparse multiplication without materializing zeros.

    dense_params = 4096 * 4096
    sparse_params = dense_params // 2  # 50% pruned
    metadata_bits = sparse_params * 2  # 2-bit index per value

    print(f"Dense parameters: {dense_params:,}")
    print(f"Sparse parameters: {sparse_params:,}")
    print(f"Metadata: {metadata_bits / 8 / 1e6:.1f} MB "
          f"(negligible overhead)")
    print(f"Throughput: 2x dense tensor core TFLOPS")
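Producing a 2:4-sparse matrix is a software step (typically magnitude pruning followed by fine-tuning); a minimal sketch of the pruning rule itself:

```python
def prune_2_4(row):
    """Prune a list of floats to the 2:4 pattern: in each group of 4
    consecutive elements, keep the 2 largest-magnitude values."""
    assert len(row) % 4 == 0
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # Indices of the two largest-magnitude elements in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

dense = [1.2, -0.1, 3.4, 0.05, 5.6, 0.2, -0.3, 7.8]
sparse = prune_2_4(dense)
print(sparse)  # [1.2, 0.0, 3.4, 0.0, 5.6, 0.0, 0.0, 7.8]
```

The hardware then stores only the kept values plus the 2-bit indices described above.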
📊

Ampere (A100 SXM) Tensor Core Throughput

| Format | TFLOPS/TOPS | With 2:4 Sparsity |
|---|---|---|
| FP64 (new for A100) | 19.5 | --- |
| TF32 (new) | 156 | 312 |
| BF16 (new) | 312 | 624 |
| FP16 | 312 | 624 |
| INT8 | 624 | 1248 |
| INT4 | 1248 | 2496 |
Note: A100 provides 2x FP16 throughput over V100 (312 vs 125 TFLOPS) plus TF32 and BF16 support. Structured sparsity doubles all throughput numbers.

Hopper (2022): FP8, WGMMA, and TMA

The Biggest Architectural Leap

Hopper (H100, SM 9.0) introduced three major changes that fundamentally altered how tensor cores are programmed:

  1. FP8 tensor cores: Native support for 8-bit floating point (E4M3 and E5M2 formats), doubling throughput over FP16.

  2. WGMMA (Warp Group Matrix Multiply Accumulate): A new instruction that operates on a warp group of 128 threads (4 warps) instead of a single warp (32 threads). This provides larger tile sizes and better utilization.

  3. TMA (Tensor Memory Accelerator): A hardware DMA engine that asynchronously loads tensor tiles from global memory to shared memory, decoupling data movement from compute.
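The two FP8 variants trade precision for range. Their extremes follow directly from the bit layouts (E4M3 here per the OCP convention, with no infinities and a single NaN mantissa pattern; E5M2 is IEEE-like with inf/NaN):

```python
# E4M3: bias 7; the all-ones exponent is usable except the all-ones
# mantissa (NaN), so the max mantissa pattern is 110 -> 1.75
e4m3_max = (1 + 6 / 8) * 2 ** (15 - 7)    # = 448
# E5M2: bias 15; all-ones exponent reserved for inf/NaN,
# so max exponent field is 11110 and max mantissa is 11 -> 1.75
e5m2_max = (1 + 3 / 4) * 2 ** (30 - 15)   # = 57344
e4m3_min_sub = 2 ** (1 - 7) * (1 / 8)     # smallest subnormal
e5m2_min_sub = 2 ** (1 - 15) * (1 / 4)

print(f"E4M3: max {e4m3_max:.0f}, min subnormal {e4m3_min_sub:.2e}")
print(f"E5M2: max {e5m2_max:.0f}, min subnormal {e5m2_min_sub:.2e}")
```

E4M3 (max 448, finer mantissa) is typically used for weights and activations; E5M2 (max 57344, wider range) for gradients.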

// Hopper WGMMA: operates on 128-thread warp groups
// This is a simplified illustration -- actual WGMMA uses
// the CUDA 12.x+ MMA API or inline PTX

// PTX for WGMMA (Hopper, sm_90)
// wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16
asm volatile(
    "wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16 "
    "{%0,%1,%2,...}, " // 128 output registers (across 128 threads)
    "desc_a, desc_b, "  // Tensor descriptors (not raw pointers)
    "1, 1, 1;\n"       // scale_d, imm_scale_a, imm_scale_b
    : "=f"(d0), "=f"(d1), ...
    :
);

WGMMA vs MMA: 128 Threads Instead of 32

def wgmma_vs_mma_analysis():
    """Compare Hopper WGMMA with Ampere MMA."""
    comparison = {
        "Ampere MMA (m16n8k16)": {
            "threads": 32,
            "tile_m": 16,
            "tile_n": 8,
            "tile_k": 16,
            "output_elements": 16 * 8,  # 128
            "output_per_thread": 128 / 32,  # 4
        },
        "Hopper WGMMA (m64n256k16)": {
            "threads": 128,
            "tile_m": 64,
            "tile_n": 256,
            "tile_k": 16,
            "output_elements": 64 * 256,  # 16384
            "output_per_thread": 16384 / 128,  # 128
        },
    }

    for name, spec in comparison.items():
        print(f"\n{name}:")
        for key, val in spec.items():
            print(f"  {key}: {val}")

    # WGMMA produces 128x more output per instruction
    # This reduces instruction overhead and improves scheduling
    ratio = 16384 / 128
    print(f"\nOutput per instruction: WGMMA produces {ratio:.0f}x more "
          f"elements than MMA")

TMA: Hardware Tensor Memory Accelerator

// TMA loads a tensor tile from global to shared memory
// without using any threads -- it is a hardware DMA engine

// Create TMA descriptor (host-side, once)
CUtensorMap tma_desc;
cuTensorMapEncodeTiled(
    &tma_desc,
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
    2,                    // num_dims
    global_ptr,           // global memory base
    dims,                 // global tensor dimensions
    strides,              // global tensor strides
    box_dims,             // tile dimensions to load
    element_strides,
    CU_TENSOR_MAP_INTERLEAVE_NONE,
    CU_TENSOR_MAP_SWIZZLE_128B,
    CU_TENSOR_MAP_L2_PROMOTION_L2_256B,
    CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE
);

// In the kernel: initiate TMA load (single thread issues the load,
// hardware moves the entire tile)
if (threadIdx.x == 0) {
    // Asynchronous: returns immediately, tile arrives in shared mem later
    cp_async_bulk_tensor_2d_global_to_shared(
        shared_ptr,    // destination in shared memory
        &tma_desc,     // TMA descriptor
        coord_x,       // tile coordinates in the global tensor
        coord_y,
        barrier        // arrival barrier
    );
}

// Wait for tile to arrive
barrier.wait();  // All threads wait for the DMA to complete
TMA Frees Threads for Compute

Without TMA, threads must issue load instructions to move data from global to shared memory. With TMA, a single thread initiates the transfer and the hardware DMA engine handles it. The remaining 127 threads (in a warp group) can continue computing on previously loaded tiles. This enables true overlap of data movement and computation within a single kernel.
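The payoff is easy to quantify with a toy pipeline model: serial execution pays for load and compute on every tile, while a perfectly double-buffered loop pays only for the slower stage after the first load. The per-tile costs below are hypothetical:

```python
def pipeline_time(n_tiles, t_load, t_compute, overlap):
    """Toy model of a tiled kernel: total time with and without
    load/compute overlap (hypothetical per-tile costs, microseconds)."""
    if not overlap:
        return n_tiles * (t_load + t_compute)
    # The first load cannot be hidden; afterwards the slower stage dominates
    return t_load + n_tiles * max(t_load, t_compute)

n, t_load, t_compute = 64, 2.0, 3.0   # hypothetical values
serial = pipeline_time(n, t_load, t_compute, overlap=False)
overlapped = pipeline_time(n, t_load, t_compute, overlap=True)
print(f"Serial:     {serial:.0f} us")
print(f"Overlapped: {overlapped:.0f} us ({serial / overlapped:.2f}x faster)")
```

When load time exceeds compute time the kernel is memory-bound and overlap alone cannot hide it, which is why TMA is paired with larger WGMMA tiles that raise arithmetic intensity.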

📊

Hopper (H100 SXM) Tensor Core Throughput

| Format | Dense TFLOPS/TOPS | 2:4 Sparse TFLOPS/TOPS |
|---|---|---|
| FP64 | 67 | 134 |
| TF32 | 495 | 990 |
| BF16 | 990 | 1979 |
| FP16 | 990 | 1979 |
| FP8 (E4M3) | 1979 | 3958 |
| FP8 (E5M2) | 1979 | 3958 |
| INT8 | 1979 | 3958 |
Note: H100 provides 2x FP16 throughput over A100 (990 vs 312 TFLOPS) and adds FP8 at 2x FP16. With structured sparsity, FP8 reaches 3958 TFLOPS -- a 32x improvement over Volta.

Blackwell (2024): FP4 and Second-Gen Transformer Engine

FP4: Four-Bit Floating Point

Blackwell (B200/GB200, SM 10.0) introduces native FP4 tensor cores with a micro-scaling format:

  • FP4 E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit. Only 16 encodings in total (8 magnitudes per sign, including zero).
  • Micro-scaling: A shared 8-bit scale factor per small block (typically 16 or 32 elements). The block scale provides far more dynamic range than the 4-bit format alone; the OCP MXFP4 format uses a power-of-two E8M0 scale per 32-element block.
  • NVFP4: NVIDIA’s FP4 format for LLM inference, which pairs 16-element blocks with FP8 (E4M3) scale factors for finer-grained scaling.
def fp4_format_analysis():
    """Analyze FP4 E2M1 representable values."""
    # FP4 E2M1: sign(1) + exponent(2) + mantissa(1)
    # Exponent bias: 1
    # Normal values: (-1)^s * 2^(e-1) * (1 + m/2)
    # Special: e=0 is subnormal, e=3 is max normal (no inf/nan)

    values = set()
    for sign in [1, -1]:
        for e in range(4):  # 2-bit exponent: 0-3
            for m in range(2):  # 1-bit mantissa: 0-1
                if e == 0:
                    # Subnormal
                    val = sign * (2 ** (1 - 1)) * (m / 2)
                else:
                    val = sign * (2 ** (e - 1)) * (1 + m / 2)
                values.add(val)

    sorted_vals = sorted(values)
    print(f"FP4 E2M1 representable values ({len(sorted_vals)} total):")
    print(f"  {sorted_vals}")
    print(f"  Range: [{min(sorted_vals)}, {max(sorted_vals)}]")
    print(f"  With E8M0 micro-scale: range extends to "
          f"+/- {max(sorted_vals) * 2**127:.0e}")

fp4_format_analysis()
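A sketch of the micro-scaling idea itself, assuming an MXFP4-style power-of-two (E8M0) scale chosen per block; the block values are made up for illustration:

```python
import math

# The 8 non-negative FP4 E2M1 magnitudes (from the analysis above)
E2M1_MAGS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Sketch of micro-scaled FP4 quantization: one power-of-two scale
    per block, each element rounded to the nearest E2M1 value."""
    amax = max(abs(x) for x in block) or 1.0
    # E8M0-style scale: power of two mapping the block max onto
    # E2M1's largest magnitude (6.0)
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    dequantized = []
    for x in block:
        mag = min(E2M1_MAGS, key=lambda m: abs(m - abs(x) / scale))
        dequantized.append(math.copysign(mag, x) * scale)
    return dequantized, scale

block = [0.11, -0.52, 0.98, -1.4, 0.27, 2.1, -0.04, 0.73]
deq, scale = quantize_block_fp4(block)
max_err = max(abs(a - b) for a, b in zip(block, deq))
print(f"scale = {scale}, max abs error = {max_err:.3f}")
```

Because the scale is shared per block, one outlier per block costs precision only locally rather than across the whole tensor, which is the reason small block sizes matter for LLM weights.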

Blackwell Throughput

📊

Blackwell (B200 SXM) Tensor Core Throughput (Estimated)

| Format | Dense TFLOPS/TOPS | 2:4 Sparse TFLOPS/TOPS | vs H100 Dense |
|---|---|---|---|
| FP64 | 90 | 180 | 1.34x |
| TF32 | 1125 | 2250 | 2.27x |
| BF16 | 2250 | 4500 | 2.27x |
| FP16 | 2250 | 4500 | 2.27x |
| FP8 (E4M3) | 4500 | 9000 | 2.27x |
| FP4 (E2M1) | 9000 | 18000 | --- |
| FP6 (E3M2/E2M3) | 4500 | 9000 | --- |
| INT8 | 4500 | 9000 | 2.27x |
Note: Blackwell adds FP4 (9000 TFLOPS dense) and FP6, plus 2.27x improvement across all existing formats vs H100. FP4 with 2:4 sparsity reaches 18000 TFLOPS -- 144x Volta.

Throughput Progression Across Generations

Tensor Core Throughput Evolution (TFLOPS/TOPS, dense)

| Generation / Format | Dense TFLOPS/TOPS | vs Previous |
|---|---|---|
| V100 FP16 | 125 | --- |
| A100 FP16 | 312 | 2.5x V100 FP16 |
| A100 INT8 | 624 | --- |
| H100 FP16 | 990 | 3.2x A100 FP16 |
| H100 FP8 | 1979 | --- |
| B200 FP8 | 4500 | 2.3x H100 FP8 |
| B200 FP4 | 9000 | 72x V100 FP16 |

Runtime Detection of Tensor Core Support

Querying Device Capabilities

#include <cuda_runtime.h>
#include <cstdio>

struct TensorCoreSupport {
    bool fp16;
    bool bf16;
    bool tf32;
    bool fp8;
    bool fp4;
    bool int8;
    bool int4;
    bool sparse_2_4;
    int sm_version;

    void detect(int device_id = 0) {
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, device_id);

        sm_version = props.major * 10 + props.minor;

        // Tensor core availability by SM version
        fp16 = (sm_version >= 70);    // Volta+
        int8 = (sm_version >= 75);    // Turing+
        int4 = (sm_version >= 75);    // Turing+
        bf16 = (sm_version >= 80);    // Ampere+
        tf32 = (sm_version >= 80);    // Ampere+
        sparse_2_4 = (sm_version >= 80);  // Ampere+
        fp8 = (sm_version >= 90);     // Hopper+
        fp4 = (sm_version >= 100);    // Blackwell+

        printf("GPU: %s (SM %d.%d)\n",
               props.name, props.major, props.minor);
        printf("Tensor Core Support:\n");
        printf("  FP16:     %s\n", fp16 ? "Yes" : "No");
        printf("  BF16:     %s\n", bf16 ? "Yes" : "No");
        printf("  TF32:     %s\n", tf32 ? "Yes" : "No");
        printf("  FP8:      %s\n", fp8 ? "Yes" : "No");
        printf("  FP4:      %s\n", fp4 ? "Yes" : "No");
        printf("  INT8:     %s\n", int8 ? "Yes" : "No");
        printf("  INT4:     %s\n", int4 ? "Yes" : "No");
        printf("  2:4 Sparse: %s\n", sparse_2_4 ? "Yes" : "No");
    }
};

Python Detection

import torch

def detect_tensor_core_support():
    """Detect tensor core capabilities of the current GPU."""
    if not torch.cuda.is_available():
        print("No CUDA GPU available")
        return

    props = torch.cuda.get_device_properties(0)
    sm = props.major * 10 + props.minor

    capabilities = {
        'GPU': props.name,
        'SM Version': f'{props.major}.{props.minor}',
        'FP16 Tensor Cores (Volta+)': sm >= 70,
        'INT8 Tensor Cores (Turing+)': sm >= 75,
        'BF16/TF32 Tensor Cores (Ampere+)': sm >= 80,
        '2:4 Structured Sparsity (Ampere+)': sm >= 80,
        'FP8 Tensor Cores (Hopper+)': sm >= 90,
        'WGMMA Instructions (Hopper+)': sm >= 90,
        'TMA Hardware (Hopper+)': sm >= 90,
        'FP4 Tensor Cores (Blackwell+)': sm >= 100,
    }

    for capability, value in capabilities.items():
        if isinstance(value, bool):
            value = "Supported" if value else "Not available"
        print(f"  {capability}: {value}")

    # Recommended quantization strategy based on hardware
    if sm >= 100:
        rec = "FP4 weights, FP8 activations (NVFP4)"
    elif sm >= 90:
        rec = "FP8 weights and activations (Transformer Engine)"
    elif sm >= 80:
        rec = "INT8 W8A8 (SmoothQuant) or W4A16 (GPTQ/AWQ)"
    elif sm >= 75:
        rec = "INT8 W8A8 or W4A16"
    elif sm >= 70:
        rec = "FP16 (no quantization benefit from tensor cores)"
    else:
        rec = "FP32 (no tensor cores)"

    print(f"\n  Recommended quantization: {rec}")

detect_tensor_core_support()

Summary

Tensor core evolution follows a clear pattern: each generation adds lower-precision formats (FP16 to INT8 to FP8 to FP4) and increases the tile size of the basic matrix operation (16x16x16 per warp to 64x256x16 per warp group). The programming model evolved from WMMA (simple, warp-level) to MMA (low-level, warp-level) to WGMMA (warp-group-level with TMA integration). The throughput progression from Volta to Blackwell is 72x for the lowest supported precision at each generation, driven primarily by smaller numeric formats (which pack more operations per silicon area) and larger tile sizes (which improve utilization).

For LLM inference, the practical impact: H100 FP8 provides 2x throughput over FP16 at minimal quality loss. Blackwell FP4 will extend this to 4x over FP8 — if the quantization quality holds. The hardware is always ahead of the software: FP8 hardware shipped in 2022, but production FP8 inference only became mainstream in 2024. FP4 hardware ships in 2024, and production FP4 inference is likely 1-2 years away.