Part 12 of 30 in the series: GPU Hardware & AI Accelerators

The register file is the fastest and most constrained resource on a GPU. Each Hopper SM contains 65,536 32-bit registers — 256 KB of storage accessible at zero additional latency. Every active thread occupies a fixed number of these registers for the entire duration of its execution. A kernel that uses 32 registers per thread with 48 warps (1,536 threads) consumes 1,536 × 32 = 49,152 registers — 75% of the register file. Increase to 64 registers per thread, and the SM can only hold 1,024 threads (32 warps), dropping occupancy to 67%. At 128 registers per thread, maximum occupancy falls to 512 threads (16 warps) — 33%.

This inverse relationship between register usage and occupancy is the central tension in GPU kernel optimization. More registers per thread means each thread can hold more intermediate values without spilling to memory, running faster individually. But fewer threads means fewer warps available for latency hiding, potentially leaving execution units idle during memory stalls.

This post covers the register file’s physical architecture, how the compiler allocates registers, the exact occupancy equations, what happens when registers spill, and how to profile and optimize register pressure.

Register File Architecture

Physical Organization

The register file is physically split into four slices, one per SM processing partition; each partition's warp schedulers draw on their own 16,384-register (64 KB) share:

📊

Register File Specifications Across Architectures

| Specification | Volta (SM 7.0) | Ampere (SM 8.0) | Hopper (SM 9.0) |
|---|---|---|---|
| Registers per SM | 65,536 | 65,536 | 65,536 |
| Register file size per SM | 256 KB | 256 KB | 256 KB |
| Register width | 32 bits | 32 bits | 32 bits |
| Max registers per thread | 255 | 255 | 255 |
| Max threads per SM | 2,048 | 2,048 | 1,536 |
| Max warps per SM | 64 | 64 | 48 |
| Registers per SM partition | 16,384 | 16,384 | 16,384 |
| Register bank count | 4 (per partition) | 4 (per partition) | 4 (per partition) |
| Read ports per cycle per bank | 1 | 1 | 1 |
Note: The register file size has remained constant at 256 KB since Volta. Hopper reduced max threads to 1,536 (from 2,048 on Ampere) because warp group operations naturally limit concurrency.

Register Banks and Bank Conflicts

The register file is divided into 4 banks per partition (16 banks per SM). Each bank can serve one read per cycle. When an instruction needs two source operands from the same bank, a register bank conflict occurs — the operand collector must serialize the reads, adding a 1-cycle penalty.

The compiler (ptxas) attempts to allocate registers to avoid bank conflicts by distributing operands across banks. The allocation follows a pattern: consecutive register numbers map to different banks (R0 to bank 0, R1 to bank 1, R2 to bank 2, R3 to bank 3, R4 to bank 0, …).

// Register-to-bank mapping (simplified):
// Register R[n] → Bank (n % 4)
// R0 → Bank 0, R1 → Bank 1, R2 → Bank 2, R3 → Bank 3
// R4 → Bank 0, R5 → Bank 1, R6 → Bank 2, R7 → Bank 3

// Instruction: FFMA R4, R0, R1, R2
// Reads: R0 (Bank 0), R1 (Bank 1), R2 (Bank 2) → No conflict (3 different banks)

// Instruction: FFMA R4, R0, R4, R8
// Reads: R0 (Bank 0), R4 (Bank 0), R8 (Bank 0) → CONFLICT (all Bank 0)
// Takes 3 cycles to read instead of 1
ℹ️ Bank Conflicts Are Rare in Practice

The compiler is effective at avoiding register bank conflicts through careful register numbering. Nsight Compute reports bank conflicts under the “Source” section of the scheduler statistics. In well-optimized kernels, bank conflicts typically account for less than 2% of stall cycles. Manual intervention is almost never needed.

Register Allocation by the Compiler

Static Allocation

CUDA uses static register allocation. The compiler determines the register count per thread at compile time. This count is fixed for all threads in the kernel — there is no dynamic register allocation at runtime.

# View register allocation for a compiled kernel
nvcc -Xptxas -v my_kernel.cu -o my_kernel 2>&1 | grep "registers"
# Output: ptxas info : Used 48 registers, 0 bytes shared memory

# More detailed: dump SASS to see actual register usage
cuobjdump -sass my_kernel | head -20
# The SASS header shows register count and per-thread resource usage

How ptxas Decides Register Count

The ptxas assembler (NVIDIA’s PTX-to-SASS compiler) performs register allocation as the final compilation step. It:

  1. Performs liveness analysis to determine the maximum number of simultaneously live values.
  2. Assigns physical registers to virtual registers using graph coloring.
  3. If the number of live values exceeds the maximum register count (255, or the user-specified limit), it spills excess values to local memory.

The compiler’s allocation is usually near-optimal. It trades off between:

  • Fewer registers → Higher occupancy → Better latency hiding → But more spills
  • More registers → Lower occupancy → Worse latency hiding → But faster per-thread execution
# Examine the register allocation tradeoff
nvcc -Xptxas -v --maxrregcount=32 my_kernel.cu 2>&1
# ptxas info : Used 32 registers, 384 bytes spill stores, 384 bytes spill loads

nvcc -Xptxas -v --maxrregcount=64 my_kernel.cu 2>&1
# ptxas info : Used 64 registers, 0 bytes spill stores, 0 bytes spill loads

nvcc -Xptxas -v my_kernel.cu 2>&1
# ptxas info : Used 48 registers (compiler's default choice)

The Occupancy-Register Equation

Calculating Maximum Occupancy from Register Count

The register file places a hard limit on how many warps can be resident per SM:

max_warps = ⌊65536 / (regs_per_thread × 32)⌋

But this is capped by the hardware maximum (48 warps on Hopper, 64 on Ampere):

resident_warps = min(max_warps_hw, ⌊65536 / (regs_per_thread × 32)⌋)

And occupancy:

occupancy = resident_warps / max_warps_hw

📊

Register Count vs Occupancy (Hopper SM 9.0, 48 max warps)

| Regs/Thread | Regs/Warp | Max Warps (register limit) | Effective Max Warps | Occupancy |
|---|---|---|---|---|
| 16 | 512 | 128 | 48 (hw cap) | 100% |
| 24 | 768 | 85 | 48 (hw cap) | 100% |
| 32 | 1,024 | 64 | 48 (hw cap) | 100% |
| 40 | 1,280 | 51 | 48 (hw cap) | 100% |
| 42 | 1,344 | 48 | 48 | 100% |
| 48 | 1,536 | 42 | 42 | 87.5% |
| 64 | 2,048 | 32 | 32 | 66.7% |
| 80 | 2,560 | 25 | 25 | 52.1% |
| 96 | 3,072 | 21 | 21 | 43.8% |
| 128 | 4,096 | 16 | 16 | 33.3% |
| 192 | 6,144 | 10 | 10 | 20.8% |
| 255 | 8,160 | 8 | 8 | 16.7% |
Note: The table uses raw register counts. Hardware allocates in granules of 8 on Hopper (4 on Volta), so a kernel using 33 registers is rounded up to 40. After rounding, the practical breakpoint for 100% occupancy on Hopper is 40 registers per thread: 41 rounds up to 48, which drops occupancy to 87.5%.

Register Allocation Granularity

Registers are not allocated individually. They are allocated in granules — fixed-size blocks. On Hopper, the granule is 8 registers. If a kernel uses 33 registers per thread, the hardware allocates 40 (rounded up to the next multiple of 8).

This means small increases in register usage can have no effect on occupancy (if you stay within the same granule) or a large effect (if you cross a granule boundary that hits a warp limit):

// 40 regs/thread: 40*32 = 1280 regs/warp → 65536/1280 = 51 → capped at 48 → 100%
// 41 regs/thread: rounds to 48 → 48*32 = 1536 regs/warp → 65536/1536 = 42 → 87.5%
// That single extra register dropped occupancy by 12.5%
⚠️ The Granularity Cliff

Because of register granularity, adding one register to your kernel can cause zero change or a 12.5% occupancy drop. Always check the actual register allocation (via nvcc -Xptxas -v) after any code change. The compiler may eliminate or add registers in ways that are not obvious from the source code.

Register Spilling

What Spilling Means

When the compiler cannot fit all live values in registers (either because there are too many simultaneously live values, or because --maxrregcount is set too low), it spills excess values to local memory. Local memory is physically located in HBM (or L1/L2 cache if the data is recently accessed). Each spill is a store (to write the value out) and a load (to bring it back when needed).

# Detect spilling
nvcc -Xptxas -v my_kernel.cu 2>&1 | grep -i spill
# ptxas info : Used 32 registers, 96 bytes spill stores, 96 bytes spill loads
# 96 bytes = 24 float values spilled

The Cost of Spilling

A register access adds effectively zero latency. A spill load that hits L1 cache takes ~33 cycles; one that misses to HBM takes ~500 cycles. The spill traffic also competes with the kernel's actual memory traffic for bandwidth.

// High-register kernel with no spills
__global__ void no_spill(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // 12 intermediate values → ~12 registers for intermediates
    float a = in[idx];
    float b = a * a;
    float c = b + a;
    float d = c * b;
    float e = d + c;
    float f = e * d;
    float g = f + e;
    float h = g * f;
    float i = h + g;
    float j = i * h;
    float k = j + i;
    float l = k * j;
    out[idx] = l;
}
// Compiler uses ~20 registers. No spills. Full occupancy.
// Same kernel with 40+ intermediate values forcing spills at --maxrregcount=32
__global__ void forced_spill(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float vals[40];  // 40 floats → 40 registers when the loops fully unroll
    vals[0] = in[idx];
    #pragma unroll   // Full unroll makes every vals[] index a compile-time
    for (int i = 1; i < 40; i++) {  // constant, so the array can stay in registers
        vals[i] = vals[i-1] * vals[i-1] + vals[i/2];  // Many live values
    }
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 40; i++) {
        sum += vals[i];
    }
    out[idx] = sum;
}
// At --maxrregcount=32: heavy spilling to local memory
// Spill stores/loads can add 50-100+ cycles per iteration

Impact of Register Spilling on Kernel Performance

| Configuration | % of Peak Throughput |
|---|---|
| 64 regs, 0 spills, 67% occupancy (baseline: optimal) | 100% |
| 48 regs, 16 spill loads, 87% occupancy (spills hurt more than occupancy helps) | 78% |
| 32 regs, 48 spill loads, 100% occupancy (severe spilling destroys performance) | 52% |
| 128 regs, 0 spills, 33% occupancy (low occupancy but no spills) | 85% |
| 255 regs, 0 spills, 17% occupancy (too few warps for latency hiding) | 61% |

Detecting Spills in SASS

# Dump SASS assembly to see spill instructions
cuobjdump -sass my_kernel | grep -E "STL|LDL"
# STL = Store to Local memory (spill store)
# LDL = Load from Local memory (spill load)

# Example SASS with spills:
# /*0090*/ STL [R1+0x10], R6 ;    // Spill R6 to local memory
# /*00a0*/ LDL R6, [R1+0x10] ;    // Reload R6 from local memory

Controlling Register Usage

Method 1: __launch_bounds__

The preferred method. Tells the compiler the maximum block size and desired minimum blocks per SM:

// Launch with at most 256 threads/block, want at least 4 blocks/SM
__global__ __launch_bounds__(256, 4)
void my_kernel(float* data, int n) {
    // With 4 blocks of 256 threads = 1024 threads = 32 warps
    // 65536 / 32 = 2048 registers per warp = 64 registers per thread max
    // Compiler will try to use <= 64 registers, spilling if necessary
    // ...
}

The compiler interprets __launch_bounds__(maxThreads, minBlocks) as:

  • The kernel will be launched with at most maxThreads threads per block.
  • The developer wants at least minBlocks blocks concurrently on each SM.
  • Therefore, the compiler targets at most 65536/(minBlocks×maxThreads)\lfloor 65536 / (\text{minBlocks} \times \text{maxThreads}) \rfloor registers per thread.
__launch_bounds__(256, 4)  // Target: 64 regs/thread
__launch_bounds__(256, 2)  // Target: 128 regs/thread (more generous)
__launch_bounds__(128, 8)  // Target: 64 regs/thread
__launch_bounds__(1024, 1) // Target: 64 regs/thread (one 1024-thread block = 32 warps)
Always Specify launch_bounds for Performance-Critical Kernels

Without __launch_bounds__, the compiler assumes worst-case block size (1024 threads) and targets register usage that allows at least 1 block per SM. This is overly conservative for most kernels. Specifying the actual launch configuration lets the compiler make better register allocation decisions. The second parameter (minBlocks) has the biggest impact — it directly controls the register budget.

Method 2: --maxrregcount

A global compile-time flag that caps registers for all kernels in the translation unit:

# Cap all kernels to 32 registers per thread
nvcc --maxrregcount=32 my_kernel.cu -o my_kernel

This is a blunt instrument. It applies to every kernel in the file, including simple utility kernels that do not benefit from the cap. Prefer __launch_bounds__ for per-kernel control.

Method 3: Inline Assembly Hints

In extreme cases, you can use inline PTX to hint at register usage:

__global__ void precise_kernel(float* data) {
    float a, b, c, d;
    // The empty volatile asm marks a, b, c, d as written, so the compiler
    // must assign them physical registers without emitting any instructions.
    asm volatile("" : "=f"(a), "=f"(b), "=f"(c), "=f"(d));
    // Consume the values so they stay live; used to benchmark exact register counts.
    data[threadIdx.x] = a + b + c + d;
}

Register Tiling: The Optimization That Defines High-Performance Kernels

The Core Idea

Register tiling increases the number of output elements computed per thread by storing multiple values in registers. Each thread loads a small tile of input data into registers, computes multiple outputs, and writes them all back. The key insight: data loaded into registers can be reused across multiple computations at zero cost.

// Without tiling: 1 output per thread
__global__ void naive_add(float* C, const float* A, const float* B, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
// 2 loads, 1 store, 1 ADD per thread
// Regs per thread: ~8 (indices, pointers, temporaries)

// With 4x register tiling: 4 outputs per thread
__global__ void tiled_add(float* C, const float* A, const float* B, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 < n) {
        float a0 = A[base], a1 = A[base+1], a2 = A[base+2], a3 = A[base+3];
        float b0 = B[base], b1 = B[base+1], b2 = B[base+2], b3 = B[base+3];
        C[base]   = a0 + b0;
        C[base+1] = a1 + b1;
        C[base+2] = a2 + b2;
        C[base+3] = a3 + b3;
    }
}
// 8 scalar loads, 4 stores, 4 ADDs per thread; note the *4 stride means each
// scalar load is strided across the warp, so float4 loads/stores coalesce better
// Regs per thread: ~16 (8 data values + indices + pointers)
// 4x fewer threads launched → 4x fewer blocks → less scheduling overhead

Register Tiling in GEMM

The canonical example of register tiling is matrix multiplication. CUTLASS uses register tiles of size thread_tile_m x thread_tile_n — each thread computes that many output elements:

// CUTLASS-style register tiling for GEMM
// Each thread computes an 8x8 tile of C
// Thread stores: 8*8 = 64 accumulator registers (float → 64 regs)
//               + A fragment: 8 registers
//               + B fragment: 8 registers
//               + index/pointer registers: ~8
// Total: ~88 registers per thread

// With 88 regs/thread (already a multiple of the 8-register granule):
// 65536 / (88 * 32) = 23 warps → 48% occupancy on Hopper

// But each thread computes 64 FMAs per K-loop iteration
// Arithmetic intensity: 64 FMAs / (8+8 loads) = 4 FMAs per load
// With shared memory tiling: 64 FMAs / (16 shared mem loads) ≈ 4 FMAs/load
// This is enough to keep the math pipe saturated at 48% occupancy
📊

GEMM Register Tile Sizes and Their Tradeoffs

| Thread Tile (M×N) | Accum Regs | Total Regs (approx) | Occupancy | Arithmetic Intensity |
|---|---|---|---|---|
| 4×4 | 16 | ~40 | 100% | Low (scheduler-bound) |
| 4×8 | 32 | ~56 | 75% | Medium |
| 8×8 | 64 | ~88 | 48% | High (optimal for most GEMM) |
| 8×16 | 128 | ~160 | 25% | Very high (needs large K) |
| 16×16 | 256 (exceeds 255 limit) | N/A | N/A | Impossible (register limit) |
Note: 8x8 is the sweet spot for most FP32 GEMM kernels on Hopper. It provides enough arithmetic intensity to saturate compute while maintaining sufficient occupancy for latency hiding.

Double Buffering Registers

A common technique to overlap memory loads with computation: maintain two sets of input registers. While computing on one set, load the next iteration’s data into the other:

// Double-buffered register tiling (pseudocode)
__global__ void gemm_double_buffer(/* ... */) {
    float a_reg[2][8];   // Two buffers for A fragment (8 regs each)
    float b_reg[2][8];   // Two buffers for B fragment
    float c_reg[8][8];   // Accumulator (64 regs)

    int buf = 0;
    // Load first tile
    load_a_tile(a_reg[buf], /* ... */);
    load_b_tile(b_reg[buf], /* ... */);

    for (int k = 0; k < K; k += TILE_K) {
        int next_buf = 1 - buf;
        // Start loading next tile into alternate buffer
        load_a_tile(a_reg[next_buf], /* k+TILE_K ... */);
        load_b_tile(b_reg[next_buf], /* k+TILE_K ... */);

        // Compute on current buffer (overlaps with loads above)
        for (int kk = 0; kk < TILE_K; kk++) {
            outer_product(c_reg, a_reg[buf], b_reg[buf], kk);
        }
        buf = next_buf;
    }
    // Store c_reg to global memory
}
// Total registers: 2*8 + 2*8 + 64 + overhead = ~100 registers
// Occupancy: ~20 warps → 42% on Hopper
// But loads and computes overlap — the scheduler hides load latency

Profiling Register Pressure with Nsight Compute

Key Metrics

# Profile register usage and its impact
ncu --metrics \
  launch__registers_per_thread,\
  launch__registers_per_thread_allocated,\
  sm__warps_active.avg,\
  sm__warps_active.avg.pct_of_peak_sustained_active,\
  l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum,\
  l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum \
  -k my_kernel ./my_app

The key metrics:

  • launch__registers_per_thread: Registers the kernel was compiled with.
  • launch__registers_per_thread_allocated: Registers actually allocated (after granularity rounding).
  • l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum: Local memory load sectors (spill loads).
  • l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum: Local memory store sectors (spill stores).
# Check if register pressure is limiting occupancy
ncu --section Occupancy -k my_kernel ./my_app
# The Occupancy section shows:
# - Theoretical occupancy (based on register count, shared memory, block size)
# - Achieved occupancy (actual)
# - Which resource is the occupancy limiter (registers, shared memory, or block size)
🚨 Local Memory Traffic Is a Red Flag

Any nonzero value for l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum means the kernel is spilling registers. Even if spills hit L1 cache (33-cycle latency instead of 500), they still consume LD/ST unit bandwidth and add stall cycles. The threshold for concern: if local memory traffic exceeds 5% of global memory traffic, spills are likely a significant performance factor.

Occupancy Analysis Workflow

# Step 1: Find the register count
nvcc -Xptxas -v my_kernel.cu 2>&1 | grep "Used"
# Used 72 registers, 8192 bytes smem, 0 bytes cmem

# Step 2: Calculate theoretical occupancy
# 72 regs → rounded to 72 (granule of 8 on Hopper) → already aligned
# Warps from registers: floor(65536 / (72 * 32)) = floor(28.4) = 28
# Warps from shared memory: 8192 bytes per block, 256 threads per block
#   SM has ~228 KB shared = 233472 bytes
#   Max blocks = floor(233472 / 8192) = 28 blocks
#   Warps from shared = 28 * 8 = 224 (not limiting)
# Warps from block size: 256 threads = 8 warps per block
#   Max blocks from threads: floor(1536 / 256) = 6 blocks → 48 warps (not limiting)
# Limiting factor: REGISTERS → 28 warps → 58.3% occupancy

# Step 3: Check if reducing registers helps
nvcc -Xptxas -v --maxrregcount=64 my_kernel.cu 2>&1
# Used 64 registers, 128 bytes spill stores, 128 bytes spill loads
# Warps from registers: floor(65536 / (64 * 32)) = 32 warps → 66.7%
# Gained 4 warps but added spills — profile to see if net positive

Architecture-Specific Register Considerations

Hopper: Warp Group Register Sharing

WGMMA instructions on Hopper operate on warp groups (4 warps, 128 threads). The 4 warps in a group share the tensor core result across their registers. The register layout for WGMMA operands follows a specific pattern:

// WGMMA m64n128k16 (FP16 accumulation in FP32):
// Each warp group uses 4 warps
// Each thread in the group holds a portion of the 64x128 output tile
// Thread[t] in Warp[w] holds: C[row, col] where
//   row = w*16 + (t / 4) * 2 + (t % 2)   (across 64 rows)
//   col = (t / 2) % 8 + group_offset      (across 128 columns)
// Total accumulators: 64 * 128 / 128 threads = 64 FP32 values per thread
// = 64 registers just for the accumulator

Ampere: INT32 + FP32 Concurrent Execution

On Ampere (SM 8.0), each partition can execute an INT32 instruction and an FP32 instruction concurrently (using separate datapaths). Index calculations (INT32) can overlap with floating-point computation (FP32), effectively doubling throughput for mixed-type code. However, both instruction types consume registers from the same 256 KB file — there is no separate integer register file.

// On Ampere, these can execute concurrently within the same warp:
int idx = threadIdx.x + blockIdx.x * blockDim.x;  // INT32 pipe
float val = a * b + c;                              // FP32 pipe
// Both consume registers from the same 65536-register file

Register Usage for Different Data Types

// FP32: 1 register per value (32 bits)
float x;      // 1 register

// FP64: 2 registers per value (64 bits, stored in register pair)
double y;     // 2 registers (R0:R1)

// FP16/BF16: 1 register holds 2 values (packed)
half2 z;      // 1 register (two FP16 values)

// INT8/FP8: 1 register holds 4 values
char4 w;      // 1 register (four INT8 values)
ℹ️ FP64 Doubles Register Pressure

A kernel using double instead of float doubles its register consumption. A GEMM kernel with an 8x8 thread tile needs 64 float accumulators (64 registers) but 128 registers for double accumulators — pushing occupancy below 25%. This is why FP64 kernels almost always have lower occupancy and require different tiling strategies than FP32 kernels.

Optimization Decision Tree

When profiling reveals register pressure as the occupancy limiter:

  1. Check for unnecessary live values. Are there variables that can be recomputed instead of stored? Recomputation uses ALU cycles (cheap) instead of registers (expensive).

  2. Check for excessive unrolling. Loop unrolling increases register pressure because all unrolled iterations’ values are live simultaneously. Reduce unroll factor.

  3. Try __launch_bounds__. Set minBlocks to your target occupancy. If the compiler spills, check whether the spill cost is less than the occupancy gain.

  4. Profile with different register caps.

# Sweep register caps and measure performance
for REGS in 32 48 64 80 96 128; do
    nvcc --maxrregcount=$REGS -o kernel_$REGS my_kernel.cu
    ncu --metrics \
      gpu__time_duration.sum,\
      launch__registers_per_thread,\
      sm__warps_active.avg.pct_of_peak_sustained_active,\
      l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum \
      ./kernel_$REGS >> sweep_results.txt 2>&1
done

Register Cap Sweep: GEMM 4096x4096 on H100 (runtime in ms, lower is better)

| Register Cap | Occupancy | Spills | Runtime |
|---|---|---|---|
| 32 | 100% | heavy | 4.2 ms (spills dominate) |
| 48 | 87% | moderate | 2.8 ms (still spilling) |
| 64 | 67% | none | 1.5 ms (sweet spot) |
| 80 | 52% | none | 1.4 ms (marginal gain) |
| 96 | 44% | none | 1.45 ms (occupancy starts hurting) |
| 128 | 33% | none | 1.7 ms (too few warps) |
  5. Consider algorithmic changes. If the kernel fundamentally needs more registers than the occupancy target allows, restructure the algorithm: smaller tiles, streaming computation, or multi-pass approaches.

Summary

The register file is the single most important resource to understand for GPU kernel optimization. Its 256 KB capacity per SM has remained constant across Volta, Ampere, and Hopper — while everything else (memory bandwidth, compute throughput, tensor core performance) has scaled dramatically. Every register consumed per thread directly reduces the number of concurrent warps, and the relationship is nonlinear due to granularity effects.

The optimization process: compile with -Xptxas -v to see register counts, use Nsight Compute’s Occupancy section to identify whether registers are the limiting factor, sweep register caps to find the performance-optimal point, and use __launch_bounds__ to communicate the target to the compiler. Avoid --maxrregcount as a global setting. Prefer register tiling over shared memory when possible — registers are faster and do not require synchronization.

The key insight: the optimal register count is almost never the minimum or the maximum. It is the point where the marginal cost of one more register (reduced occupancy, fewer warps for latency hiding) exactly equals the marginal benefit (one fewer spill, one less memory access). Finding that point requires measurement, not theory.