The register file is the fastest and most constrained resource on a GPU. Each Hopper SM contains 65,536 32-bit registers — 256 KB of storage accessible at zero additional latency. Every active thread occupies a fixed number of these registers for the entire duration of its execution. A kernel that uses 32 registers per thread with 48 warps (1,536 threads) consumes 49,152 registers — 75% of the register file. Increase to 64 registers per thread, and the SM can only hold 1,024 threads (32 warps), dropping occupancy to 67%. At 128 registers per thread, maximum occupancy falls to 512 threads (16 warps) — 33%.
This inverse relationship between register usage and occupancy is the central tension in GPU kernel optimization. More registers per thread means each thread can hold more intermediate values without spilling to memory, running faster individually. But fewer threads means fewer warps available for latency hiding, potentially leaving execution units idle during memory stalls.
This post covers the register file’s physical architecture, how the compiler allocates registers, the exact occupancy equations, what happens when registers spill, and how to profile and optimize register pressure.
Register File Architecture
Physical Organization
The register file is physically distributed across the 4 SM partitions; each partition holds a 16,384-register slice that serves only the warps scheduled on that partition:
Register File Specifications Across Architectures
| Specification | Volta (SM 7.0) | Ampere (SM 8.0) | Hopper (SM 9.0) |
|---|---|---|---|
| Registers per SM | 65,536 | 65,536 | 65,536 |
| Register file size per SM | 256 KB | 256 KB | 256 KB |
| Register width | 32 bits | 32 bits | 32 bits |
| Max registers per thread | 255 | 255 | 255 |
| Max threads per SM | 2,048 | 2,048 | 1,536 |
| Max warps per SM | 64 | 64 | 48 |
| Registers per SM partition | 16,384 | 16,384 | 16,384 |
| Register bank count | 4 (per partition) | 4 (per partition) | 4 (per partition) |
| Read ports per cycle per bank | 1 | 1 | 1 |
Register Banks and Bank Conflicts
The register file is divided into 4 banks per partition (16 banks per SM). Each bank can serve one read per cycle. When an instruction needs two or more source operands from the same bank, a register bank conflict occurs — the operand collector must serialize the reads, adding one cycle of latency for each extra same-bank operand.
The compiler (ptxas) attempts to allocate registers to avoid bank conflicts by distributing operands across banks. The allocation follows a pattern: consecutive register numbers map to different banks (R0 to bank 0, R1 to bank 1, R2 to bank 2, R3 to bank 3, R4 to bank 0, …).
// Register-to-bank mapping (simplified):
// Register R[n] → Bank (n % 4)
// R0 → Bank 0, R1 → Bank 1, R2 → Bank 2, R3 → Bank 3
// R4 → Bank 0, R5 → Bank 1, R6 → Bank 2, R7 → Bank 3
// Instruction: FFMA R4, R0, R1, R2
// Reads: R0 (Bank 0), R1 (Bank 1), R2 (Bank 2) → No conflict (3 different banks)
// Instruction: FFMA R4, R0, R4, R8
// Reads: R0 (Bank 0), R4 (Bank 0), R8 (Bank 0) → CONFLICT (all Bank 0)
// Takes 3 cycles to read instead of 1
The compiler is effective at avoiding register bank conflicts through careful register numbering. Nsight Compute reports bank conflicts under the “Source” section of the scheduler statistics. In well-optimized kernels, bank conflicts typically account for less than 2% of stall cycles. Manual intervention is almost never needed.
Register Allocation by the Compiler
Static Allocation
CUDA uses static register allocation. The compiler determines the register count per thread at compile time. This count is fixed for all threads in the kernel — there is no dynamic register allocation at runtime.
# View register allocation for a compiled kernel
nvcc -Xptxas -v my_kernel.cu -o my_kernel 2>&1 | grep "registers"
# Output: ptxas info : Used 48 registers, 0 bytes shared memory
# More detailed: dump SASS to see actual register usage
cuobjdump -sass my_kernel | head -20
# The SASS header shows register count and per-thread resource usage
How ptxas Decides Register Count
The ptxas assembler (NVIDIA’s PTX-to-SASS compiler) performs register allocation as the final compilation step. It:
- Performs liveness analysis to determine the maximum number of simultaneously live values.
- Assigns physical registers to virtual registers using graph coloring.
- If the number of live values exceeds the maximum register count (255, or the user-specified limit), it spills excess values to local memory.
The compiler’s allocation is usually near-optimal. It trades off between:
- Fewer registers → Higher occupancy → Better latency hiding → But more spills
- More registers → Lower occupancy → Worse latency hiding → But faster per-thread execution
# Examine the register allocation tradeoff
nvcc -Xptxas -v --maxrregcount=32 my_kernel.cu 2>&1
# ptxas info : Used 32 registers, 384 bytes spill stores, 384 bytes spill loads
nvcc -Xptxas -v --maxrregcount=64 my_kernel.cu 2>&1
# ptxas info : Used 64 registers, 0 bytes spill stores, 0 bytes spill loads
nvcc -Xptxas -v my_kernel.cu 2>&1
# ptxas info : Used 48 registers (compiler's default choice)
The Occupancy-Register Equation
Calculating Maximum Occupancy from Register Count
The register file places a hard limit on how many warps can be resident per SM:

warps_from_registers = floor(registers_per_SM / (regs_per_thread × 32))

But this is capped by the hardware maximum (48 warps on Hopper, 64 on Ampere):

effective_max_warps = min(warps_from_registers, max_warps)

And occupancy:

occupancy = effective_max_warps / max_warps
Register Count vs Occupancy (Hopper SM 9.0, 48 max warps; exact allocation assumed — granule rounding is covered in the next section)
| Regs/Thread | Regs/Warp | Max Warps (register limit) | Effective Max Warps | Occupancy |
|---|---|---|---|---|
| 16 | 512 | 128 | 48 (hw cap) | 100% |
| 24 | 768 | 85 | 48 (hw cap) | 100% |
| 32 | 1,024 | 64 | 48 (hw cap) | 100% |
| 40 | 1,280 | 51 | 48 (hw cap) | 100% |
| 42 | 1,344 | 48 | 48 | 100% |
| 48 | 1,536 | 42 | 42 | 87.5% |
| 64 | 2,048 | 32 | 32 | 66.7% |
| 80 | 2,560 | 25 | 25 | 52.1% |
| 96 | 3,072 | 21 | 21 | 43.8% |
| 128 | 4,096 | 16 | 16 | 33.3% |
| 192 | 6,144 | 10 | 10 | 20.8% |
| 255 | 8,160 | 8 | 8 | 16.7% |
Register Allocation Granularity
Registers are not allocated individually. They are allocated in granules — fixed-size blocks. On Hopper, the granule is 8 registers. If a kernel uses 33 registers per thread, the hardware allocates 40 (rounded up to the next multiple of 8).
This means small increases in register usage can have no effect on occupancy (if you stay within the same granule) or a large effect (if you cross a granule boundary that hits a warp limit):
// 40 regs/thread: 40*32 = 1280 regs/warp → 65536/1280 = 51 → capped at 48 → 100%
// 41 regs/thread: rounds to 48 → 48*32 = 1536 regs/warp → 65536/1536 = 42 → 87.5%
// That single extra register dropped occupancy by 12.5%
Because of register granularity, adding one register to your kernel can cause zero change or a 12.5% occupancy drop. Always check the actual register allocation (via nvcc -Xptxas -v) after any code change. The compiler may eliminate or add registers in ways that are not obvious from the source code.
Register Spilling
What Spilling Means
When the compiler cannot fit all live values in registers (either because there are too many simultaneously live values, or because --maxrregcount is set too low), it spills excess values to local memory. Local memory physically resides in device memory (HBM) and is cached through L2 and L1 like other per-thread traffic. Each spilled value costs a store (to write the value out) and a load (to bring it back when needed).
# Detect spilling
nvcc -Xptxas -v my_kernel.cu 2>&1 | grep -i spill
# ptxas info : Used 32 registers, 96 bytes spill stores, 96 bytes spill loads
# 96 bytes = 24 float values spilled
The Cost of Spilling
A register access adds no latency beyond the instruction pipeline itself. A spill load that hits L1 cache takes ~33 cycles; one that goes to HBM takes ~500 cycles. Spill traffic also competes with the kernel’s actual memory traffic for bandwidth.
// High-register kernel with no spills
__global__ void no_spill(float* out, const float* in, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
// 12 intermediate values → ~12 registers for intermediates
float a = in[idx];
float b = a * a;
float c = b + a;
float d = c * b;
float e = d + c;
float f = e * d;
float g = f + e;
float h = g * f;
float i = h + g;
float j = i * h;
float k = j + i;
float l = k * j;
out[idx] = l;
}
// Compiler uses ~20 registers. No spills. Full occupancy.
// Same kernel with 40+ intermediate values forcing spills at --maxrregcount=32
__global__ void forced_spill(float* out, const float* in, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
float vals[40]; // 40 floats = 40 registers if no spill
vals[0] = in[idx];
for (int i = 1; i < 40; i++) {
vals[i] = vals[i-1] * vals[i-1] + vals[i/2]; // Many live values
}
float sum = 0.0f;
for (int i = 0; i < 40; i++) {
sum += vals[i];
}
out[idx] = sum;
}
// At --maxrregcount=32: heavy spilling to local memory
// Spill stores/loads can add 50-100+ cycles per iteration
[Chart: Impact of Register Spilling on Kernel Performance — y-axis: % of peak throughput]
Detecting Spills in SASS
# Dump SASS assembly to see spill instructions
cuobjdump -sass my_kernel | grep -E "STL|LDL"
# STL = Store to Local memory (spill store)
# LDL = Load from Local memory (spill load)
# Example SASS with spills:
# /*0090*/ STL [R1+0x10], R6 ; // Spill R6 to local memory
# /*00a0*/ LDL R6, [R1+0x10] ; // Reload R6 from local memory
Controlling Register Usage
Method 1: __launch_bounds__
The preferred method. Tells the compiler the maximum block size and desired minimum blocks per SM:
// Launch with at most 256 threads/block, want at least 4 blocks/SM
__global__ __launch_bounds__(256, 4)
void my_kernel(float* data, int n) {
// With 4 blocks of 256 threads = 1024 threads = 32 warps
// 65536 / 32 = 2048 registers per warp = 64 registers per thread max
// Compiler will try to use <= 64 registers, spilling if necessary
// ...
}
The compiler interprets __launch_bounds__(maxThreads, minBlocks) as:
- The kernel will be launched with at most maxThreads threads per block.
- The developer wants at least minBlocks blocks resident concurrently on each SM.
- Therefore, the compiler targets at most floor(65,536 / (maxThreads × minBlocks)) registers per thread.
__launch_bounds__(256, 4) // Target: 64 regs/thread
__launch_bounds__(256, 2) // Target: 128 regs/thread (more generous)
__launch_bounds__(128, 8) // Target: 64 regs/thread
__launch_bounds__(1024, 1) // Target: 64 regs/thread (65536 / 1024; one 1024-thread block caps Hopper at 32 of 48 warps regardless)
Without __launch_bounds__, the compiler assumes worst-case block size (1024 threads) and targets register usage that allows at least 1 block per SM. This is overly conservative for most kernels. Specifying the actual launch configuration lets the compiler make better register allocation decisions. The second parameter (minBlocks) has the biggest impact — it directly controls the register budget.
Method 2: --maxrregcount
A global compile-time flag that caps registers for all kernels in the translation unit:
# Cap all kernels to 32 registers per thread
nvcc --maxrregcount=32 my_kernel.cu -o my_kernel
This is a blunt instrument. It applies to every kernel in the file, including simple utility kernels that do not benefit from the cap. Prefer __launch_bounds__ for per-kernel control.
Method 3: Inline Assembly Hints
In extreme cases, you can use inline PTX to hint at register usage:
__global__ void precise_kernel(float* data) {
float a, b, c, d;
asm volatile("" : "=f"(a), "=f"(b), "=f"(c), "=f"(d));
// The asm volatile prevents the compiler from optimizing away the registers
// Used to benchmark exact register counts
}
Register Tiling: The Optimization That Defines High-Performance Kernels
The Core Idea
Register tiling increases the number of output elements computed per thread by storing multiple values in registers. Each thread loads a small tile of input data into registers, computes multiple outputs, and writes them all back. The key insight: data loaded into registers can be reused across multiple computations at zero cost.
// Without tiling: 1 output per thread
__global__ void naive_add(float* C, const float* A, const float* B, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) C[i] = A[i] + B[i];
}
// 2 loads, 1 store, 1 ADD per thread
// Regs per thread: ~8 (indices, pointers, temporaries)
// With 4x register tiling: 4 outputs per thread
__global__ void tiled_add(float* C, const float* A, const float* B, int n) {
int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
if (base + 3 < n) {
float a0 = A[base], a1 = A[base+1], a2 = A[base+2], a3 = A[base+3];
float b0 = B[base], b1 = B[base+1], b2 = B[base+2], b3 = B[base+3];
C[base] = a0 + b0;
C[base+1] = a1 + b1;
C[base+2] = a2 + b2;
C[base+3] = a3 + b3;
}
}
// 8 loads (coalesced in 2 transactions), 4 stores, 4 ADDs per thread
// Regs per thread: ~16 (8 data values + indices + pointers)
// 4x fewer threads launched → 4x fewer blocks → less scheduling overhead
Register Tiling in GEMM
The canonical example of register tiling is matrix multiplication. CUTLASS uses register tiles of size thread_tile_m x thread_tile_n — each thread computes that many output elements:
// CUTLASS-style register tiling for GEMM
// Each thread computes an 8x8 tile of C
// Thread stores: 8*8 = 64 accumulator registers (float → 64 regs)
// + A fragment: 8 registers
// + B fragment: 8 registers
// + index/pointer registers: ~8
// Total: ~88 registers per thread
// With 88 regs/thread (already a multiple of the 8-register granule):
// 65536 / (88 * 32) = 23 warps → 48% occupancy on Hopper
// But each thread computes 64 FMAs per K-loop iteration
// Arithmetic intensity: 64 FMAs / (8+8 loads) = 4 FMAs per load
// With shared memory tiling: 64 FMAs / (16 shared mem loads) ≈ 4 FMAs/load
// This is enough to keep the math pipe saturated at 48% occupancy
GEMM Register Tile Sizes and Their Tradeoffs
| Thread Tile (MxN) | Accum Regs | Total Regs (approx) | Occupancy | Arithmetic Intensity |
|---|---|---|---|---|
| 4x4 | 16 | ~40 | 100% | Low — scheduler-bound |
| 4x8 | 32 | ~56 | 75% | Medium |
| 8x8 | 64 | ~88 | 48% | High — optimal for most GEMM |
| 8x16 | 128 | ~160 | 25% | Very high — needs large K |
| 16x16 | 256 (exceeds 255 limit) | N/A | N/A | Impossible — register limit |
Double Buffering Registers
A common technique to overlap memory loads with computation: maintain two sets of input registers. While computing on one set, load the next iteration’s data into the other:
// Double-buffered register tiling (pseudocode)
__global__ void gemm_double_buffer(/* ... */) {
float a_reg[2][8]; // Two buffers for A fragment (8 regs each)
float b_reg[2][8]; // Two buffers for B fragment
float c_reg[8][8]; // Accumulator (64 regs)
int buf = 0;
// Load first tile
load_a_tile(a_reg[buf], /* ... */);
load_b_tile(b_reg[buf], /* ... */);
for (int k = 0; k < K; k += TILE_K) {
int next_buf = 1 - buf;
// Start loading next tile into alternate buffer
load_a_tile(a_reg[next_buf], /* k+TILE_K ... */);
load_b_tile(b_reg[next_buf], /* k+TILE_K ... */);
// Compute on current buffer (overlaps with loads above)
for (int kk = 0; kk < TILE_K; kk++) {
outer_product(c_reg, a_reg[buf], b_reg[buf], kk);
}
buf = next_buf;
}
// Store c_reg to global memory
}
// Total registers: 2*8 + 2*8 + 64 + overhead = ~100 registers
// Occupancy: ~20 warps → 42% on Hopper
// But loads and computes overlap — the scheduler hides load latency
Profiling Register Pressure with Nsight Compute
Key Metrics
# Profile register usage and its impact
ncu --metrics \
launch__registers_per_thread,\
launch__registers_per_thread_allocated,\
sm__warps_active.avg,\
sm__warps_active.avg.pct_of_peak_sustained_active,\
l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum,\
l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum \
-k my_kernel ./my_app
The key metrics:
- launch__registers_per_thread: Registers the kernel was compiled with.
- launch__registers_per_thread_allocated: Registers actually allocated (after granularity rounding).
- l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum: Local memory load sectors (spill loads).
- l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum: Local memory store sectors (spill stores).
# Check if register pressure is limiting occupancy
ncu --section Occupancy -k my_kernel ./my_app
# The Occupancy section shows:
# - Theoretical occupancy (based on register count, shared memory, block size)
# - Achieved occupancy (actual)
# - Which resource is the occupancy limiter (registers, shared memory, or block size)
Any nonzero value for l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum means the kernel is spilling registers. Even if spills hit L1 cache (33-cycle latency instead of 500), they still consume LD/ST unit bandwidth and add stall cycles. The threshold for concern: if local memory traffic exceeds 5% of global memory traffic, spills are likely a significant performance factor.
Occupancy Analysis Workflow
# Step 1: Find the register count
nvcc -Xptxas -v my_kernel.cu 2>&1 | grep "Used"
# Used 72 registers, 8192 bytes smem, 0 bytes cmem
# Step 2: Calculate theoretical occupancy
# 72 regs → rounded to 72 (granule of 8 on Hopper) → already aligned
# Warps from registers: floor(65536 / (72 * 32)) = floor(28.4) = 28
# Warps from shared memory: 8192 bytes per block, 256 threads per block
# SM has ~228 KB shared = 233472 bytes
# Max blocks = floor(233472 / 8192) = 28 blocks
# Warps from shared = 28 * 8 = 224 (not limiting)
# Warps from block size: 256 threads = 8 warps per block
# Max blocks from threads: floor(1536 / 256) = 6 blocks → 48 warps (not limiting)
# Limiting factor: REGISTERS → 28 warps → 58.3% occupancy
# Step 3: Check if reducing registers helps
nvcc -Xptxas -v --maxrregcount=64 my_kernel.cu 2>&1
# Used 64 registers, 128 bytes spill stores, 128 bytes spill loads
# Warps from registers: floor(65536 / (64 * 32)) = 32 warps → 66.7%
# Gained 4 warps but added spills — profile to see if net positive
Architecture-Specific Register Considerations
Hopper: Warp Group Register Sharing
WGMMA instructions on Hopper operate on warp groups (4 warps, 128 threads). The 4 warps in a group share the tensor core result across their registers. The register layout for WGMMA operands follows a specific pattern:
// WGMMA m64n128k16 (FP16 accumulation in FP32):
// Each warp group uses 4 warps
// Each thread in the group holds a portion of the 64x128 output tile
// Thread[t] in Warp[w] holds: C[row, col] where
// row = w*16 + (t / 4) * 2 + (t % 2) (across 64 rows)
// col = (t / 2) % 8 + group_offset (across 128 columns)
// Total accumulators: 64 * 128 / 128 threads = 64 FP32 values per thread
// = 64 registers just for the accumulator
Ampere: INT32 + FP32 Concurrent Execution
On Ampere (SM 8.0), each partition can execute an INT32 instruction and an FP32 instruction concurrently (using separate datapaths). Index calculations (INT32) can overlap with floating-point computation (FP32), substantially raising throughput for code that mixes the two. However, both instruction types consume registers from the same 256 KB file — there is no separate integer register file.
// On Ampere, these can execute concurrently within the same warp:
int idx = threadIdx.x + blockIdx.x * blockDim.x; // INT32 pipe
float val = a * b + c; // FP32 pipe
// Both consume registers from the same 65536-register file
Register Usage for Different Data Types
// FP32: 1 register per value (32 bits)
float x; // 1 register
// FP64: 2 registers per value (64 bits, stored in register pair)
double y; // 2 registers (R0:R1)
// FP16/BF16: 1 register holds 2 values (packed)
half2 z; // 1 register (two FP16 values)
// INT8/FP8: 1 register holds 4 values
char4 w; // 1 register (four INT8 values)
A kernel using double instead of float doubles its register consumption. A GEMM kernel with an 8x8 thread tile needs 64 float accumulators (64 registers) but 128 registers for double accumulators — pushing occupancy below 25%. This is why FP64 kernels almost always have lower occupancy and require different tiling strategies than FP32 kernels.
Optimization Decision Tree
When profiling reveals register pressure as the occupancy limiter:
- Check for unnecessary live values. Are there variables that can be recomputed instead of stored? Recomputation uses ALU cycles (cheap) instead of registers (expensive).
- Check for excessive unrolling. Loop unrolling increases register pressure because all unrolled iterations’ values are live simultaneously. Reduce the unroll factor.
- Try __launch_bounds__. Set minBlocks to your target occupancy. If the compiler spills, check whether the spill cost is less than the occupancy gain.
- Profile with different register caps.
# Sweep register caps and measure performance
for REGS in 32 48 64 80 96 128; do
nvcc --maxrregcount=$REGS -o kernel_$REGS my_kernel.cu
ncu --metrics \
gpu__time_duration.sum,\
launch__registers_per_thread,\
sm__warps_active.avg.pct_of_peak_sustained_active,\
l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum \
./kernel_$REGS >> sweep_results.txt 2>&1
done
[Chart: Register Cap Sweep: GEMM 4096x4096 on H100 — y-axis: kernel time in ms (lower is better)]
- Consider algorithmic changes. If the kernel fundamentally needs more registers than the occupancy target allows, restructure the algorithm: smaller tiles, streaming computation, or multi-pass approaches.
Summary
The register file is the single most important resource to understand for GPU kernel optimization. Its 256 KB capacity per SM has remained constant across Volta, Ampere, and Hopper — while everything else (memory bandwidth, compute throughput, tensor core performance) has scaled dramatically. Every register consumed per thread directly reduces the number of concurrent warps, and the relationship is nonlinear due to granularity effects.
The optimization process: compile with -Xptxas -v to see register counts, use Nsight Compute’s Occupancy section to identify whether registers are the limiting factor, sweep register caps to find the performance-optimal point, and use __launch_bounds__ to communicate the target to the compiler. Avoid --maxrregcount as a global setting. Prefer register tiling over shared memory when possible — registers are faster and do not require synchronization.
The key insight: the optimal register count is almost never the minimum or the maximum. It is the point where the marginal cost of one more register (reduced occupancy, fewer warps for latency hiding) exactly equals the marginal benefit (one fewer spill, one less memory access). Finding that point requires measurement, not theory.