A kernel running at 40% occupancy just doubled in performance after you reduced register usage to hit 75% occupancy. You try the same trick on a different kernel — same 40% to 75% occupancy boost — and performance stays flat. Why? The first kernel was latency-bound: extra warps hid memory access latency and kept compute units fed. The second kernel was memory-bandwidth-bound: more warps just meant more contention for the same 2 TB/s HBM pipe. Occupancy only matters when you are latency-bound, and blindly maximizing it often makes things worse by increasing register spills or shared memory bank conflicts.
Occupancy: The Hardware Reality
Occupancy quantifies hardware utilization as the ratio of active warps to the maximum warps an SM can support. On modern NVIDIA architectures (Ampere/Ada), each SM supports up to 64 warps (2048 threads). But achieving that maximum requires your kernel to be conservative with three competing resources.
Occupancy Limiting Factors in NVIDIA GPUs
| Resource | Constraint | Symptom When Limiting |
|---|---|---|
| Registers per thread | 65,536 regs / (regs_per_thread × threads_per_block) | High register usage limits concurrent blocks |
| Shared memory per block | ~100KB / smem_per_block | Large tiles restrict block count |
| Threads per block | 2048 max threads per SM / block_dim | Very large blocks reduce concurrency |
The key insight is that these resources are allocated at block granularity. If your kernel uses 48 registers per thread with 256-thread blocks, each block consumes 48 x 256 = 12,288 registers. With 65,536 registers per SM, you can fit at most 5 blocks — giving you 5 x 8 = 40 warps out of 64 maximum (62.5% occupancy). Adding shared memory usage further reduces this.
[Figure: SM Resource Competition Model. How different resources compete for SM allocation and affect achievable occupancy.]
Latency Hiding: The Mathematical Foundation
The real question isn’t “what’s my occupancy?” but “do I have enough active warps to hide memory latency?” GPUs achieve throughput by switching between warps when one stalls. The relationship is straightforward:
W_needed ~ L / C
Where L is the latency cycles to hide (e.g., ~400 cycles for a global memory access on Ampere) and C is the average cycles of useful compute work between stalls per warp.
For a latency-bound kernel where each warp does roughly 20 cycles of arithmetic between memory requests, you need W_needed ~ 400 / 20 = 20 warps. That’s only 31% occupancy — far below 100%. Once you have enough warps to keep the SM busy during stalls, adding more warps doesn’t help.
[Figure: Latency Hiding: Occupancy vs SM Utilization. Cycles stalled on memory (%) and SM active utilization (%) at 25%, 50%, 75%, and 100% occupancy.]
The data tells a clear story: going from 25% to 50% occupancy nearly halves stall time. But 75% to 100%? Almost no difference — the SM is already saturated with enough work.
Profiler Counter Analysis
Modern NVIDIA profilers (Nsight Compute) provide the metrics you need to make evidence-based occupancy decisions. Don’t guess — measure.
[Figure: Nsight Compute Key Metrics for Occupancy Analysis. The four profiler counters that tell you whether chasing higher occupancy will actually help performance.]
How to Read the Counters
The decision framework is simple. If stalled_long_scoreboard is high and sm__throughput is low, you’re leaving performance on the table — more occupancy would help. If throughput is already near peak, occupancy won’t move the needle regardless of what other counters say.
Many production CUDA kernels (CUTLASS GEMM tiles, FlashAttention, cuDNN convolutions) deliberately operate at 40-60% occupancy. They trade occupancy for larger register allocations and bigger shared memory tiles, which increases arithmetic intensity and reduces memory traffic — a net win.
Interactive: Try It Yourself
Adjust the sliders to see how block size, shared memory, and register usage interact to determine occupancy. Watch how the bottleneck shifts between resources:
[Interactive widget: CUDA Occupancy Calculator. Sliders set the block size (must be a multiple of 32, the warp size), shared memory per thread block, and registers per thread; the occupancy analysis updates live.]
Block Size Optimization: A Practical Example
Consider a simple element-wise kernel where you’re sweeping block sizes:
// Template-based block size for compile-time optimization
template<int BLOCK_SIZE>
__global__ void elementwise_fma(float *x, float *y, int n) {
int idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
if (idx < n) {
float v = x[idx];
y[idx] = fmaf(v, 2.0f, 1.0f); // fused multiply-add
}
}
Profile each configuration with Nsight Compute and record what matters:
Block Size Sweep Results (V100, n=16M elements)
| Block Size | Occupancy | BW (GB/s) | SM Active % | Verdict |
|---|---|---|---|---|
| 64 | 35% | 420 | 40% | Under-occupied: high memory stalls |
| 128 | 55% | 780 | 72% | Good balance: stalls dropping fast |
| 256 | 72% | 810 | 74% | Diminishing returns begin |
| 512 | 72% | 805 | 73% | No improvement -- same effective occupancy |
The jump from 128 to 256 yields only ~4% more bandwidth despite 17 points more occupancy. At 512, we hit the same effective occupancy and gain nothing. The right answer here is 256 — not because it maximizes occupancy, but because it’s the point where additional occupancy stops translating to additional throughput.
Decision Framework: When to Chase Occupancy
[Figure: Occupancy Optimization Decision Matrix. A structured approach to deciding whether occupancy improvements are worth pursuing.]
The most common mistake is sacrificing tile size (and thus arithmetic intensity) to squeeze out more occupancy. If your kernel is memory-bound, the correct move is often to reduce occupancy by using more registers and shared memory per block, which lets you do more useful computation per memory access.
40-60% occupancy is usually sufficient when memory stalls are manageable. Prioritize coalesced memory access and good tiling geometry over raw occupancy numbers. Use the CUDA Occupancy Calculator (or the interactive widget above) to avoid pathologically low configurations (below 20%), but don’t optimize occupancy as a goal in itself.
Conclusion
Occupancy is a means to performance, not the goal. The optimal approach: measure your kernel’s actual stall reasons with Nsight Compute, determine whether those stalls would be hidden by more warps, and stop tuning occupancy once the SM is saturated. The cycles you save by not over-optimizing occupancy are better spent on memory access patterns, instruction mix, and algorithmic improvements.