A kernel with 100% occupancy achieves 250 GB/s on an A100. Reduce occupancy to 50% and throughput jumps to 310 GB/s — the kernel is faster at lower occupancy. This counterintuitive result happens when register pressure is the bottleneck: reducing occupancy gives each thread more registers, eliminating register spills to local memory that cost 400+ cycles each. Occupancy is not a goal. It is a tuning knob. High occupancy hides memory latency through thread-level parallelism. Low occupancy provides more registers and shared memory per thread. The optimal point depends on whether your kernel is latency-bound or resource-bound.
All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise.
The Three Occupancy Limiters
An SM has a fixed pool of each resource. Every block assigned to the SM consumes a portion. Once any resource is exhausted, no more blocks can be assigned, even if the other resources have capacity remaining.
Resource Pools by Architecture
Per-SM Resource Limits by Architecture
| Architecture | Max Threads | Max Warps | Max Blocks | Registers | Max Shared Mem |
|---|---|---|---|---|---|
| Volta (SM 7.0) | 2048 | 64 | 32 | 65536 | 96 KB |
| Turing (SM 7.5) | 1024 | 32 | 16 | 65536 | 64 KB |
| Ampere (SM 8.0) | 2048 | 64 | 32 | 65536 | 164 KB |
| Ada (SM 8.9) | 1536 | 48 | 24 | 65536 | 100 KB |
| Hopper (SM 9.0) | 2048 | 64 | 32 | 65536 | 228 KB |
Limiter 1: Registers Per Thread
The register file on an A100 SM holds 65,536 32-bit registers. These are partitioned across all active threads. If your kernel uses 32 registers per thread and you want 2048 threads active: 32 × 2048 = 65,536 registers.
That exactly fills the register file, so you get 100% occupancy. If the kernel uses 40 registers per thread: 40 × 2048 = 81,920 registers, which exceeds the file, so fewer threads can be resident.
But register allocation happens in granules. On SM 8.0, registers are allocated in units of 256 registers per warp (8 registers per thread, since a warp is 32 threads), so the effective registers per thread rounds up to the next multiple of 8. At 40 registers per thread (already a multiple of 8), each warp consumes 40 × 32 = 1,280 registers, and 65,536 / 1,280 = 51 warps fit (rounding down). 51 warps out of a maximum 64 gives 79.7% occupancy.
// Check register usage at compile time
// nvcc -Xptxas -v my_kernel.cu
// Output: Used 32 registers, ... bytes smem, ... bytes cmem
// Or use --maxrregcount to cap registers:
// nvcc --maxrregcount=32 my_kernel.cu
// Per-kernel register cap using launch_bounds:
__global__ void __launch_bounds__(256, 4) // maxThreadsPerBlock, minBlocksPerMultiprocessor
my_kernel(float* data, int n) {
// With 256 threads/block and 4 blocks/SM:
// 256 * 4 = 1024 threads, need 1024 * regs <= 65536
// So compiler will use at most 64 registers per thread
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
data[idx] *= 2.0f;
}
}
When you constrain register usage with --maxrregcount or __launch_bounds__, the compiler may spill registers to local memory. Local memory is backed by L1/L2 cache and, on miss, global memory (DRAM). A kernel that spills 8 registers and gets 100% occupancy can be slower than one that uses all 64 registers at 50% occupancy. Always benchmark.
Limiter 2: Shared Memory Per Block
On an A100, the maximum shared memory per SM is 164 KB (configurable; the default carveout may be lower). Shared memory is allocated per block, not per thread. If a kernel uses 48 KB of shared memory per block, 164 KB / 48 KB = 3 blocks fit per SM (rounding down).
With 256 threads per block, that is 3 × 256 = 768 threads, or 24 warps. Occupancy = 24/64 = 37.5%.
// Configure shared memory carveout (Ampere+)
cudaFuncSetAttribute(my_kernel,
cudaFuncAttributeMaxDynamicSharedMemorySize,
163 * 1024); // Request the per-block maximum (163 KB on A100; 1 KB is reserved)
// Shared memory allocation granularity is 256 bytes on Ampere
// A kernel requesting 1 byte of shared memory actually reserves 256 bytes
__global__ void kernel_with_smem() {
__shared__ float tile[32][33]; // 32*33*4 = 4224 bytes -> rounds to 4352 (17 * 256)
// ...
}
// Dynamic shared memory:
__global__ void kernel_dynamic_smem(float* out) {
extern __shared__ float smem[];
// Size specified at launch: kernel<<<grid, block, smem_bytes>>>()
}
Limiter 3: Block Slots
Each SM has a maximum number of resident blocks (32 on A100). This becomes the limiter when blocks are very small. A block of 32 threads (1 warp) would need 64 blocks to reach max occupancy, but the SM only supports 32 blocks, capping you at 32 warps = 50% occupancy.
This is why block sizes below 128 threads are rarely optimal. At 64 threads (2 warps per block), 32 blocks × 2 warps = 64 warps fills the SM. At 32 threads, you lose half your capacity to the block slot limit.
Occupancy Calculation: Complete Walkthrough
Given a kernel with 40 registers per thread, 8 KB of shared memory per block, launched with 256 threads per block, on an A100:
Step 1: Register constraint. 40 registers per thread is 40 × 32 = 1,280 registers per warp; 65,536 / 1,280 = 51 warps (rounding down). Rounding down to whole blocks of 8 warps gives 6 blocks = 48 warps.
Step 2: Shared memory constraint. 164 KB / 8 KB = 20 blocks, or 160 warps (not limiting).
Step 3: Block slot constraint. 32 blocks × 8 warps = 256 warps (not limiting).
Step 4: Thread limit. The hardware cap is 2048 threads = 64 warps.
Result: min(48, 160, 256, 64) = 48 warps. Occupancy = 48/64 = 75%.
The register constraint is the active limiter. Reducing register usage from 40 to 32 would allow 65,536 / (32 × 32) = 64 warps: 8 blocks per SM and 100% occupancy.
#include <cuda_runtime.h>
#include <cstdio>
// Manual occupancy calculation
void calculate_occupancy(int regs_per_thread, int smem_per_block, int threads_per_block) {
// A100 constants
const int max_warps_per_sm = 64;
const int max_blocks_per_sm = 32;
const int total_regs = 65536;
const int total_smem = 164 * 1024; // After carveout config
const int reg_alloc_granularity = 8; // regs per thread, must be multiple of 8
int warps_per_block = (threads_per_block + 31) / 32;
// Register constraint
int effective_regs = ((regs_per_thread + reg_alloc_granularity - 1)
/ reg_alloc_granularity) * reg_alloc_granularity;
int regs_per_warp = effective_regs * 32;
int max_warps_reg = total_regs / regs_per_warp;
// Shared memory constraint
int smem_rounded = ((smem_per_block + 255) / 256) * 256;
int max_blocks_smem = (smem_rounded > 0) ? total_smem / smem_rounded : max_blocks_per_sm;
int max_warps_smem = max_blocks_smem * warps_per_block;
// Block slot constraint
int max_warps_slot = max_blocks_per_sm * warps_per_block;
// Final
int active_warps = max_warps_reg;
if (active_warps > max_warps_smem) active_warps = max_warps_smem;
if (active_warps > max_warps_slot) active_warps = max_warps_slot;
if (active_warps > max_warps_per_sm) active_warps = max_warps_per_sm;
// Must be a multiple of warps_per_block
int active_blocks = active_warps / warps_per_block;
active_warps = active_blocks * warps_per_block;
float occupancy = (float)active_warps / max_warps_per_sm * 100.0f;
printf("Regs/thread: %d (effective: %d)\n", regs_per_thread, effective_regs);
printf("Smem/block: %d (rounded: %d)\n", smem_per_block, smem_rounded);
printf("Threads/block: %d, Warps/block: %d\n", threads_per_block, warps_per_block);
printf("Max warps: reg=%d, smem=%d, slot=%d, hw=%d\n",
max_warps_reg, max_warps_smem, max_warps_slot, max_warps_per_sm);
printf("Active blocks: %d, Active warps: %d\n", active_blocks, active_warps);
printf("Occupancy: %.1f%%\n", occupancy);
}
The CUDA Occupancy API
CUDA provides runtime functions to compute occupancy without manual calculation:
#include <cuda_runtime.h>
#include <cstdio>
__global__ void example_kernel(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
// Some work that uses a moderate number of registers
float x = data[idx];
float y = x * x + 3.0f * x - 7.0f;
float z = sqrtf(y > 0 ? y : -y);
data[idx] = z;
}
}
void query_occupancy() {
int device;
cudaGetDevice(&device);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);
// Method 1: cudaOccupancyMaxActiveBlocksPerMultiprocessor
int block_size = 256;
int dynamic_smem = 0;
int max_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_blocks, example_kernel, block_size, dynamic_smem);
int max_warps = prop.maxThreadsPerMultiProcessor / 32;
int active_warps = max_blocks * (block_size / 32);
float occupancy = (float)active_warps / max_warps;
printf("Block size %d: %d blocks/SM, %.1f%% occupancy\n",
block_size, max_blocks, occupancy * 100.0f);
// Method 2: Let CUDA choose optimal block size
int min_grid_size, opt_block_size;
cudaOccupancyMaxPotentialBlockSize(
&min_grid_size, &opt_block_size, example_kernel, dynamic_smem, 0);
printf("Optimal block size: %d (min grid: %d)\n", opt_block_size, min_grid_size);
// Method 3: With dynamic shared memory that depends on block size
auto smem_config = [](int block_size) -> size_t {
return block_size * sizeof(float); // 1 float per thread
};
cudaOccupancyMaxPotentialBlockSizeVariableSMem(
&min_grid_size, &opt_block_size, example_kernel, smem_config, 0);
printf("Optimal block size (variable smem): %d\n", opt_block_size);
}
Sweeping Block Sizes
#include <cuda_runtime.h>
#include <cstdio>
__global__ void target_kernel(float* data, int n) {
extern __shared__ float smem[];
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
smem[threadIdx.x] = data[idx];
__syncthreads();
data[idx] = smem[threadIdx.x] * 2.0f;
}
}
void sweep_block_sizes() {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int max_warps = prop.maxThreadsPerMultiProcessor / 32;
printf("Block Size | Blocks/SM | Active Warps | Occupancy\n");
printf("-----------|-----------|--------------|----------\n");
for (int bs = 32; bs <= 1024; bs += 32) {
int dynamic_smem = bs * sizeof(float);
int max_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_blocks, target_kernel, bs, dynamic_smem);
int active = max_blocks * (bs / 32);
float occ = (float)active / max_warps * 100.0f;
printf("%10d | %9d | %12d | %7.1f%%\n", bs, max_blocks, active, occ);
}
}
Launch Bounds: Controlling the Compiler
The __launch_bounds__ qualifier tells the compiler the maximum threads per block and optionally the minimum blocks per SM. The compiler uses this to decide how many registers to allocate:
// Syntax: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// Example 1: Promise at most 256 threads/block, at least 4 blocks/SM
// Compiler target: 256 * 4 = 1024 threads, max 65536/1024 = 64 regs/thread
__global__ void __launch_bounds__(256, 4)
kernel_high_occupancy(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) data[idx] *= 2.0f;
}
// Example 2: 256 threads, at least 8 blocks/SM
// Compiler target: 256 * 8 = 2048 threads, max 65536/2048 = 32 regs/thread
__global__ void __launch_bounds__(256, 8)
kernel_max_occupancy(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) data[idx] *= 2.0f;
}
// Example 3: Large block, fewer blocks
// 1024 threads, 2 blocks -> 2048 threads, 32 regs/thread
__global__ void __launch_bounds__(1024, 2)
kernel_large_block(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) data[idx] *= 2.0f;
}
Increasing minBlocksPerMultiprocessor forces the compiler to use fewer registers, which increases occupancy but may cause register spilling. The second parameter is optional — omitting it gives the compiler freedom to use as many registers as it wants, constrained only by the first parameter.
Checking Actual Register Usage
# Compile with verbose PTX info
nvcc -Xptxas -v -arch=sm_80 kernel.cu -o kernel
# Output looks like:
# ptxas info: Function properties for _Z17kernel_high_occupancyPfi
# 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
# ptxas info: Used 16 registers, 0 bytes smem, 0 bytes cmem[0]
# Or use cuobjdump for compiled binaries:
cuobjdump --dump-resource-usage kernel
When Maximum Occupancy Does Not Help
The relationship between occupancy and performance is not monotonic. Higher occupancy provides more warps to hide latency, but at the cost of fewer resources per thread (registers, shared memory). The inflection point depends on whether the kernel is memory-bound, compute-bound, or latency-bound.
Memory-Bound Kernels
For kernels that saturate memory bandwidth at moderate occupancy, increasing occupancy provides no benefit:
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
// Simple memory copy kernel — purely bandwidth-bound
__global__ void copy_kernel(const float* __restrict__ src,
float* __restrict__ dst, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n; i += stride) {
dst[i] = src[i];
}
}
void benchmark_occupancy_vs_perf(int n) {
float *d_src, *d_dst;
cudaMalloc(&d_src, n * sizeof(float));
cudaMalloc(&d_dst, n * sizeof(float));
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
int num_sms;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
// Test different occupancy levels by varying grid size
int block_size = 256;
int warps_per_block = block_size / 32; // 8
struct OccTest { int blocks_per_sm; float expected_occ; };
OccTest tests[] = {
{1, 12.5f}, {2, 25.0f}, {4, 50.0f}, {8, 100.0f}
};
for (auto& t : tests) {
int grid_size = t.blocks_per_sm * num_sms;
float ms;
// Warmup
copy_kernel<<<grid_size, block_size>>>(d_src, d_dst, n);
cudaEventRecord(start);
for (int i = 0; i < 100; i++) {
copy_kernel<<<grid_size, block_size>>>(d_src, d_dst, n);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
float gb = 2.0f * n * sizeof(float) * 100 / 1e9;
float bw = gb / (ms / 1000.0f);
printf("Occupancy ~%5.1f%%: %.1f GB/s\n", t.expected_occ, bw);
}
cudaFree(d_src);
cudaFree(d_dst);
cudaEventDestroy(start);
cudaEventDestroy(stop);
}
int main() {
benchmark_occupancy_vs_perf(1 << 26); // 64M floats (256 MB)
return 0;
}
Copy Kernel: Occupancy vs Effective Bandwidth (A100, 64M floats)
| Occupancy | Active Warps/SM | Bandwidth (GB/s) | % of Peak |
|---|---|---|---|
| 12.5% | 8 | 1210 | 61% |
| 25% | 16 | 1640 | 83% |
| 50% | 32 | 1850 | 94% |
| 75% | 48 | 1870 | 95% |
| 100% | 64 | 1880 | 95% |
Figure: Copy Kernel Bandwidth (GB/s) vs Occupancy
Compute-Bound Kernels: More Registers Win
For compute-intensive kernels with high arithmetic intensity, using more registers (lower occupancy) can outperform fewer registers (higher occupancy):
// A compute-intensive kernel: many registers needed for intermediate values
__global__ void __launch_bounds__(256, 2) // Allow up to 128 regs/thread
compute_heavy_low_occ(const float* __restrict__ in, float* __restrict__ out, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= n) return;
float x = in[idx];
// Heavy computation with many intermediate values
float a = x * x;
float b = a * x + 3.0f;
float c = b * b - a;
float d = sqrtf(fabsf(c));
float e = sinf(d) + cosf(a);
float f = e * d + b * c;
float g = logf(fabsf(f) + 1.0f);
float h = expf(-g * 0.1f);
float result = h * a + f * g - d * e + c * b;
out[idx] = result;
}
__global__ void __launch_bounds__(256, 8) // Force max occupancy, fewer regs
compute_heavy_high_occ(const float* __restrict__ in, float* __restrict__ out, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= n) return;
float x = in[idx];
// Same computation — compiler may spill registers
float a = x * x;
float b = a * x + 3.0f;
float c = b * b - a;
float d = sqrtf(fabsf(c));
float e = sinf(d) + cosf(a);
float f = e * d + b * c;
float g = logf(fabsf(f) + 1.0f);
float h = expf(-g * 0.1f);
float result = h * a + f * g - d * e + c * b;
out[idx] = result;
}
Compute-Heavy Kernel: Occupancy vs Throughput
| Version | Regs/Thread | Occupancy | Time (ms) | Spill Loads |
|---|---|---|---|---|
| Low occupancy | 64 | 50% | 2.1 | 0 |
| High occupancy | 32 | 100% | 2.8 | 24 |
Shared Memory Carveout Configuration
On Ampere and later, shared memory and L1 cache share a unified on-chip memory pool. You can configure the split per kernel:
#include <cuda_runtime.h>
// The unified pool is 192 KB per SM on A100; up to 164 KB can be carved
// out for shared memory, with the remainder serving as L1.
// Static __shared__ arrays are capped at 48 KB per block; anything larger
// must use dynamic shared memory plus an opt-in attribute:
__global__ void kernel_needs_lots_of_smem(float* data) {
extern __shared__ float buffer[]; // size supplied at launch, e.g. 128 KB
// ...
}
void configure_and_launch(float* data) {
// Option 1: Raise the per-kernel dynamic shared memory limit
cudaFuncSetAttribute(
kernel_needs_lots_of_smem,
cudaFuncAttributeMaxDynamicSharedMemorySize,
163 * 1024); // 163 KB (per-block maximum on A100; 1 KB is reserved)
// Option 2: Set preferred shared memory carveout (percentage of the pool)
cudaFuncSetAttribute(
kernel_needs_lots_of_smem,
cudaFuncAttributePreferredSharedMemoryCarveout,
100); // Request 100% for shared memory (hardware picks closest supported)
// Launch with the dynamic shared memory size as the third config argument
kernel_needs_lots_of_smem<<<108, 256, 128 * 1024>>>(data);
}
The impact on occupancy is direct. If you configure 164 KB for shared memory and your kernel uses 100 KB per block, only 1 block fits per SM. With 256 threads per block, that is 8 warps = 12.5% occupancy.
// Occupancy analysis for large shared memory kernels
// (sizes above 48 KB require the MaxDynamicSharedMemorySize attribute set first)
void analyze_large_smem_kernel() {
int block_size = 256;
size_t smem_sizes[] = {16*1024, 32*1024, 48*1024, 64*1024, 96*1024, 128*1024};
for (auto smem : smem_sizes) {
int max_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_blocks, kernel_needs_lots_of_smem, block_size, smem);
int active_warps = max_blocks * (block_size / 32);
float occ = active_warps / 64.0f * 100.0f;
printf("Smem %3zu KB: %d blocks/SM, %d warps, %.1f%% occupancy\n",
smem / 1024, max_blocks, active_warps, occ);
}
}
Occupancy vs Shared Memory Per Block (A100, 256 threads/block)
| Shared Mem/Block | Max Blocks/SM | Active Warps | Occupancy |
|---|---|---|---|
| 16 KB | 10 | 64 (capped) | 100% |
| 32 KB | 5 | 40 | 62.5% |
| 48 KB | 3 | 24 | 37.5% |
| 64 KB | 2 | 16 | 25% |
| 96 KB | 1 | 8 | 12.5% |
| 128 KB | 1 | 8 | 12.5% |
Block Size Selection Strategies
Strategy 1: Maximize Occupancy
Use cudaOccupancyMaxPotentialBlockSize to find the block size that maximizes occupancy for a given kernel:
template <typename KernelFunc>
int find_best_block_size(KernelFunc kernel, size_t dynamic_smem = 0) {
int min_grid, block_size;
cudaOccupancyMaxPotentialBlockSize(&min_grid, &block_size, kernel, dynamic_smem, 0);
return block_size;
}
Strategy 2: Empirical Sweep
Occupancy is a proxy for performance, not performance itself. Sweep block sizes and measure:
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>
template <typename Func>
void sweep_and_benchmark(Func kernel, float* d_in, float* d_out, int n,
const char* name) {
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
printf("\n=== %s ===\n", name);
printf("Block Size | Blocks/SM | Occupancy | Time (us)\n");
int best_bs = 0;
float best_time = 1e9;
for (int bs = 32; bs <= 1024; bs *= 2) {
int max_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, kernel, bs, 0);
int active_warps = max_blocks * (bs / 32);
float occ = active_warps / 64.0f * 100.0f;
int grid = (n + bs - 1) / bs;
// Warmup
kernel<<<grid, bs>>>(d_in, d_out, n);
cudaDeviceSynchronize();
// Benchmark
cudaEventRecord(start);
for (int i = 0; i < 100; i++) {
kernel<<<grid, bs>>>(d_in, d_out, n);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
float us = ms * 10.0f; // Per-invocation microseconds
printf("%10d | %9d | %8.1f%% | %8.1f\n", bs, max_blocks, occ, us);
if (us < best_time) {
best_time = us;
best_bs = bs;
}
}
printf("Best: block size %d at %.1f us\n", best_bs, best_time);
cudaEventDestroy(start);
cudaEventDestroy(stop);
}
Strategy 3: Tile-Driven Block Size
For tiled algorithms (matrix multiply, stencil), the block size is determined by the tile dimensions, and shared memory usage is determined by the tile size. Occupancy is a consequence, not a goal:
// GEMM tile: 128x128 block tile, 8x8 thread tile
// Block: (128/8) x (128/8) = 16x16 = 256 threads
// Smem: 2 tiles of 128x16 floats = 2 * 128 * 16 * 4 = 16 KB
// Registers: ~64 per thread for 8x8 accumulator + operands
// Occupancy: 65536 / (64 * 32) = 32 warps = 50%
// Smaller tile: 64x64 block tile, 4x4 thread tile
// Block: (64/4) x (64/4) = 16x16 = 256 threads
// Smem: 2 * 64 * 16 * 4 = 8 KB
// Registers: ~32 per thread
// Occupancy: 65536 / (32 * 32) = 64 warps = 100%
// The 128x128 tile has better cache reuse and instruction-level parallelism
// despite lower occupancy, so it typically wins for large matrices.
Occupancy and Warp Scheduling Interactions
The SM has 4 warp schedulers on Ampere. Each scheduler needs ready warps to issue instructions. The minimum useful occupancy depends on memory latency:
For global memory with ~400 cycles of latency and 4 schedulers each issuing up to 1 instruction per cycle, Little's law gives the naive requirement: concurrency = latency × throughput = 400 × 4 = 1,600 instructions in flight per SM.
In practice, instruction-level parallelism within each warp (multiple outstanding loads issued before the first result is needed) reduces this dramatically: approximately 12-16 active warps typically suffice to hide global memory latency on an A100. This is 18.75%-25% occupancy.
For compute-bound kernels, the minimum is lower — about 4-8 warps (one or two per scheduler) if there are no memory stalls.
// Demonstrate: instrument a kernel to count active warps
__global__ void warp_count_kernel(float* data, int n, int* warp_count) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
// Count active warps on this SM using clock and atomics
// (This is a diagnostic tool, not production code)
unsigned int smid;
asm("mov.u32 %0, %%smid;" : "=r"(smid));
// One thread per warp reports
if (threadIdx.x % 32 == 0) {
atomicAdd(&warp_count[smid], 1);
}
if (idx < n) {
data[idx] *= 2.0f;
}
}
Multi-Dimensional Block Configurations
For 2D and 3D kernels, the total thread count is what matters for occupancy, but the dimension layout affects memory access patterns:
#include <cuda_runtime.h>
#include <cstdio>
// All of these have 256 threads total, same occupancy:
// dim3(256, 1, 1), dim3(128, 2, 1), dim3(64, 4, 1), dim3(32, 8, 1), dim3(16, 16, 1)
// But memory access differs. For row-major matrix access:
// A[row][col], where row = blockIdx.y * blockDim.y + threadIdx.y
// col = blockIdx.x * blockDim.x + threadIdx.x
// Warp 0: threads 0-31, which maps to threadIdx.x = 0-15 with threadIdx.y = 0-1
// (for dim3(16, 16, 1))
// For coalesced column access, need threadIdx.x to vary fastest within a warp
// dim3(32, 8) guarantees 32 consecutive threadIdx.x values per warp
__global__ void matrix_kernel_2d(const float* __restrict__ A,
float* __restrict__ B,
int rows, int cols) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row < rows && col < cols) {
// A[row * cols + col]: consecutive threads (same row, adjacent cols) -> coalesced
B[row * cols + col] = A[row * cols + col] * 2.0f;
}
}
void launch_2d() {
int rows = 4096, cols = 4096;
// dim3(32, 8): 32 threads in x -> full warp reads consecutive columns -> coalesced
dim3 block(32, 8); // 256 threads total
dim3 grid((cols + 31) / 32, (rows + 7) / 8);
// matrix_kernel_2d<<<grid, block>>>(...);
}
CUDA maps threads to warps in row-major order: warpId = (threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x) / 32. For a dim3(16, 16) block, warp 0 contains threads with threadIdx.y=0, threadIdx.x=0-15 and threadIdx.y=1, threadIdx.x=0-15. This means adjacent threads in a warp span 2 rows, which can hurt access patterns.
Nsight Compute Occupancy Analysis
Nsight Compute reports achieved occupancy (the actual average, which may be lower than theoretical due to tail effects, partial waves, etc.) alongside theoretical occupancy:
# Profile occupancy metrics
ncu --metrics \
sm__warps_active.avg.pct_of_peak_sustained_active,\
sm__warps_active.avg.per_cycle_active,\
launch__occupancy_limit_registers,\
launch__occupancy_limit_shared_mem,\
launch__occupancy_limit_blocks,\
launch__occupancy_limit_warps,\
launch__registers_per_thread,\
launch__shared_mem_per_block_dynamic,\
launch__shared_mem_per_block_static,\
launch__block_size \
./my_kernel
# Key metrics to look for:
# sm__warps_active.avg.pct_of_peak_sustained_active = achieved occupancy %
# launch__occupancy_limit_registers = theoretical max due to registers
# launch__occupancy_limit_shared_mem = theoretical max due to smem
# launch__occupancy_limit_blocks = theoretical max due to block slots
# launch__occupancy_limit_warps = final theoretical occupancy
Interpreting the Occupancy Section
Section: Occupancy
Theoretical Occupancy 75.0% 48/64 warps
Achieved Occupancy 68.3% 43.7/64 warps
Achieved Active Warps 43.7
Block Limit SM 32 Not limiting
Block Limit Registers 6 LIMITING (6 blocks * 8 warps = 48)
Block Limit Shared Mem 10 Not limiting
Block Limit Warps 8 Not limiting
Registers Per Thread 40 Effective: 40
Static Shared Memory 0 B
Dynamic Shared Memory 8192 B
Block Size 256
In this output, registers are the limiter. The achieved occupancy (68.3%) is lower than theoretical (75%) because the last wave of blocks does not fully populate all SMs.
Tail Effects and Wave Quantization
When the total number of blocks is not a multiple of (num_SMs * blocks_per_SM), the last wave has fewer blocks than the SM capacity. This reduces achieved occupancy:
#include <cuda_runtime.h>
#include <cstdio>
void analyze_tail_effect(int n, int block_size) {
int num_sms = 108; // A100
int blocks_per_sm = 8; // Assuming 100% occupancy with this block size
int full_wave = num_sms * blocks_per_sm; // 864 blocks per wave
int total_blocks = (n + block_size - 1) / block_size;
int full_waves = total_blocks / full_wave;
int tail_blocks = total_blocks % full_wave;
float tail_utilization = (float)tail_blocks / full_wave * 100.0f;
printf("Total blocks: %d\n", total_blocks);
printf("Full waves: %d (%d blocks each)\n", full_waves, full_wave);
printf("Tail wave: %d blocks (%.1f%% utilization)\n", tail_blocks, tail_utilization);
// For short kernels (1-2 waves), tail effect dominates
if (full_waves <= 2) {
printf("WARNING: Tail effect is significant. Consider adjusting problem size.\n");
}
}
For large problems with hundreds of waves, the tail effect is negligible. For small problems launched frequently (e.g., per-token operations in LLM inference with batch size 1), the tail wave can be a significant fraction of total execution time. In these cases, choosing block sizes that minimize tail waste matters more than maximizing theoretical occupancy.
Putting It Together: Occupancy Tuning Workflow
#include <cuda_runtime.h>
#include <cstdio>
// Step 1: Write kernel with __launch_bounds__ based on initial estimate
__global__ void __launch_bounds__(256, 4)
production_kernel(const float* __restrict__ input,
float* __restrict__ output,
int n) {
extern __shared__ float smem[];
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= n) return;
// Load to shared memory
smem[threadIdx.x] = input[idx];
__syncthreads();
// Compute with register-heavy operations
float val = smem[threadIdx.x];
float acc = 0.0f;
for (int i = 0; i < 16; i++) {
int neighbor = (threadIdx.x + i) % blockDim.x;
acc += smem[neighbor] * (float)(i + 1);
}
output[idx] = acc + val;
}
// Step 2: Query occupancy at compile-time register usage
void tune_kernel() {
int block_sizes[] = {64, 128, 256, 512, 1024};
int best_bs = 0;
int best_occ = 0;
for (int bs : block_sizes) {
size_t smem = bs * sizeof(float);
int max_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_blocks, production_kernel, bs, smem);
int warps = max_blocks * (bs / 32);
if (warps > best_occ) {
best_occ = warps;
best_bs = bs;
}
printf("BS=%4d: %2d blocks/SM, %2d warps, %.1f%% occ\n",
bs, max_blocks, warps, warps / 64.0f * 100.0f);
}
// Step 3: Benchmark top candidates
int n = 1 << 24;
float *d_in, *d_out;
cudaMalloc(&d_in, n * sizeof(float));
cudaMalloc(&d_out, n * sizeof(float));
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
for (int bs : block_sizes) {
int grid = (n + bs - 1) / bs;
size_t smem = bs * sizeof(float);
production_kernel<<<grid, bs, smem>>>(d_in, d_out, n);
cudaDeviceSynchronize();
cudaEventRecord(start);
for (int i = 0; i < 100; i++) {
production_kernel<<<grid, bs, smem>>>(d_in, d_out, n);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("BS=%4d: %.2f ms/launch\n", bs, ms / 100.0f);
}
cudaFree(d_in);
cudaFree(d_out);
}
Decision Framework
- Start with 256 threads/block — it is a safe default that works well across most kernels
- Profile with Nsight Compute — check which resource limits occupancy
- If register-limited: try __launch_bounds__ with different minBlocksPerMultiprocessor values; check for spills
- If shared-memory-limited: reduce shared memory usage, or accept lower occupancy if the shared memory provides sufficient data reuse
- If block-slot-limited: increase threads per block
- Always benchmark — occupancy is a useful diagnostic, not a performance guarantee
Figure: Occupancy Tuning Summary (Relative Kernel Performance, GFLOPS)
The best configuration is often not the one with the highest occupancy. Occupancy is a necessary but not sufficient condition for performance. The right approach is to understand what limits your kernel, use occupancy analysis to diagnose resource bottlenecks, and validate with actual measurements.
Summary
Occupancy is determined by three resources: registers per thread, shared memory per block, and block slots per SM. The CUDA Occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor, cudaOccupancyMaxPotentialBlockSize) automates the calculation. __launch_bounds__ controls register allocation at compile time. Nsight Compute reports both theoretical and achieved occupancy, with the achieved value accounting for tail effects. The key engineering insight: occupancy is a diagnostic tool, not a performance target. Memory-bound kernels saturate bandwidth at 25-50% occupancy. Compute-bound kernels may run faster at lower occupancy if it avoids register spilling. Always measure.