Part 11 of 32 in the series: CUDA Kernel Engineering

1. CUDA Thread Hierarchy: Grids, Blocks, Warps, and the Execution Model That Determines Performance
2. Memory Coalescing: Why Access Patterns Determine 10x Performance Differences
3. Shared Memory and Bank Conflicts: 32 Banks, 4-Byte Width, and the Padding Trick
4. Warp Primitives: Shuffle, Vote, Match, and Cooperative Reduction Without Shared Memory
5. Tensor Cores: WMMA, MMA, and WGMMA — Matrix Multiply at Hardware Speed
6. Triton Kernel Development: Writing GPU Kernels in Python with Auto-Tuning
7. Kernel Fusion Patterns: Elementwise, Reduction, GEMM Epilogue, and Attention Fusion
8. Nsight Compute and Nsight Systems: The Complete GPU Profiling Workflow
9. CUDA Graphs: Capture, Replay, Memory Management, and Dynamic Shape Handling
10. Atomics and Advanced Reductions: Global Atomics, Warp Reductions, and Multi-Block Coordination
11. Occupancy Calculator: Registers, Shared Memory, Block Size, and Finding the Sweet Spot
12. Vectorized Loads: float4, int4, and 128-Bit Memory Transactions for Maximum Bandwidth
13. Cooperative Groups: Sub-Warp Tiles, Block Synchronization, and Grid-Level Cooperation
14. Dynamic Parallelism: Launching Kernels from Kernels and When It Actually Helps
15. CUDA Streams and Events: Concurrent Execution, Overlap, and Synchronization Patterns
16. Reduction Patterns: Sum, Max, Histogram — From Naive to Warp-Optimized
17. Parallel Scan and Prefix Sum: Blelloch Algorithm, Work-Efficient Implementation
18. Matrix Transpose: The Canonical CUDA Optimization Problem — From Naive to Bank-Conflict-Free
19. Writing a Custom Attention Kernel: From Naive to Tiled to FlashAttention-Style
20. Debugging CUDA: compute-sanitizer, cuda-gdb, Common Errors, and Race Condition Detection
21. CUTLASS GEMM Templates: Writing High-Performance Matrix Multiply with NVIDIA's Template Library
22. Persistent Kernels: Long-Running Thread Blocks for Continuous Inference Processing
23. Memory Access Pattern Analysis: From Roofline Model to Kernel Optimization Strategy
24. CUDA Graphs for LLM Inference: Eliminating Kernel Launch Overhead from First Principles
25. CUDA Kernel Fusion: Reducing Memory Traffic for Elementwise-Heavy Workloads
26. CUDA Kernel Optimization: A Systematic Guide from Roofline to Nsight
27. CUDA Streams: Overlapping PCIe Transfers with Compute (and When It Actually Helps)
28. CUDA Unified Memory: When It Helps, When It Hurts, and Grace Hopper
29. CUDA Warp Mastery: Scheduling, Divergence, Shuffles, Occupancy, and Profiling
30. eBPF for LLM Inference Profiling: Kernel-Level Observability
31. GPU Memory Profiling: Finding Leaks, Fragmentation, and Hidden Overhead
32. The Roofline Model for GPU Kernel Optimization: From First Principles to LLM Workload Analysis

A kernel with 100% occupancy achieves 250 GB/s on an A100. Reduce occupancy to 50% and throughput jumps to 310 GB/s — the kernel is faster at lower occupancy. This counterintuitive result happens when register pressure is the bottleneck: reducing occupancy gives each thread more registers, eliminating register spills to local memory that cost 400+ cycles each. Occupancy is not a goal. It is a tuning knob. High occupancy hides memory latency through thread-level parallelism. Low occupancy provides more registers and shared memory per thread. The optimal point depends on whether your kernel is latency-bound or resource-bound.

All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise.

The Three Occupancy Limiters

An SM has a fixed pool of each resource. Every block assigned to the SM consumes a portion. Once any resource is exhausted, no more blocks can be assigned, even if the other resources have capacity remaining.

Resource Pools by Architecture

📊 Per-SM Resource Limits by Architecture

Architecture    | Max Threads | Max Warps | Max Blocks | Registers | Max Shared Mem
----------------|-------------|-----------|------------|-----------|---------------
Volta (SM 7.0)  | 2048        | 64        | 32         | 65536     | 96 KB
Turing (SM 7.5) | 1024        | 32        | 16         | 65536     | 64 KB
Ampere (SM 8.0) | 2048        | 64        | 32         | 65536     | 164 KB
Ada (SM 8.9)    | 1536        | 48        | 24         | 65536     | 100 KB
Hopper (SM 9.0) | 2048        | 64        | 32         | 65536     | 228 KB
Note: These are the maximum values. Actual limits depend on the carveout configuration for shared memory vs L1 cache.

Limiter 1: Registers Per Thread

The register file on an A100 SM is 65,536 32-bit registers. These are partitioned across all active threads. If your kernel uses 32 registers per thread and you want 2048 threads active:

\text{Registers needed} = 2048 \times 32 = 65536

That exactly fills the register file, so you get 100% occupancy. If the kernel uses 40 registers per thread:

\text{Max threads} = \lfloor 65536 / 40 \rfloor = 1638

But register allocation happens in granules. On SM 8.0, registers are allocated in units of 256 registers per warp (8 registers per thread, since a warp is 32 threads). So the effective registers per thread rounds up to the next multiple of 8:

\text{Effective registers/thread} = \lceil 40 / 8 \rceil \times 8 = 40 \text{ (already aligned)}

\text{Max warps} = \lfloor 65536 / (40 \times 32) \rfloor = \lfloor 65536 / 1280 \rfloor = 51

51 warps out of a maximum of 64 caps register-limited occupancy at 79.7%. The final figure is slightly lower once warps are rounded down to whole resident blocks, as the full walkthrough below shows.

// Check register usage at compile time
// nvcc -Xptxas -v my_kernel.cu
// Output: Used 32 registers, ... bytes smem, ... bytes cmem

// Or use --maxrregcount to cap registers:
// nvcc --maxrregcount=32 my_kernel.cu

// Per-kernel register cap using launch_bounds:
__global__ void __launch_bounds__(256, 4)  // maxThreadsPerBlock, minBlocksPerMultiprocessor
my_kernel(float* data, int n) {
    // With 256 threads/block and 4 blocks/SM:
    // 256 * 4 = 1024 threads, need 1024 * regs <= 65536
    // So compiler will use at most 64 registers per thread
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] *= 2.0f;
    }
}
⚠️ Register Spilling

When you constrain register usage with --maxrregcount or __launch_bounds__, the compiler may spill registers to local memory. Local memory is backed by L1/L2 cache and, on miss, global memory (DRAM). A kernel that spills 8 registers and gets 100% occupancy can be slower than one that uses all 64 registers at 50% occupancy. Always benchmark.

Limiter 2: Shared Memory Per Block

On an A100, the maximum shared memory per SM is 164 KB (configurable; default may be lower). Shared memory is allocated per block, not per thread. If a kernel uses 48 KB of shared memory per block:

\text{Max blocks per SM} = \lfloor 164 / 48 \rfloor = 3

With 256 threads per block, that is 3 \times 256 = 768 threads, or 768 / 32 = 24 warps. Occupancy = 24 / 64 = 37.5\%.

// Opt in to large dynamic shared memory per block (Ampere+)
cudaFuncSetAttribute(my_kernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize,
    163 * 1024);  // Per-block max on SM 8.0 is 163 KB (1 KB is reserved)

// Shared memory allocation granularity is 256 bytes on Ampere
// A kernel requesting 1 byte of shared memory actually reserves 256 bytes
__global__ void kernel_with_smem() {
    __shared__ float tile[32][33];  // 32*33*4 = 4224 bytes -> rounds to 4352 (17 * 256)
    // ...
}

// Dynamic shared memory:
__global__ void kernel_dynamic_smem(float* out) {
    extern __shared__ float smem[];
    // Size specified at launch: kernel<<<grid, block, smem_bytes>>>()
}

Limiter 3: Block Slots

Each SM has a maximum number of resident blocks (32 on A100). This becomes the limiter when blocks are very small. A block of 32 threads (1 warp) would need 64 blocks to reach max occupancy, but the SM only supports 32 blocks, capping you at 32 warps = 50% occupancy.

\text{Warps per block} = \lceil \text{threads per block} / 32 \rceil
\text{Max blocks (slot limit)} = 32
\text{Max warps (slot limit)} = 32 \times \text{warps per block}

This is why block sizes below 128 are rarely optimal. At 64 threads (2 warps per block), 32 \times 2 = 64 warps fills the SM. At 32 threads, you lose half your capacity to the block slot limit.

Occupancy Calculation: Complete Walkthrough

Given a kernel with 40 registers per thread, 8 KB of shared memory per block, launched with 256 threads per block, on an A100:

Step 1: Register constraint

\text{Effective regs/thread} = \lceil 40 / 8 \rceil \times 8 = 40
\text{Regs per warp} = 40 \times 32 = 1280
\text{Max warps (register)} = \lfloor 65536 / 1280 \rfloor = 51

Step 2: Shared memory constraint

\text{Smem per block (rounded)} = \lceil 8192 / 256 \rceil \times 256 = 8192 \text{ (already aligned)}
\text{Max blocks (smem)} = \lfloor 167936 / 8192 \rfloor = 20
\text{Max warps (smem)} = 20 \times 8 = 160

Step 3: Block slot constraint

\text{Warps per block} = 256 / 32 = 8
\text{Max blocks (slot)} = 32
\text{Max warps (slot)} = 32 \times 8 = 256

Step 4: Thread limit

\text{Max warps (thread limit)} = 64

Result: \min(51, 160, 256, 64) = 51 warps. Warps are resident only in whole blocks, so round down: \lfloor 51 / 8 \rfloor = 6 blocks = 48 warps. Occupancy = 48 / 64 = 75\%.

The register constraint is the active limiter. Reducing register usage from 40 to 32 would allow:

\lfloor 65536 / (32 \times 32) \rfloor = 64 \text{ warps} = 100\%

#include <cuda_runtime.h>
#include <cstdio>

// Manual occupancy calculation
void calculate_occupancy(int regs_per_thread, int smem_per_block, int threads_per_block) {
    // A100 constants
    const int max_warps_per_sm = 64;
    const int max_blocks_per_sm = 32;
    const int total_regs = 65536;
    const int total_smem = 164 * 1024;  // After carveout config
    const int reg_alloc_granularity = 8;  // regs per thread, must be multiple of 8

    int warps_per_block = (threads_per_block + 31) / 32;

    // Register constraint
    int effective_regs = ((regs_per_thread + reg_alloc_granularity - 1)
                          / reg_alloc_granularity) * reg_alloc_granularity;
    int regs_per_warp = effective_regs * 32;
    int max_warps_reg = total_regs / regs_per_warp;

    // Shared memory constraint
    int smem_rounded = ((smem_per_block + 255) / 256) * 256;
    int max_blocks_smem = (smem_rounded > 0) ? total_smem / smem_rounded : max_blocks_per_sm;
    int max_warps_smem = max_blocks_smem * warps_per_block;

    // Block slot constraint
    int max_warps_slot = max_blocks_per_sm * warps_per_block;

    // Final
    int active_warps = max_warps_reg;
    if (active_warps > max_warps_smem) active_warps = max_warps_smem;
    if (active_warps > max_warps_slot) active_warps = max_warps_slot;
    if (active_warps > max_warps_per_sm) active_warps = max_warps_per_sm;

    // Must be a multiple of warps_per_block
    int active_blocks = active_warps / warps_per_block;
    active_warps = active_blocks * warps_per_block;

    float occupancy = (float)active_warps / max_warps_per_sm * 100.0f;

    printf("Regs/thread: %d (effective: %d)\n", regs_per_thread, effective_regs);
    printf("Smem/block: %d (rounded: %d)\n", smem_per_block, smem_rounded);
    printf("Threads/block: %d, Warps/block: %d\n", threads_per_block, warps_per_block);
    printf("Max warps: reg=%d, smem=%d, slot=%d, hw=%d\n",
           max_warps_reg, max_warps_smem, max_warps_slot, max_warps_per_sm);
    printf("Active blocks: %d, Active warps: %d\n", active_blocks, active_warps);
    printf("Occupancy: %.1f%%\n", occupancy);
}

The CUDA Occupancy API

CUDA provides runtime functions to compute occupancy without manual calculation:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void example_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        // Some work that uses a moderate number of registers
        float x = data[idx];
        float y = x * x + 3.0f * x - 7.0f;
        float z = sqrtf(y > 0 ? y : -y);
        data[idx] = z;
    }
}

void query_occupancy() {
    int device;
    cudaGetDevice(&device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Method 1: cudaOccupancyMaxActiveBlocksPerMultiprocessor
    int block_size = 256;
    int dynamic_smem = 0;
    int max_blocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks, example_kernel, block_size, dynamic_smem);

    int max_warps = prop.maxThreadsPerMultiProcessor / 32;
    int active_warps = max_blocks * (block_size / 32);
    float occupancy = (float)active_warps / max_warps;

    printf("Block size %d: %d blocks/SM, %.1f%% occupancy\n",
           block_size, max_blocks, occupancy * 100.0f);

    // Method 2: Let CUDA choose optimal block size
    int min_grid_size, opt_block_size;
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &opt_block_size, example_kernel, dynamic_smem, 0);
    printf("Optimal block size: %d (min grid: %d)\n", opt_block_size, min_grid_size);

    // Method 3: With dynamic shared memory that depends on block size
    auto smem_config = [](int block_size) -> size_t {
        return block_size * sizeof(float);  // 1 float per thread
    };
    cudaOccupancyMaxPotentialBlockSizeVariableSMem(
        &min_grid_size, &opt_block_size, example_kernel, smem_config, 0);
    printf("Optimal block size (variable smem): %d\n", opt_block_size);
}

Sweeping Block Sizes

#include <cuda_runtime.h>
#include <cstdio>

__global__ void target_kernel(float* data, int n) {
    extern __shared__ float smem[];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Guard the loads, not the barrier: every thread must reach __syncthreads()
    smem[threadIdx.x] = (idx < n) ? data[idx] : 0.0f;
    __syncthreads();
    if (idx < n) {
        data[idx] = smem[threadIdx.x] * 2.0f;
    }
}

void sweep_block_sizes() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int max_warps = prop.maxThreadsPerMultiProcessor / 32;

    printf("Block Size | Blocks/SM | Active Warps | Occupancy\n");
    printf("-----------|-----------|--------------|----------\n");

    for (int bs = 32; bs <= 1024; bs += 32) {
        int dynamic_smem = bs * sizeof(float);
        int max_blocks;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &max_blocks, target_kernel, bs, dynamic_smem);

        int active = max_blocks * (bs / 32);
        float occ = (float)active / max_warps * 100.0f;
        printf("%10d | %9d | %12d | %7.1f%%\n", bs, max_blocks, active, occ);
    }
}

Launch Bounds: Controlling the Compiler

The __launch_bounds__ qualifier tells the compiler the maximum threads per block and optionally the minimum blocks per SM. The compiler uses this to decide how many registers to allocate:

// Syntax: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)

// Example 1: Promise at most 256 threads/block, at least 4 blocks/SM
// Compiler target: 256 * 4 = 1024 threads, max 65536/1024 = 64 regs/thread
__global__ void __launch_bounds__(256, 4)
kernel_high_occupancy(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Example 2: 256 threads, at least 8 blocks/SM
// Compiler target: 256 * 8 = 2048 threads, max 65536/2048 = 32 regs/thread
__global__ void __launch_bounds__(256, 8)
kernel_max_occupancy(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Example 3: Large block, fewer blocks
// 1024 threads, 2 blocks -> 2048 threads, 32 regs/thread
__global__ void __launch_bounds__(1024, 2)
kernel_large_block(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) data[idx] *= 2.0f;
}
launch_bounds Tradeoffs

Increasing minBlocksPerMultiprocessor forces the compiler to use fewer registers, which increases occupancy but may cause register spilling. The second parameter is optional — omitting it gives the compiler freedom to use as many registers as it wants, constrained only by the first parameter.

Checking Actual Register Usage

# Compile with verbose PTX info
nvcc -Xptxas -v -arch=sm_80 kernel.cu -o kernel

# Output looks like:
# ptxas info: Function properties for _Z17kernel_high_occupancyPfi
#     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
# ptxas info: Used 16 registers, 0 bytes smem, 0 bytes cmem[0]

# Or use cuobjdump for compiled binaries:
cuobjdump --dump-resource-usage kernel

When Maximum Occupancy Does Not Help

The relationship between occupancy and performance is not monotonic. Higher occupancy provides more warps to hide latency, but at the cost of fewer resources per thread (registers, shared memory). The inflection point depends on whether the kernel is memory-bound, compute-bound, or latency-bound.

Memory-Bound Kernels

For kernels that saturate memory bandwidth at moderate occupancy, increasing occupancy provides no benefit:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Simple memory copy kernel — purely bandwidth-bound
__global__ void copy_kernel(const float* __restrict__ src,
                            float* __restrict__ dst, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < n; i += stride) {
        dst[i] = src[i];
    }
}

void benchmark_occupancy_vs_perf(int n) {
    float *d_src, *d_dst;
    cudaMalloc(&d_src, n * sizeof(float));
    cudaMalloc(&d_dst, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int num_sms;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    // Test different occupancy levels by varying grid size
    int block_size = 256;
    int warps_per_block = block_size / 32;  // 8

    struct OccTest { int blocks_per_sm; float expected_occ; };
    OccTest tests[] = {
        {1, 12.5f}, {2, 25.0f}, {4, 50.0f}, {8, 100.0f}
    };

    for (auto& t : tests) {
        int grid_size = t.blocks_per_sm * num_sms;
        float ms;

        // Warmup
        copy_kernel<<<grid_size, block_size>>>(d_src, d_dst, n);

        cudaEventRecord(start);
        for (int i = 0; i < 100; i++) {
            copy_kernel<<<grid_size, block_size>>>(d_src, d_dst, n);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);

        float gb = 2.0f * n * sizeof(float) * 100 / 1e9;
        float bw = gb / (ms / 1000.0f);
        printf("Occupancy ~%5.1f%%: %.1f GB/s\n", t.expected_occ, bw);
    }

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main() {
    benchmark_occupancy_vs_perf(1 << 26);  // 64M floats (256 MB)
    return 0;
}
📊 Copy Kernel: Occupancy vs Effective Bandwidth (A100, 64M floats)

Occupancy | Active Warps/SM | Bandwidth (GB/s) | % of Peak
----------|-----------------|------------------|----------
12.5%     | 8               | 1210             | 61%
25%       | 16              | 1640             | 83%
50%       | 32              | 1850             | 94%
75%       | 48              | 1870             | 95%
100%      | 64              | 1880             | 95%
Note: Beyond 50% occupancy, bandwidth gains are marginal. The memory system saturates with ~32 warps per SM on the A100. The extra warps just queue up at the memory controller.


Compute-Bound Kernels: More Registers Win

For compute-intensive kernels with high arithmetic intensity, using more registers (lower occupancy) can outperform fewer registers (higher occupancy):

// A compute-intensive kernel: many registers needed for intermediate values
__global__ void __launch_bounds__(256, 2)  // Allow up to 128 regs/thread
compute_heavy_low_occ(const float* __restrict__ in, float* __restrict__ out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;

    float x = in[idx];

    // Heavy computation with many intermediate values
    float a = x * x;
    float b = a * x + 3.0f;
    float c = b * b - a;
    float d = sqrtf(fabsf(c));
    float e = sinf(d) + cosf(a);
    float f = e * d + b * c;
    float g = logf(fabsf(f) + 1.0f);
    float h = expf(-g * 0.1f);
    float result = h * a + f * g - d * e + c * b;

    out[idx] = result;
}

__global__ void __launch_bounds__(256, 8)  // Force max occupancy, fewer regs
compute_heavy_high_occ(const float* __restrict__ in, float* __restrict__ out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;

    float x = in[idx];

    // Same computation — compiler may spill registers
    float a = x * x;
    float b = a * x + 3.0f;
    float c = b * b - a;
    float d = sqrtf(fabsf(c));
    float e = sinf(d) + cosf(a);
    float f = e * d + b * c;
    float g = logf(fabsf(f) + 1.0f);
    float h = expf(-g * 0.1f);
    float result = h * a + f * g - d * e + c * b;

    out[idx] = result;
}
📊 Compute-Heavy Kernel: Occupancy vs Throughput

Version        | Regs/Thread | Occupancy | Time (ms) | Spill Loads
---------------|-------------|-----------|-----------|------------
Low occupancy  | 64          | 50%       | 2.1       | 0
High occupancy | 32          | 100%      | 2.8       | 24
Note: The high-occupancy version spills registers to local memory. Despite double the active warps, it runs 33% slower due to spill traffic.

Shared Memory Carveout Configuration

On Ampere and later, shared memory and L1 cache share a unified on-chip memory pool. You can configure the split per kernel:

#include <cuda_runtime.h>

// Default: implementation-defined split (typically 128 KB smem, 36 KB L1 on A100)

// Static __shared__ arrays are capped at 48 KB; anything larger must be
// dynamic (extern __shared__) with the size opted in via cudaFuncSetAttribute.
__global__ void kernel_needs_lots_of_smem(float* data) {
    extern __shared__ float buffer[];  // sized at launch, e.g. 128 KB
    // ...
}

void configure_and_launch() {
    // Option 1: Set maximum dynamic shared memory per kernel
    cudaFuncSetAttribute(
        kernel_needs_lots_of_smem,
        cudaFuncAttributeMaxDynamicSharedMemorySize,
        163 * 1024);  // 163 KB: the per-block max on SM 8.0 (1 KB is reserved)

    // Option 2: Set preferred shared memory carveout (fraction)
    cudaFuncSetAttribute(
        kernel_needs_lots_of_smem,
        cudaFuncAttributePreferredSharedMemoryCarveout,
        100);  // Request 100% for shared memory (hardware picks closest supported)

    // Launch with 128 KB of dynamic shared memory
    float* data = nullptr;  // placeholder for a real device allocation
    kernel_needs_lots_of_smem<<<108, 256, 128 * 1024>>>(data);
}

The impact on occupancy is direct. If you configure 164 KB for shared memory and your kernel uses 100 KB per block, only 1 block fits per SM. With 256 threads per block, that is 8 warps = 12.5% occupancy.

// Occupancy analysis for large shared memory kernels
void analyze_large_smem_kernel() {
    int block_size = 256;
    size_t smem_sizes[] = {16*1024, 32*1024, 48*1024, 64*1024, 96*1024, 128*1024};

    for (auto smem : smem_sizes) {
        int max_blocks;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &max_blocks, kernel_needs_lots_of_smem, block_size, smem);
        int active_warps = max_blocks * (block_size / 32);
        float occ = active_warps / 64.0f * 100.0f;
        printf("Smem %3zu KB: %d blocks/SM, %d warps, %.1f%% occupancy\n",
               smem / 1024, max_blocks, active_warps, occ);
    }
}
📊 Occupancy vs Shared Memory Per Block (A100, 256 threads/block)

Shared Mem/Block | Max Blocks/SM | Active Warps | Occupancy
-----------------|---------------|--------------|----------
16 KB            | 10            | 64 (capped)  | 100%
32 KB            | 5             | 40           | 62.5%
48 KB            | 3             | 24           | 37.5%
64 KB            | 2             | 16           | 25%
96 KB            | 1             | 8            | 12.5%
128 KB           | 1             | 8            | 12.5%
Note: At 96 KB and above, only one block fits per SM. Every boundary crossing costs a whole block, and the relative loss grows as the block count shrinks: going from 48 KB to 64 KB drops a third of the remaining warps.

Block Size Selection Strategies

Strategy 1: Maximize Occupancy

Use cudaOccupancyMaxPotentialBlockSize to find the block size that maximizes occupancy for a given kernel:

template <typename KernelFunc>
int find_best_block_size(KernelFunc kernel, size_t dynamic_smem = 0) {
    int min_grid, block_size;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block_size, kernel, dynamic_smem, 0);
    return block_size;
}

Strategy 2: Empirical Sweep

Occupancy is a proxy for performance, not performance itself. Sweep block sizes and measure:

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

template <typename Func>
void sweep_and_benchmark(Func kernel, float* d_in, float* d_out, int n,
                         const char* name) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    printf("\n=== %s ===\n", name);
    printf("Block Size | Blocks/SM | Occupancy | Time (us)\n");

    int best_bs = 0;
    float best_time = 1e9;

    for (int bs = 32; bs <= 1024; bs *= 2) {
        int max_blocks;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, kernel, bs, 0);

        int active_warps = max_blocks * (bs / 32);
        float occ = active_warps / 64.0f * 100.0f;

        int grid = (n + bs - 1) / bs;

        // Warmup
        kernel<<<grid, bs>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        // Benchmark
        cudaEventRecord(start);
        for (int i = 0; i < 100; i++) {
            kernel<<<grid, bs>>>(d_in, d_out, n);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        float us = ms * 10.0f;  // Per-invocation microseconds

        printf("%10d | %9d | %8.1f%% | %8.1f\n", bs, max_blocks, occ, us);

        if (us < best_time) {
            best_time = us;
            best_bs = bs;
        }
    }

    printf("Best: block size %d at %.1f us\n", best_bs, best_time);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

Strategy 3: Tile-Driven Block Size

For tiled algorithms (matrix multiply, stencil), the block size is determined by the tile dimensions, and shared memory usage is determined by the tile size. Occupancy is a consequence, not a goal:

// GEMM tile: 128x128 block tile, 8x8 thread tile
// Block: (128/8) x (128/8) = 16x16 = 256 threads
// Smem: 2 tiles of 128x16 floats = 2 * 128 * 16 * 4 = 16 KB
// Registers: ~64 per thread for 8x8 accumulator + operands
// Occupancy: 65536 / (64 * 32) = 32 warps = 50%

// Smaller tile: 64x64 block tile, 4x4 thread tile
// Block: (64/4) x (64/4) = 16x16 = 256 threads
// Smem: 2 * 64 * 16 * 4 = 8 KB
// Registers: ~32 per thread
// Occupancy: 65536 / (32 * 32) = 64 warps = 100%

// The 128x128 tile has better cache reuse and instruction-level parallelism
// despite lower occupancy, so it typically wins for large matrices.

Occupancy and Warp Scheduling Interactions

The SM has 4 warp schedulers on Ampere. Each scheduler needs ready warps to issue instructions. The minimum useful occupancy depends on memory latency:

\text{Warps to hide memory latency} = \text{latency (cycles)} \times \text{throughput (warps/cycle)}

For global memory with ~400 cycles of latency, this is Little's law: the number of memory requests in flight must equal latency times issue rate, and each warp can keep several loads in flight, which divides the warp count needed:

\text{Min warps} = \frac{\text{latency (cycles)} \times \text{issue rate (requests/cycle)}}{\text{outstanding loads per warp}}

Because a warp can have multiple loads outstanding, approximately 12-16 active warps typically suffice to hide global memory latency on an A100. This is 18.75%-25% occupancy.

For compute-bound kernels, the minimum is lower — about 4-8 warps (one or two per scheduler) if there are no memory stalls.

// Demonstrate: instrument a kernel to count active warps
__global__ void warp_count_kernel(float* data, int n, int* warp_count) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Count warps launched on each SM via atomics; this is cumulative,
    // not concurrently resident (a diagnostic tool, not production code)
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));

    // One thread per warp reports
    if (threadIdx.x % 32 == 0) {
        atomicAdd(&warp_count[smid], 1);
    }

    if (idx < n) {
        data[idx] *= 2.0f;
    }
}

Multi-Dimensional Block Configurations

For 2D and 3D kernels, the total thread count is what matters for occupancy, but the dimension layout affects memory access patterns:

#include <cuda_runtime.h>
#include <cstdio>

// All of these have 256 threads total, same occupancy:
// dim3(256, 1, 1), dim3(128, 2, 1), dim3(64, 4, 1), dim3(32, 8, 1), dim3(16, 16, 1)

// But memory access differs. For row-major matrix access:
// A[row][col], where row = blockIdx.y * blockDim.y + threadIdx.y
//                     col = blockIdx.x * blockDim.x + threadIdx.x

// Warp 0: threads 0-31, which maps to threadIdx.x = 0-15 with threadIdx.y = 0-1
// (for dim3(16, 16, 1))

// For coalesced column access, need threadIdx.x to vary fastest within a warp
// dim3(32, 8) guarantees 32 consecutive threadIdx.x values per warp

__global__ void matrix_kernel_2d(const float* __restrict__ A,
                                  float* __restrict__ B,
                                  int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < rows && col < cols) {
        // A[row * cols + col]: consecutive threads (same row, adjacent cols) -> coalesced
        B[row * cols + col] = A[row * cols + col] * 2.0f;
    }
}

void launch_2d() {
    int rows = 4096, cols = 4096;
    // dim3(32, 8): 32 threads in x -> full warp reads consecutive columns -> coalesced
    dim3 block(32, 8);  // 256 threads total
    dim3 grid((cols + 31) / 32, (rows + 7) / 8);
    // matrix_kernel_2d<<<grid, block>>>(...);
}
💡 Thread-to-Warp Mapping for 2D Blocks

CUDA maps threads to warps in row-major order: warpId = (threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x) / 32. For a dim3(16, 16) block, warp 0 contains threads with threadIdx.y=0, threadIdx.x=0-15 and threadIdx.y=1, threadIdx.x=0-15. This means adjacent threads in a warp span 2 rows, which can hurt access patterns.

Nsight Compute Occupancy Analysis

Nsight Compute reports achieved occupancy (the actual average, which may be lower than theoretical due to tail effects, partial waves, etc.) alongside theoretical occupancy:

# Profile occupancy metrics
ncu --metrics \
    sm__warps_active.avg.pct_of_peak_sustained_active,\
    sm__warps_active.avg.per_cycle_active,\
    launch__occupancy_limit_registers,\
    launch__occupancy_limit_shared_mem,\
    launch__occupancy_limit_blocks,\
    launch__occupancy_limit_warps,\
    launch__registers_per_thread,\
    launch__shared_mem_per_block_dynamic,\
    launch__shared_mem_per_block_static,\
    launch__block_size \
    ./my_kernel

# Key metrics to look for:
# sm__warps_active.avg.pct_of_peak_sustained_active = achieved occupancy %
# launch__occupancy_limit_registers = theoretical max due to registers
# launch__occupancy_limit_shared_mem = theoretical max due to smem
# launch__occupancy_limit_blocks = theoretical max due to block slots
# launch__occupancy_limit_warps = theoretical max due to the warps-per-SM ceiling

Interpreting the Occupancy Section

Section: Occupancy
  Theoretical Occupancy       75.0%    48/64 warps
  Achieved Occupancy          68.3%    43.7/64 warps
  Achieved Active Warps       43.7

  Block Limit SM              32       Not limiting
  Block Limit Registers       6        LIMITING (6 blocks * 8 warps = 48)
  Block Limit Shared Mem      10       Not limiting
  Block Limit Warps           8        Not limiting

  Registers Per Thread        40       Effective: 40
  Static Shared Memory        0 B
  Dynamic Shared Memory       8192 B
  Block Size                  256

In this output, registers are the limiter. The achieved occupancy (68.3%) is lower than theoretical (75%) because the last wave of blocks does not fully populate all SMs.

Tail Effects and Wave Quantization

When the total number of blocks is not a multiple of (num_SMs * blocks_per_SM), the last wave has fewer blocks than the SM capacity. This reduces achieved occupancy:

#include <cuda_runtime.h>
#include <cstdio>

void analyze_tail_effect(int n, int block_size) {
    int num_sms = 108;  // A100
    int blocks_per_sm = 8;  // Assuming 100% occupancy with this block size
    int full_wave = num_sms * blocks_per_sm;  // 864 blocks per wave

    int total_blocks = (n + block_size - 1) / block_size;
    int full_waves = total_blocks / full_wave;
    int tail_blocks = total_blocks % full_wave;

    float tail_utilization = (float)tail_blocks / full_wave * 100.0f;

    printf("Total blocks: %d\n", total_blocks);
    printf("Full waves: %d (%d blocks each)\n", full_waves, full_wave);
    printf("Tail wave: %d blocks (%.1f%% utilization)\n", tail_blocks, tail_utilization);

    // For short kernels (1-2 waves), tail effect dominates
    if (full_waves <= 2) {
        printf("WARNING: Tail effect is significant. Consider adjusting problem size.\n");
    }
}
ℹ️ Wave Quantization in Practice

For large problems with hundreds of waves, the tail effect is negligible. For small problems launched frequently (e.g., per-token operations in LLM inference with batch size 1), the tail wave can be a significant fraction of total execution time. In these cases, choosing block sizes that minimize tail waste matters more than maximizing theoretical occupancy.

Putting It Together: Occupancy Tuning Workflow

#include <cuda_runtime.h>
#include <cstdio>

// Step 1: Write kernel with __launch_bounds__ based on initial estimate
__global__ void __launch_bounds__(256, 4)
production_kernel(const float* __restrict__ input,
                  float* __restrict__ output,
                  int n) {
    extern __shared__ float smem[];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;

    // Load to shared memory
    smem[threadIdx.x] = input[idx];
    __syncthreads();

    // Compute with register-heavy operations
    float val = smem[threadIdx.x];
    float acc = 0.0f;
    for (int i = 0; i < 16; i++) {
        int neighbor = (threadIdx.x + i) % blockDim.x;
        acc += smem[neighbor] * (float)(i + 1);
    }
    output[idx] = acc + val;
}

// Step 2: Query occupancy using the kernel's compiled resource usage
void tune_kernel() {
    int block_sizes[] = {64, 128, 256};  // 512/1024 would exceed __launch_bounds__(256) and fail to launch
    int best_bs = 0;
    int best_occ = 0;

    for (int bs : block_sizes) {
        size_t smem = bs * sizeof(float);
        int max_blocks;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &max_blocks, production_kernel, bs, smem);
        int warps = max_blocks * (bs / 32);

        if (warps > best_occ) {
            best_occ = warps;
            best_bs = bs;
        }
        printf("BS=%4d: %2d blocks/SM, %2d warps, %.1f%% occ\n",
               bs, max_blocks, warps, warps / 64.0f * 100.0f);  // 64 warp slots per SM (A100)
    }

    // Step 3: Benchmark top candidates
    int n = 1 << 24;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int bs : block_sizes) {
        int grid = (n + bs - 1) / bs;
        size_t smem = bs * sizeof(float);

        production_kernel<<<grid, bs, smem>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int i = 0; i < 100; i++) {
            production_kernel<<<grid, bs, smem>>>(d_in, d_out, n);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("BS=%4d: %.2f ms/launch\n", bs, ms / 100.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
}

Decision Framework

  1. Start with 256 threads/block — it is a safe default that works well across most kernels
  2. Profile with Nsight Compute — check which resource limits occupancy
  3. If register-limited: try __launch_bounds__ with different minBlocksPerMultiprocessor values; check for spills
  4. If shared-memory-limited: reduce shared memory usage, or accept lower occupancy if the shared memory provides sufficient data reuse
  5. If block-slot-limited: increase threads per block
  6. Always benchmark — occupancy is a useful diagnostic, not a performance guarantee

Occupancy Tuning Summary: Relative Kernel Performance

  Configuration            Occupancy            GFLOPS
  Naive (32 thr)           12.5%                120
  Default (256)            75%                  310
  Max occ (256, LB)        100%, spills         290
  Tuned (256, LB4)         50%, no spills       350
  Tuned + smem             37.5%, full reuse    380

The best configuration is often not the one with highest occupancy. Occupancy is a necessary but not sufficient condition for performance. The right approach is to understand what limits your kernel, use occupancy analysis to diagnose resource bottlenecks, and validate with actual measurements.

Summary

Occupancy is determined by three resources: registers per thread, shared memory per block, and block slots per SM. The CUDA Occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor, cudaOccupancyMaxPotentialBlockSize) automates the calculation. __launch_bounds__ controls register allocation at compile time. Nsight Compute reports both theoretical and achieved occupancy, with the achieved value accounting for tail effects. The key engineering insight: occupancy is a diagnostic tool, not a performance target. Memory-bound kernels saturate bandwidth at 25-50% occupancy. Compute-bound kernels may run faster at lower occupancy if it avoids register spilling. Always measure.