Launch a CUDA kernel with a block size of 33 instead of 32 and effective throughput can drop by nearly half: each block now spans two warps, one of which carries a single active thread. Launch with block size 256 instead of 128 and occupancy may rise but performance stays flat — or sometimes degrades. These are not bugs. They are consequences of the hardware’s three-level hierarchy: grids map to the entire GPU, blocks map to streaming multiprocessors, and threads map to warps of exactly 32. Every dimension choice — grid size, block size, thread count — directly determines how efficiently the GPU can schedule, execute, and hide latency. Understanding this mapping is the difference between a kernel that runs and a kernel that runs fast.
All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise. The principles generalize across architectures.
The Hardware Foundation: SMs, Warp Schedulers, and Execution Units
An A100 GPU contains 108 Streaming Multiprocessors (SMs). Each SM is an independent processor with its own:
- Register file: 65,536 32-bit registers (256 KB)
- Shared memory / L1 cache: configurable, up to 164 KB shared + 28 KB L1
- Warp schedulers: 4 per SM
- Execution units: 64 FP32 CUDA cores, 32 FP64 cores, 4 tensor cores (third-gen)
- Load/store units: 32 per SM
- Special function units (SFUs): 16 per SM
Each warp scheduler manages a pool of warps and issues one instruction per cycle from one ready warp. With 4 schedulers, an SM can issue 4 independent instructions per cycle — one from each scheduler’s selected warp.
The Warp: 32 Threads in Lockstep
A warp is the fundamental unit of execution. It consists of 32 threads that execute the same instruction simultaneously. This is NVIDIA’s Single Instruction, Multiple Threads (SIMT) execution model.
Key properties of a warp:
- All 32 threads execute the same PC (program counter) at any given cycle
- Divergence (branches where threads take different paths) causes serialization: the warp executes each path sequentially, masking inactive threads
- Memory transactions are issued per-warp: 32 threads generate addresses, and the hardware coalesces them into cache line requests
- Warp-level primitives (`__shfl_sync`, `__ballot_sync`, etc.) exchange register values directly between the 32 threads of a warp
// Every thread in a warp executes this simultaneously
// Thread 0-31 all compute their own 'val', but in lockstep
__global__ void example_kernel(float* data, int n) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < n) {
float val = data[tid]; // 32 threads issue loads together
val = val * 2.0f + 1.0f; // 32 threads compute together
data[tid] = val; // 32 threads issue stores together
}
}
Warp Scheduling: Latency Hiding Through Occupancy
When a warp issues a memory load, it stalls until the data arrives (hundreds of cycles for global memory). The warp scheduler simply switches to another ready warp. This is the GPU’s primary latency-hiding mechanism: keep enough warps resident so there is always one ready to execute.
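As a back-of-envelope model, latency hiding follows Little's law: a scheduler stays busy when it has roughly latency / issue-interval warps resident. The sketch below uses illustrative numbers (a hypothetical 400-cycle global-memory latency, one long-latency instruction per warp every 10 cycles), not measured A100 values, and the function name is mine:

```cpp
#include <cassert>

// Little's law applied to latency hiding: if each warp issues a long-latency
// instruction every `issue_interval` cycles and the result takes `latency`
// cycles to return, a scheduler needs about latency / issue_interval
// resident warps to always have one ready.  Numbers are illustrative.
int warps_to_hide_latency(int latency, int issue_interval) {
    return (latency + issue_interval - 1) / issue_interval;  // ceil division
}
```

Under these assumed numbers, `warps_to_hide_latency(400, 10)` suggests about 40 resident warps per scheduler, which is why high occupancy matters for memory-latency-bound kernels.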
The ratio of resident warps per SM to the hardware maximum is called occupancy:

occupancy = active warps per SM / maximum warps per SM

On Ampere, maximum warps per SM = 64 (2048 threads / 32 threads per warp). 100% occupancy means 64 warps are resident simultaneously, giving the scheduler maximum freedom to hide latency.
High occupancy does not guarantee high performance. A kernel at 50% occupancy can outperform one at 100% occupancy if the lower-occupancy kernel has better memory access patterns or higher instruction-level parallelism. Occupancy provides the opportunity to hide latency; whether that opportunity matters depends on whether your kernel is latency-bound in the first place.
The Three-Level Hierarchy: Grid, Block, Thread
When you launch a CUDA kernel, you specify a grid of blocks, where each block contains a fixed number of threads:
// Launch configuration
dim3 gridDim(num_blocks_x, num_blocks_y, num_blocks_z);
dim3 blockDim(threads_per_block_x, threads_per_block_y, threads_per_block_z);
kernel<<<gridDim, blockDim>>>(args...);
Grid: The Top Level
A grid is the entire collection of thread blocks launched by a single kernel invocation. Grid dimensions can be 1D, 2D, or 3D:
- 1D grid: `gridDim = (N, 1, 1)` — common for vector operations
- 2D grid: `gridDim = (Nx, Ny, 1)` — common for 2D data (images, matrices)
- 3D grid: `gridDim = (Nx, Ny, Nz)` — common for volumetric data
Maximum grid dimensions on Compute Capability 8.0:
- `gridDim.x`: up to 2,147,483,647 (2^31 − 1)
- `gridDim.y`: up to 65,535
- `gridDim.z`: up to 65,535
Block: The Cooperative Unit
A thread block is a group of threads that can:
- Synchronize via `__syncthreads()`
- Share data through shared memory (fast, on-chip SRAM)
- Execute on the same SM — all threads in a block are guaranteed to be co-resident
Block dimensions are also 1D, 2D, or 3D. Maximum limits on CC 8.0:
- `blockDim.x`: up to 1024
- `blockDim.y`: up to 1024
- `blockDim.z`: up to 64
- Total threads per block: up to 1024
A block is divided into warps. A block of 256 threads contains 8 warps. The mapping is straightforward: thread IDs 0-31 form warp 0, threads 32-63 form warp 1, and so on.
// For a 2D block of (16, 16) = 256 threads:
// Linear thread ID within block:
int linear_tid = threadIdx.x + threadIdx.y * blockDim.x;
// Warp ID within block:
int warp_id = linear_tid / 32; // 0..7
// Lane ID within warp:
int lane_id = linear_tid % 32; // 0..31
Thread: The Individual Unit
Each thread has:
- A unique ID within its block: `threadIdx.x`, `threadIdx.y`, `threadIdx.z`
- Access to the block’s shared memory
- Its own registers (allocated from the SM’s register file)
- A unique global ID computed from `threadIdx`, `blockIdx`, and `blockDim`
Thread Indexing: Computing Global IDs
The most fundamental pattern in CUDA programming is computing a global thread index from the built-in variables.
1D Indexing
// N elements, one thread per element
__global__ void vector_add_1d(float* a, float* b, float* c, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
// Launch: enough blocks to cover N elements
int block_size = 256;
int grid_size = (n + block_size - 1) / block_size;
vector_add_1d<<<grid_size, block_size>>>(a, b, c, n);
2D Indexing
// M x N matrix, one thread per element
__global__ void matrix_scale_2d(float* mat, float scalar, int M, int N) {
int col = threadIdx.x + blockIdx.x * blockDim.x;
int row = threadIdx.y + blockIdx.y * blockDim.y;
if (row < M && col < N) {
mat[row * N + col] = mat[row * N + col] * scalar;
}
}
// Launch: 2D grid of 2D blocks
dim3 block(16, 16); // 256 threads per block
dim3 grid((N + 15) / 16, (M + 15) / 16);
matrix_scale_2d<<<grid, block>>>(mat, scalar, M, N);
3D Indexing
// Volume processing: W x H x D
__global__ void volume_process(float* vol, int W, int H, int D) {
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int z = threadIdx.z + blockIdx.z * blockDim.z;
if (x < W && y < H && z < D) {
int idx = x + y * W + z * W * H;
vol[idx] = process(vol[idx]);
}
}
dim3 block(8, 8, 4); // 256 threads
dim3 grid((W+7)/8, (H+7)/8, (D+3)/4);
volume_process<<<grid, block>>>(vol, W, H, D);
Grid-Stride Loops: When One Thread Handles Multiple Elements
When the total number of elements exceeds the number of launched threads (or you want a fixed grid size for persistent kernel patterns), use a grid-stride loop:
__global__ void vector_add_stride(float* a, float* b, float* c, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n; i += stride) {
c[i] = a[i] + b[i];
}
}
// Launch with limited grid size
int block_size = 256;
int grid_size = (n + block_size - 1) / block_size;
if (grid_size > 1024) grid_size = 1024; // cap the grid; the stride loop covers the rest
vector_add_stride<<<grid_size, block_size>>>(a, b, c, n);
Grid-stride loops have three advantages: they amortize launch overhead, they allow a fixed grid size (useful for persistent kernels), and they preserve coalescing, because within every iteration consecutive threads still access consecutive elements — the stride is a whole multiple of the block size.
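The coverage guarantee can be checked on the host. This sketch (the helper name is mine) replays the loop for every (block, thread) pair and confirms each element is visited exactly once, no matter how small the grid is relative to n:

```cpp
#include <cassert>
#include <vector>

// Host-side replay of the grid-stride loop: every index in [0, n) is
// visited exactly once even when grid_size * block_size < n, and within
// each pass consecutive threads touch consecutive elements.
std::vector<int> replay_grid_stride(int grid_size, int block_size, int n) {
    std::vector<int> visits(n, 0);
    const int stride = grid_size * block_size;
    for (int b = 0; b < grid_size; ++b)
        for (int t = 0; t < block_size; ++t)
            for (int i = t + b * block_size; i < n; i += stride)
                visits[i]++;
    return visits;
}
```

Running it with a deliberately undersized grid (e.g. 4 blocks of 256 threads against 5000 elements) shows every visit count equal to 1.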
Block Scheduling: How Blocks Map to SMs
The GPU block scheduler assigns blocks to SMs at runtime. This is the critical link between your launch configuration and hardware utilization.
The Block Scheduler Algorithm
- The scheduler maintains a queue of unexecuted blocks
- For each SM, it checks if the SM has enough resources for the next block: registers, shared memory, warp slots, block slots
- If resources are available, the block is assigned to that SM
- When a block completes (all threads exit), its resources are freed and a new block can be scheduled
Key implications:
- Blocks are independent: the scheduler can assign them to any SM in any order
- Blocks do not migrate: once assigned to an SM, a block runs there until completion
- Multiple blocks can share an SM: if resources permit, 2, 4, 8, or more blocks can be resident simultaneously
- Minimum parallelism: you need at least as many blocks as SMs (108 on A100) to keep the GPU busy
Block Count vs GPU Utilization (A100, 108 SMs)
| Grid Size (blocks) | SMs Active | GPU Utilization | Notes |
|---|---|---|---|
| 1 | 1 / 108 | 0.9% | Extremely underutilized |
| 54 | 54 / 108 | 50% | Half the GPU idle |
| 108 | 108 / 108 | 100% (1 block/SM) | Minimum full utilization |
| 216 | 108 / 108 | 100% (2 blocks/SM) | Better latency hiding |
| 864 | 108 / 108 | 100% (8 blocks/SM) | Near-maximum occupancy |
| 10000+ | 108 / 108 | 100% | Ample parallelism |
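The utilization numbers above follow from simple "wave" arithmetic. This is an idealized model (real scheduling overlaps waves as blocks finish at different times, so it is an upper bound on the tail cost), and the struct and function names are mine:

```cpp
#include <cassert>

// Idealized wave model of block scheduling: with `per_sm` resident blocks
// per SM, the GPU runs ceil(blocks / (per_sm * sms)) waves; the final wave
// may leave SMs idle (the "tail effect").
struct WaveStats {
    int waves;                    // number of full scheduling waves
    double last_wave_utilization; // fraction of block slots filled in the last wave
};

WaveStats wave_stats(int blocks, int sms, int per_sm) {
    int capacity = sms * per_sm;                    // blocks per full wave
    int waves = (blocks + capacity - 1) / capacity; // ceil division
    int last = blocks - (waves - 1) * capacity;     // blocks in final wave
    return { waves, (double)last / capacity };
}
```

For example, 216 blocks on 108 SMs at 2 blocks/SM is exactly one full wave, while 110 blocks at 1 block/SM needs a second wave that occupies only 2 of 108 SMs.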
Resource Limits per SM (Ampere A100, CC 8.0)
Each SM has hard resource limits that constrain how many blocks can be resident:
- Maximum blocks per SM: 32
- Maximum warps per SM: 64
- Maximum threads per SM: 2048
- Register file: 65,536 registers
- Shared memory: up to 164 KB (configurable)
The actual number of resident blocks is the minimum across all resource constraints:

blocks per SM = min(32, 2048 / threads_per_block, 65536 / (regs_per_thread × threads_per_block), smem_per_SM / smem_per_block)
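That minimum can be sketched as a small host-side calculator using the CC 8.0 limits quoted above. Note this ignores register allocation granularity (real hardware allocates registers in fixed-size chunks per warp), so actual occupancy can come out slightly lower; the function name is mine:

```cpp
#include <algorithm>
#include <cassert>

// Resident blocks per SM = min over every per-SM resource constraint
// (CC 8.0 limits: 32 blocks, 2048 threads, 65,536 registers, 164 KB smem).
int resident_blocks(int block_size, int regs_per_thread, int smem_per_block) {
    const int kMaxBlocks = 32, kMaxThreads = 2048;
    const int kRegFile = 65536, kSmemBytes = 164 * 1024;
    int by_threads = kMaxThreads / block_size;
    int by_regs    = kRegFile / (regs_per_thread * block_size);
    int by_smem    = smem_per_block > 0 ? kSmemBytes / smem_per_block : kMaxBlocks;
    return std::min({kMaxBlocks, by_threads, by_regs, by_smem});
}
```

With 256-thread blocks: 32 regs/thread gives 8 resident blocks (100% occupancy), 64 regs/thread gives 4 (50%), and 32 KB of shared memory per block caps it at 5.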
Occupancy Analysis: Choosing Block Size
Block size is the single most impactful launch configuration parameter. Here is a systematic analysis.
Register Pressure
Each thread uses registers for local variables. The compiler allocates them, and the count is visible via `--ptxas-options=-v` at compile time. If a kernel uses 32 registers per thread and the block size is 256 threads, each block consumes 32 × 256 = 8,192 registers, so ⌊65,536 / 8,192⌋ = 8 blocks fit per SM: 2,048 threads, 100% occupancy.
But if the kernel uses 64 registers per thread, each block consumes 64 × 256 = 16,384 registers, so only 4 blocks fit per SM: 1,024 threads, 50% occupancy.
Shared Memory Pressure
If a kernel uses 32 KB of shared memory per block, at most ⌊164 KB / 32 KB⌋ = 5 blocks fit per SM: 5 × 256 = 1,280 threads, 62.5% occupancy. Shared memory, not registers or thread slots, becomes the limiter.
Block Size Occupancy Table
Occupancy vs Block Size (A100, 32 regs/thread, 0 smem)
| Block Size | Warps/Block | Blocks/SM (thread-limited) | Blocks/SM (reg-limited) | Blocks/SM (actual) | Occupancy |
|---|---|---|---|---|---|
| 32 | 1 | 32 (capped) | 32 (capped) | 32 | 50% |
| 64 | 2 | 32 | 32 | 32 | 100% |
| 128 | 4 | 16 | 16 | 16 | 100% |
| 256 | 8 | 8 | 8 | 8 | 100% |
| 512 | 16 | 4 | 4 | 4 | 100% |
| 1024 | 32 | 2 | 2 | 2 | 100% |
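The occupancy column is just resident warps over the 64-warp maximum; this one-liner (name mine) reproduces every row of the table from the "actual" blocks/SM column:

```cpp
#include <cassert>

// Occupancy = resident warps / 64 (the Ampere per-SM maximum).
double occupancy_pct(int blocks_per_sm, int block_size) {
    int warps = blocks_per_sm * (block_size / 32);
    return 100.0 * warps / 64;
}
```

The 32-thread row is the only one that falls short: 32 resident blocks of 1 warp each is 32 warps, i.e. 50%.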
The Occupancy API
CUDA provides a runtime API to query optimal launch configurations:
#include <cuda_runtime.h>
// Query maximum active blocks per SM for a given kernel and block size
int num_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&num_blocks,
my_kernel, // kernel function pointer
256, // block size
0 // dynamic shared memory per block
);
// Auto-select block size that maximizes occupancy
int min_grid_size, block_size;
cudaOccupancyMaxPotentialBlockSize(
&min_grid_size,
&block_size,
my_kernel,
0, // dynamic smem
0 // block size limit (0 = no limit)
);
printf("Optimal block size: %d, min grid size: %d\n", block_size, min_grid_size);
Use this API as a starting point, not a final answer. It maximizes occupancy, which is the right heuristic for memory-bound kernels. For compute-bound kernels, you may get better performance at lower occupancy if the extra registers enable more instruction-level parallelism. Always benchmark.
Warp Divergence: The Cost of Branching
When threads within a warp take different branches, the warp must execute both paths sequentially. This is called warp divergence.
// BAD: Divergent branching within a warp
__global__ void divergent_kernel(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
if (threadIdx.x % 2 == 0) {
// Even threads take this path
data[idx] = expensive_function_a(data[idx]);
} else {
// Odd threads take this path
// Warp executes both paths sequentially
data[idx] = expensive_function_b(data[idx]);
}
}
}
In the above kernel, every warp has 16 threads taking path A and 16 threads taking path B. The warp executes path A (16 threads active, 16 masked), then path B (16 active, 16 masked). Effective throughput: 50%.
// BETTER: Divergence at warp boundaries
__global__ void aligned_kernel(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int warp_id = threadIdx.x / 32;
if (idx < n) {
if (warp_id % 2 == 0) {
// Entire warp 0, 2, 4... takes this path
data[idx] = expensive_function_a(data[idx]);
} else {
// Entire warp 1, 3, 5... takes this path
// No divergence within any warp
data[idx] = expensive_function_b(data[idx]);
}
}
}
The cost model for divergence: a warp pays once for every distinct path its active threads take, so execution time scales with the number of unique paths. Two paths cost roughly 2× the instructions, regardless of how the 32 threads split between them.
For branches that affect entire warps uniformly, there is no divergence cost at all. The hardware simply executes the taken path. This is why organizing data so that adjacent threads (within a warp) take the same branch is a fundamental optimization.
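That cost model fits in a few lines. The sketch below (names mine) takes the branch label each thread in a warp would follow and returns the serialization factor, 1 meaning no divergence:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Divergence cost model: a warp pays once per *distinct* path its threads
// take.  paths[t] names the branch thread t follows; the returned factor
// is the approximate slowdown (1 = uniform branch, no divergence cost).
int divergence_factor(const std::vector<int>& paths) {
    return (int)std::set<int>(paths.begin(), paths.end()).size();
}
```

The even/odd kernel above gives every warp a factor of 2 (50% throughput); the warp-aligned version gives every warp a factor of 1.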
Measuring Divergence with Nsight Compute
Nsight Compute reports the smsp__thread_inst_executed_per_inst_executed metric, which tells you the average number of active threads per instruction. For a non-divergent kernel, this is 32. For a 50% divergent kernel, it is 16.
ncu --metrics smsp__thread_inst_executed_per_inst_executed.avg \
./my_program
Implementation: Vector Add with Optimal Block Size Selection
This complete implementation demonstrates every concept covered above: thread indexing, block size selection, occupancy analysis, and grid-stride loops.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
// Error checking macro
#define CUDA_CHECK(call) do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while(0)
// Kernel: grid-stride vector add
__global__ void vector_add(const float* __restrict__ a,
const float* __restrict__ b,
float* __restrict__ c,
int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
// Grid-stride loop: each thread processes multiple elements
for (int i = idx; i < n; i += stride) {
c[i] = a[i] + b[i];
}
}
// Kernel: vectorized (float4) grid-stride vector add
__global__ void vector_add_vec4(const float4* __restrict__ a,
const float4* __restrict__ b,
float4* __restrict__ c,
int n4) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n4; i += stride) {
float4 va = a[i];
float4 vb = b[i];
c[i] = make_float4(va.x + vb.x, va.y + vb.y,
va.z + vb.z, va.w + vb.w);
}
}
struct LaunchConfig {
int block_size;
int grid_size;
int occupancy_blocks_per_sm;
float occupancy_pct;
};
// Analyze and select optimal launch configuration
LaunchConfig select_launch_config(void* kernel, int n,
int dynamic_smem = 0) {
LaunchConfig config;
// Let CUDA runtime select optimal block size
int min_grid_size;
CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(
&min_grid_size,
&config.block_size,
kernel,
dynamic_smem,
0 // no block size limit
));
// Query actual occupancy at this block size
CUDA_CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&config.occupancy_blocks_per_sm,
kernel,
config.block_size,
dynamic_smem
));
// Compute occupancy percentage
config.occupancy_pct = (float)(config.occupancy_blocks_per_sm *
config.block_size) / 2048.0f * 100.0f;
// Grid size: enough blocks to fill the GPU, then rely on grid-stride
int device;
cudaDeviceProp props;
CUDA_CHECK(cudaGetDevice(&device));
CUDA_CHECK(cudaGetDeviceProperties(&props, device));
// Target: occupancy_blocks_per_sm * num_SMs blocks
config.grid_size = config.occupancy_blocks_per_sm *
props.multiProcessorCount;
// But don't exceed what's needed for the data
int needed = (n + config.block_size - 1) / config.block_size;
if (needed < config.grid_size) {
config.grid_size = needed;
}
return config;
}
void benchmark_block_sizes(int n) {
float *d_a, *d_b, *d_c;
size_t bytes = n * sizeof(float);
CUDA_CHECK(cudaMalloc(&d_a, bytes));
CUDA_CHECK(cudaMalloc(&d_b, bytes));
CUDA_CHECK(cudaMalloc(&d_c, bytes));
// Initialize
float* h_a = (float*)malloc(bytes);
for (int i = 0; i < n; i++) h_a[i] = 1.0f;
CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(d_b, h_a, bytes, cudaMemcpyHostToDevice));
free(h_a);
printf("Vector size: %d elements (%.1f MB)\n", n, bytes / 1e6);
printf("%-12s %-12s %-12s %-15s %-12s\n",
"BlockSize", "GridSize", "Occupancy%", "Bandwidth(GB/s)", "Time(ms)");
printf("%-12s %-12s %-12s %-15s %-12s\n",
"--------", "--------", "---------", "--------------", "-------");
int block_sizes[] = {32, 64, 128, 256, 512, 1024};
for (int bs : block_sizes) {
int occ_blocks;
CUDA_CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&occ_blocks, (void*)vector_add, bs, 0));
float occ_pct = (float)(occ_blocks * bs) / 2048.0f * 100.0f;
int grid = (n + bs - 1) / bs;
int full = occ_blocks * 108; // 108 = A100 SM count; query at runtime in portable code
if (grid > full) grid = full;
// Warmup
vector_add<<<grid, bs>>>(d_a, d_b, d_c, n);
CUDA_CHECK(cudaDeviceSynchronize());
// Benchmark
cudaEvent_t start, stop;
CUDA_CHECK(cudaEventCreate(&start));
CUDA_CHECK(cudaEventCreate(&stop));
int iterations = 100;
CUDA_CHECK(cudaEventRecord(start));
for (int i = 0; i < iterations; i++) {
vector_add<<<grid, bs>>>(d_a, d_b, d_c, n);
}
CUDA_CHECK(cudaEventRecord(stop));
CUDA_CHECK(cudaEventSynchronize(stop));
float ms;
CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
ms /= iterations;
// Bandwidth: 3 arrays * n elements * 4 bytes each
double bw = 3.0 * n * sizeof(float) / (ms / 1000.0) / 1e9;
printf("%-12d %-12d %-12.0f %-15.1f %-12.3f\n",
bs, grid, occ_pct, bw, ms);
CUDA_CHECK(cudaEventDestroy(start));
CUDA_CHECK(cudaEventDestroy(stop));
}
CUDA_CHECK(cudaFree(d_a));
CUDA_CHECK(cudaFree(d_b));
CUDA_CHECK(cudaFree(d_c));
}
int main() {
int n = 1 << 24; // 16M elements
// Auto-select optimal configuration
LaunchConfig config = select_launch_config(
(void*)vector_add, n);
printf("Auto-selected: block=%d, grid=%d, occupancy=%.0f%%\n",
config.block_size, config.grid_size, config.occupancy_pct);
// Benchmark all block sizes
benchmark_block_sizes(n);
return 0;
}
Expected Results
Vector Add Bandwidth vs Block Size (A100, N=16M, FP32)
| Block Size | Occupancy | Bandwidth (GB/s) | % of Peak (2039 GB/s) |
|---|---|---|---|
| 32 | 50% | 1580 | 77.5% |
| 64 | 100% | 1810 | 88.8% |
| 128 | 100% | 1890 | 92.7% |
| 256 | 100% | 1910 | 93.7% |
| 512 | 100% | 1900 | 93.2% |
| 1024 | 100% | 1880 | 92.2% |
Vectorized Access: float4
Using float4 loads (128-bit per thread) improves bandwidth by reducing the number of memory transactions:
// Launch for vectorized kernel
int n4 = n / 4;
LaunchConfig config4 = select_launch_config(
(void*)vector_add_vec4, n4);
vector_add_vec4<<<config4.grid_size, config4.block_size>>>(
(float4*)d_a, (float4*)d_b, (float4*)d_c, n4);
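The per-warp arithmetic behind that gain can be sketched on the host (struct and function names are mine). A fully coalesced warp moves 32 × bytes-per-thread in 32-byte sectors, so float4 moves the same bytes per sector but needs 4× fewer load instructions:

```cpp
#include <cassert>

// Per-warp memory traffic for a fully coalesced access: 32 threads each
// loading `bytes_per_thread`, delivered in 32-byte sectors.  float4
// quadruples bytes per instruction, so the same total traffic needs 4x
// fewer load instructions (the sector count per byte is unchanged).
struct WarpLoad {
    int bytes;   // total bytes moved by one warp-level load
    int sectors; // 32-byte memory sectors touched
};

WarpLoad warp_load(int bytes_per_thread) {
    int bytes = 32 * bytes_per_thread;
    return { bytes, bytes / 32 };
}
```

A scalar float load is 128 bytes / 4 sectors per warp instruction; a float4 load is 512 bytes / 16 sectors, so one instruction does the work of four. The win is fewer instructions and better issue efficiency, not fewer bytes.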
Scalar vs Vectorized Vector Add (A100, N=16M)
| Variant | Block Size | Bandwidth (GB/s) | % of Peak |
|---|---|---|---|
| Scalar (float) | 256 | 1910 | 93.7% |
| Vectorized (float4) | 256 | 1970 | 96.6% |
Advanced: Cooperative Groups
CUDA Cooperative Groups (introduced in CUDA 9) generalize the synchronization model beyond block-level __syncthreads():
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void cooperative_example(float* data, int n) {
// Thread block group (equivalent to __syncthreads scope)
cg::thread_block block = cg::this_thread_block();
// Warp-level group (no need for shared memory)
cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
// Dynamic shared memory for per-warp partial sums; declared at function
// scope so the block-level reduction below can also see it
extern __shared__ float warp_sums[];
// Warp-level reduction using cooperative groups
int idx = threadIdx.x + blockIdx.x * blockDim.x;
float val = (idx < n) ? data[idx] : 0.0f;
// Reduce within warp (no shared memory needed)
for (int offset = warp.size() / 2; offset > 0; offset /= 2) {
val += warp.shfl_down(val, offset);
}
// Thread 0 of each warp writes its partial sum
if (warp.thread_rank() == 0) {
warp_sums[warp.meta_group_rank()] = val;
}
block.sync(); // Equivalent to __syncthreads()
// First warp reduces warp sums
if (warp.meta_group_rank() == 0) {
int num_warps = block.size() / 32;
val = (warp.thread_rank() < num_warps) ?
warp_sums[warp.thread_rank()] : 0.0f;
for (int offset = warp.size() / 2; offset > 0; offset /= 2) {
val += warp.shfl_down(val, offset);
}
}
}
Grid-Level Synchronization
Cooperative launch enables synchronization across the entire grid. All blocks must be resident simultaneously, which limits the grid size:
// Grid-level cooperative kernel
__global__ void grid_cooperative_kernel(float* data) {
cg::grid_group grid = cg::this_grid();
// Phase 1: local computation
int idx = grid.thread_rank();
data[idx] = compute(data[idx]);
// Barrier: ALL threads across ALL blocks synchronize
grid.sync();
// Phase 2: read neighbors (guaranteed to see phase 1 results)
data[idx] = data[idx] + data[(idx + 1) % grid.size()];
}
// Must use cooperative launch
void* args[] = { &d_data };
cudaLaunchCooperativeKernel(
(void*)grid_cooperative_kernel,
grid_size, // limited by SM count * blocks_per_sm
block_size,
args
);
Grid synchronization requires all blocks to be resident simultaneously. On A100, if your kernel supports 4 blocks per SM, the maximum grid size is 4 × 108 = 432 blocks. This significantly limits the problem sizes you can handle with grid sync. Use it only when the algorithm genuinely requires global synchronization.
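The cap is just the product of per-SM residency and SM count; in real code both factors come from `cudaOccupancyMaxActiveBlocksPerMultiprocessor` and `cudaDeviceProp::multiProcessorCount`. A host-side sketch (function name mine):

```cpp
#include <cassert>

// Cooperative launch limit: every block must be resident at once, so the
// grid cannot exceed blocks_per_sm * num_sms.  Both inputs should be
// queried at runtime rather than hardcoded.
int max_cooperative_grid(int blocks_per_sm, int num_sms) {
    return blocks_per_sm * num_sms;
}
```

At 4 blocks/SM on a 108-SM A100 this yields 432 blocks; with 256-thread blocks that is only ~110K threads per cooperative launch.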
Common Mistakes and How to Fix Them
Mistake 1: Block Size Too Small
// BAD: 32 threads per block = 1 warp per block
// Wastes block slots (max 32 blocks/SM), limits occupancy to 50%
kernel<<<grid, 32>>>(data, n);
// GOOD: 256 threads per block
kernel<<<grid, 256>>>(data, n);
Mistake 2: Ignoring Tail Effects
// BAD: No bounds check, out-of-bounds access
__global__ void bad_kernel(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
data[idx] = data[idx] * 2.0f; // UB when idx >= n
}
// GOOD: Bounds check
__global__ void good_kernel(float* data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
data[idx] = data[idx] * 2.0f;
}
}
Mistake 3: Hardcoding Grid Size for Specific GPU
// BAD: Assumes 108 SMs (A100-specific)
kernel<<<108 * 8, 256>>>(data, n);
// GOOD: Query device properties
int device;
cudaGetDevice(&device);
cudaDeviceProp props;
cudaGetDeviceProperties(&props, device);
int grid = props.multiProcessorCount * blocks_per_sm;
kernel<<<grid, 256>>>(data, n);
Mistake 4: Excessive Register Usage Killing Occupancy
// Check register usage at compile time
// nvcc --ptxas-options=-v kernel.cu
// ptxas info: Used 64 registers, 0 bytes smem
// Limit registers to increase occupancy
__global__ __launch_bounds__(256, 4) // max 256 threads, min 4 blocks/SM
void limited_kernel(float* data, int n) {
// Compiler will spill to local memory if needed to stay within
// 65536 / (4 * 256) = 64 registers per thread
}
Summary: Decision Framework for Launch Configuration
- Start with `cudaOccupancyMaxPotentialBlockSize` to get a baseline block size
- Use 128 or 256 threads per block as defaults — these work well across most kernels
- Check register usage with `--ptxas-options=-v` and ensure you have at least 50% occupancy
- Use grid-stride loops for large data to keep grid sizes manageable and portable
- Benchmark at least three block sizes (128, 256, 512) with real data to find the actual optimum
- Profile with Nsight Compute to identify whether your kernel is limited by occupancy, memory bandwidth, or compute
This is Part 1 of the CUDA Kernel Engineering series. Part 2 covers memory coalescing — the access pattern rules that determine whether your kernel achieves 90% or 9% of peak memory bandwidth.