You just wrote a CUDA kernel and it runs at 180 GB/s on hardware capable of 2,039 GB/s — 9% of peak memory bandwidth. Guessing at optimizations wastes time. Instead, run ncu --set full and the profiler tells you exactly what is wrong: 23% memory bandwidth utilization, 89% L2 hit rate (good), but only 12% warp occupancy (bad — you are latency-bound). Fix occupancy by reducing register usage, rerun, and bandwidth jumps to 920 GB/s. This post walks through the systematic optimization workflow: profile, identify the bottleneck, apply the targeted fix, and reprofile to verify.
The Optimization Workflow
The process is always the same: profile -> identify bottleneck -> fix -> reprofile. Never optimize blind.
# Step 1: Profile the kernel
ncu --set full -o report ./my_application
# Step 2: Look at the Speed of Light (SOL) section
# It tells you: % of peak compute, % of peak memory bandwidth
# Whichever is higher is your bottleneck
Speed of Light Interpretation
| SOL Compute | SOL Memory | Bottleneck | Optimization Path |
|---|---|---|---|
| over 60% | under 40% | Compute-bound | Reduce instructions, use specialized units (tensor cores) |
| under 40% | over 60% | Memory-bound | Improve coalescing, add caching/tiling, reduce traffic |
| under 40% | under 40% | Latency-bound | Increase occupancy, overlap compute and memory |
| over 60% | over 60% | Well-balanced | Both resources utilized -- minor tuning only |
Fix #1: Memory Coalescing (Biggest Impact)
The single most common problem. Non-coalesced global memory access can waste 4-32x of the achievable bandwidth:
// BAD: Column-major access in row-major array -- stride = N per thread
__global__ void bad_coalescing(float *matrix, float *output, int N) {
int row = threadIdx.x; // Adjacent threads access different rows
int col = blockIdx.x;
output[row * N + col] = matrix[row * N + col] * 2.0f;
// Thread 0: matrix[0], Thread 1: matrix[N], Thread 2: matrix[2N]
// Stride = N -> non-coalesced!
}
// GOOD: Row-major access -- stride = 1 per thread
__global__ void good_coalescing(float *matrix, float *output, int N) {
int col = threadIdx.x; // Adjacent threads access adjacent columns
int row = blockIdx.x;
output[row * N + col] = matrix[row * N + col] * 2.0f;
// Thread 0: matrix[row*N], Thread 1: matrix[row*N+1], ...
// Stride = 1 -> perfectly coalesced!
}
Impact of Memory Coalescing Fix
(chart: achieved bandwidth in GB/s, before vs after the fix)
In Nsight Compute, check l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum vs l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum. A sector is 32 bytes, so a perfectly coalesced warp-wide 4-byte load (32 threads x 4 bytes = 128 bytes) needs exactly 4 sectors per request. If sectors/request is much higher than 4 (up to 32 when every lane touches a different sector), you have coalescing issues.
Fix #2: Occupancy and Latency Hiding
Low occupancy means the GPU can’t hide memory latency by switching between warps:
Common Occupancy Limiters
| Limiter | How to Detect | Fix |
|---|---|---|
| Too many registers | ncu: Registers per thread over 64 | Use __launch_bounds__, reduce local variables |
| Too much shared memory | ncu: Shared memory over 48KB/block | Reduce tile size, use multi-phase tiling |
| Block size too large | Block dim over 512 with high resources | Try 128 or 256 threads per block |
| Block size too small | Block dim under 128 | Increase to at least 128 for latency hiding |
// Limit register usage to improve occupancy
__global__ void __launch_bounds__(256, 4) // max 256 threads/block, at least 4 blocks/SM
optimized_kernel(float *data, int n) {
// The compiler caps register usage (spilling to local memory if needed)
// to satisfy the 4-blocks-per-SM target
}
Fix #3: Kernel Fusion (Eliminate Launch Overhead)
Each CUDA kernel launch costs ~5-10 us. For small kernels, this overhead dominates:
Kernel Launch Overhead Analysis
| Kernel | Compute Time | Launch Overhead | Overhead % | Action |
|---|---|---|---|---|
| Large GEMM | 2.0 ms | 5 us | 0.25% | No action needed |
| LayerNorm | 50 us | 5 us | 10% | Fuse with adjacent ops |
| Bias add | 8 us | 5 us | 38% | Must fuse -- overhead dominates |
| Activation (GELU) | 12 us | 5 us | 29% | Fuse with preceding GEMM |
The solution: fuse small element-wise operations into the preceding or following GEMM kernel, or combine sequential element-wise ops into a single kernel:
// Unfused: 3 launches, 3 HBM round-trips
bias_add_kernel<<<grid, block>>>(output, bias, n); // 8 us compute + 5 us launch
gelu_kernel<<<grid, block>>>(output, n); // 12 us compute + 5 us launch
residual_add_kernel<<<grid, block>>>(output, residual, n); // 8 us compute + 5 us launch
// Total: 43 us
// Fused: 1 launch, 1 HBM round-trip
fused_bias_gelu_residual<<<grid, block>>>(output, bias, residual, n);
// Total: 16 us (2.7x faster)
Fix #4: Instruction Mix Optimization
For compute-bound kernels, the instruction mix matters. Special function units (SFU) for transcendentals are limited:
Instruction Throughput (per SM per clock, V100)
| Operation | Unit | Throughput | Notes |
|---|---|---|---|
| FP32 FMA | CUDA cores | 64 ops/clock | Highest throughput |
| FP16 FMA (tensor) | Tensor cores | ~512 ops/clock | 8x FP32 FMA throughput |
| INT32 | INT units | 64 ops/clock | Same as FP32 |
| exp/log/sin/cos | SFU (special function) | 16 ops/clock | 4x slower than FMA |
| Division | SFU | 16 ops/clock | Use reciprocal + multiply instead |
// Slow: IEEE-compliant FP32 division compiles to a multi-instruction sequence
float result = a / b;
// Fast: fast-math division intrinsic (reduced accuracy; unsafe for very large |b|)
float result = __fdividef(a, b);
// Slow: expf uses SFU
float result = expf(x);
// Faster for approximation: use __expf (less accurate but faster)
float result = __expf(x); // Fast math intrinsic
__expf, __logf, __sinf, etc. are faster but less precise than the standard library versions (their error is bounded in relative terms rather than in ULP). For softmax denominators and loss computation, use the standard versions. For intermediate activations where the extra error is acceptable, fast math intrinsics are free performance.
Fix #5: Avoid Warp Divergence
When threads within a warp take different branches, both paths execute serially:
// BAD: 50% of threads idle in each branch (warp divergence)
if (threadIdx.x % 2 == 0) {
expensive_path_A(data); // Half the warp
} else {
expensive_path_B(data); // Other half
}
// BETTER: Reorganize so entire warps take the same branch
int warp_id = threadIdx.x / 32;
if (warp_id % 2 == 0) {
expensive_path_A(data); // Entire warp A
} else {
expensive_path_B(data); // Entire warp B -- no divergence!
}
The Optimization Priority Checklist
Typical Impact by Optimization (ordered by expected speedup)
(chart: x improvement per optimization)
Always fix coalescing first. Then add tiling if there’s data reuse. Then fuse small kernels. Only after these should you worry about occupancy, instruction mix, and divergence.
Conclusion
CUDA optimization is a systematic process: profile with Nsight Compute, read the Speed of Light metrics to identify compute-bound vs memory-bound, then apply fixes in priority order. Memory coalescing and tiling typically deliver 4-8x improvement. Kernel fusion eliminates launch overhead for small operations. Occupancy tuning and instruction optimization provide the final 20-50%. Never optimize without profiler data — intuition about GPU performance is almost always wrong.