You just wrote a CUDA kernel and it runs at 180 GB/s on hardware capable of 2,039 GB/s — 9% of peak memory bandwidth. Guessing at optimizations wastes time. Instead, run ncu --set full and the profiler tells you exactly what is wrong: 23% memory bandwidth utilization, 89% L2 hit rate (good), but only 12% warp occupancy (bad — you are latency-bound). Fix occupancy by reducing register usage, rerun, and bandwidth jumps to 920 GB/s. This post walks through the systematic optimization workflow: profile, identify the bottleneck, apply the targeted fix, and reprofile to verify.

The Optimization Workflow

The process is always the same: profile -> identify bottleneck -> fix -> reprofile. Never optimize blind.
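Before even opening the profiler, you can sanity-check achieved bandwidth yourself from bytes moved and elapsed time (e.g. from `cudaEvent` timers). A minimal host-side sketch — the element count and timing below are hypothetical, not from a real run:

```cpp
// Achieved bandwidth in GB/s for a kernel that reads and writes `n` floats,
// given its measured elapsed time in milliseconds.
double achieved_gbps(long long n, double elapsed_ms) {
    double bytes = 2.0 * n * sizeof(float);    // one read + one write per element
    return bytes / (elapsed_ms * 1e-3) / 1e9;  // bytes per second -> GB/s
}
```

With, say, 2^28 floats moved in 12 ms, this works out to about 179 GB/s — which against a 2,039 GB/s peak is the ~9% figure from the intro, and your cue to profile.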

# Step 1: Profile the kernel
ncu --set full -o report ./my_application

# Step 2: Look at the Speed of Light (SOL) section
# It tells you: % of peak compute, % of peak memory bandwidth
# Whichever is higher is your bottleneck

Speed of Light Interpretation

| SOL Compute | SOL Memory | Bottleneck | Optimization Path |
| --- | --- | --- | --- |
| over 60% | under 40% | Compute-bound | Reduce instructions, use specialized units (tensor cores) |
| under 40% | over 60% | Memory-bound | Improve coalescing, add caching/tiling, reduce traffic |
| under 40% | under 40% | Latency-bound | Increase occupancy, overlap compute and memory |
| over 60% | over 60% | Well-balanced | Both resources utilized -- minor tuning only |
Note: Most AI kernels are memory-bound or latency-bound. Compute-bound is the goal but rare outside of large GEMM.
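The decision table above is mechanical enough to express as a tiny classifier. The 60%/40% cutoffs are the rules of thumb from the table, not official Nsight Compute thresholds:

```cpp
#include <string>

// Classify a kernel's bottleneck from its Speed of Light percentages.
// Thresholds follow the rule-of-thumb table above.
std::string classify_sol(double compute_pct, double memory_pct) {
    if (compute_pct > 60.0 && memory_pct > 60.0) return "well-balanced";
    if (compute_pct > 60.0) return "compute-bound";
    if (memory_pct > 60.0) return "memory-bound";
    if (compute_pct < 40.0 && memory_pct < 40.0) return "latency-bound";
    return "mixed -- profile deeper";
}
```

The intro's kernel (12% compute SOL, 23% memory SOL) lands squarely in "latency-bound", which is why the fix there was occupancy, not coalescing.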

Fix #1: Memory Coalescing (Biggest Impact)

The single most common problem: non-coalesced global memory access can waste up to 32x of your effective bandwidth:

// BAD: Column-major access in row-major array -- stride = N per thread
__global__ void bad_coalescing(float *matrix, float *output, int N) {
    int row = threadIdx.x;  // Adjacent threads access different rows
    int col = blockIdx.x;
    output[row * N + col] = matrix[row * N + col] * 2.0f;
    // Thread 0: matrix[0], Thread 1: matrix[N], Thread 2: matrix[2N]
    // Stride = N -> non-coalesced!
}

// GOOD: Row-major access -- stride = 1 per thread
__global__ void good_coalescing(float *matrix, float *output, int N) {
    int col = threadIdx.x;  // Adjacent threads access adjacent columns
    int row = blockIdx.x;
    output[row * N + col] = matrix[row * N + col] * 2.0f;
    // Thread 0: matrix[row*N], Thread 1: matrix[row*N+1], ...
    // Stride = 1 -> perfectly coalesced!
}

Impact of Memory Coalescing Fix (achieved bandwidth)

| Access pattern | Achieved bandwidth | % of peak |
| --- | --- | --- |
| Non-coalesced (stride = N) | 45 GB/s | 5% |
| Partially coalesced (stride = 2) | 420 GB/s | -- |
| Fully coalesced (stride = 1) | 830 GB/s | 92% |
How to Detect

In Nsight Compute, check l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum vs l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum. The ideal for 4-byte loads is 4 sectors per request: 32 threads x 4 bytes = 128 bytes = four 32-byte sectors. If sectors/request is well above 4 (the worst case is 32 -- one sector per thread), you have coalescing issues.
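That sectors/request arithmetic can be modeled in a few lines. This is a deliberate simplification of the real coalescer -- one request per 32-thread warp, 32-byte sectors, each thread loading the 4-byte element at index `t * stride`:

```cpp
#include <set>

// Number of distinct 32-byte sectors one warp touches when
// thread t loads the 4-byte element at index t * stride.
int sectors_per_request(long long stride) {
    std::set<long long> sectors;
    for (long long t = 0; t < 32; ++t)
        sectors.insert(t * stride * 4 / 32);  // byte address -> sector index
    return (int)sectors.size();
}
```

Stride 1 gives the ideal 4 sectors/request, stride 2 already doubles the traffic to 8, and a large stride (one sector per thread) hits the worst case of 32.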

Fix #2: Occupancy and Latency Hiding

Low occupancy means the GPU can’t hide memory latency by switching between warps:


Common Occupancy Limiters

| Limiter | How to Detect | Fix |
| --- | --- | --- |
| Too many registers | ncu: Registers per thread over 64 | Use __launch_bounds__, reduce local variables |
| Too much shared memory | ncu: Shared memory over 48 KB/block | Reduce tile size, use multi-phase tiling |
| Block size too large | Block dim over 512 with high resource use | Try 128 or 256 threads per block |
| Block size too small | Block dim under 128 | Increase to at least 128 for latency hiding |
// Limit register usage to improve occupancy
__global__ void __launch_bounds__(256, 4)  // max 256 threads/block, at least 4 blocks/SM
optimized_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // example body
    // The compiler caps register usage (spilling to local memory if
    // needed) to maintain the 4-blocks-per-SM target
}
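The register limiter from the table reduces to simple arithmetic. A sketch assuming 65,536 registers and 2,048 resident threads per SM (typical of recent architectures -- check your GPU's actual limits, and note real occupancy is the minimum over registers, shared memory, and block-count limits):

```cpp
#include <algorithm>

// Occupancy (fraction of 2048 resident threads) when the register
// file is the only limiter.
double register_limited_occupancy(int threads_per_block, int regs_per_thread) {
    const int regs_per_sm = 65536, max_threads_per_sm = 2048;
    int blocks_by_regs = regs_per_sm / (threads_per_block * regs_per_thread);
    int resident = std::min(blocks_by_regs * threads_per_block, max_threads_per_sm);
    return (double)resident / max_threads_per_sm;
}
```

At 256 threads/block, 64 registers/thread allows only 4 blocks (50% occupancy), while trimming to 32 registers/thread doubles that to 100% -- which is exactly what `__launch_bounds__` pressures the compiler toward. In real code, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` performs this calculation across all limiters at once.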

Fix #3: Kernel Fusion (Eliminate Launch Overhead)

Each CUDA kernel launch costs ~5-10 us. For small kernels, this overhead dominates:


Kernel Launch Overhead Analysis

| Kernel | Compute Time | Launch Overhead | Overhead % | Action |
| --- | --- | --- | --- | --- |
| Large GEMM | 2.0 ms | 5 us | 0.25% | No action needed |
| LayerNorm | 50 us | 5 us | 10% | Fuse with adjacent ops |
| Bias add | 8 us | 5 us | 38% | Must fuse -- overhead dominates |
| Activation (GELU) | 12 us | 5 us | 29% | Fuse with preceding GEMM |

The solution: fuse small element-wise operations into the preceding or following GEMM kernel, or combine sequential element-wise ops into a single kernel:

// Unfused: 3 launches, 3 HBM round-trips
bias_add_kernel<<<grid, block>>>(output, bias, n);        // 8 us compute + 5 us launch
gelu_kernel<<<grid, block>>>(output, n);                  // 12 us compute + 5 us launch  
residual_add_kernel<<<grid, block>>>(output, residual, n); // 8 us compute + 5 us launch
// Total: 43 us

// Fused: 1 launch, 1 HBM round-trip
fused_bias_gelu_residual<<<grid, block>>>(output, bias, residual, n);
// Total: 16 us (2.7x faster)
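Per element, the fused kernel chains the three operations in registers -- bias add, GELU, residual add -- touching HBM only once on the way in and once on the way out. A host-side sketch of that per-element expression (the device kernel body applies the same math; the tanh-based GELU approximation is one common choice, not necessarily what any given library uses):

```cpp
#include <cmath>

// One element of fused bias + GELU + residual, using the common
// tanh approximation of GELU. On the GPU this entire chain runs
// in registers between a single load and a single store.
float fused_bias_gelu_residual(float x, float bias, float residual) {
    float v = x + bias;
    float gelu = 0.5f * v
        * (1.0f + tanhf(0.7978845608f * (v + 0.044715f * v * v * v)));
    return gelu + residual;
}
```

The unfused version computes the identical result but writes `v`, then `gelu`, then the sum back to HBM between kernels -- three round-trips for one expression's worth of arithmetic.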

Fix #4: Instruction Mix Optimization

For compute-bound kernels, the instruction mix matters. Special function units (SFU) for transcendentals are limited:


Instruction Throughput (per SM per clock, V100)

| Operation | Unit | Throughput | Notes |
| --- | --- | --- | --- |
| FP32 FMA | CUDA cores | 64 FMA/clock (128 FLOPs) | Baseline throughput |
| FP16 FMA (tensor) | Tensor cores | ~1024 FLOPs/clock | ~8x FP32 FLOP throughput |
| INT32 | INT units | 64 ops/clock | Same as FP32 |
| exp/log/sin/cos | SFU (special function) | 16 ops/clock | 4x slower than FMA |
| Division | SFU | 16 ops/clock | Use fast division or reciprocal + multiply |
// Slow: IEEE-compliant division expands to a multi-instruction sequence
float result = a / b;

// Fast: approximate SFU division (max error ~2 ULP)
float result = __fdividef(a, b);

// Slow: expf is IEEE-accurate, multi-instruction
float result = expf(x);

// Faster approximation: __expf maps directly to the SFU
float result = __expf(x);  // Fast math intrinsic, less accurate
⚠️ Fast Math Trade-offs

__expf, __logf, __sinf, etc. are faster but less accurate than the standard library versions -- and for __expf the error bound grows with the magnitude of the argument. For softmax denominators and loss computation, use the standard versions. For intermediate activations where a small relative error is acceptable, fast math intrinsics are nearly free performance.

Fix #5: Avoid Warp Divergence

When threads within a warp take different branches, both paths execute serially:

// BAD: 50% of threads idle in each branch (warp divergence)
if (threadIdx.x % 2 == 0) {
    expensive_path_A(data);  // Half the warp
} else {
    expensive_path_B(data);  // Other half
}

// BETTER: Reorganize so entire warps take the same branch
int warp_id = threadIdx.x / 32;
if (warp_id % 2 == 0) {
    expensive_path_A(data);  // Entire warp A
} else {
    expensive_path_B(data);  // Entire warp B -- no divergence!
}
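The cost model behind this is simple: a warp needs one serialized pass per distinct branch taken within it. A sketch with a hypothetical `passes_per_warp` helper comparing the two patterns above:

```cpp
#include <set>

// Serialized passes a 32-thread warp needs when the thread with
// global id `tid` takes branch chooser(tid). One distinct branch
// per warp = no divergence; two = both paths execute serially.
int passes_per_warp(int warp_id, int (*chooser)(int)) {
    std::set<int> branches;
    for (int lane = 0; lane < 32; ++lane)
        branches.insert(chooser(warp_id * 32 + lane));
    return (int)branches.size();
}

int by_parity(int tid) { return tid % 2; }        // the BAD pattern above
int by_warp(int tid)   { return (tid / 32) % 2; } // the GOOD pattern
```

Branching on thread parity forces 2 passes in every warp; branching on warp id keeps every warp at 1 pass, because all 32 lanes agree.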

The Optimization Priority Checklist

Typical Impact by Optimization (ordered by expected speedup)

| Optimization | Typical improvement | Notes |
| --- | --- | --- |
| Fix memory coalescing | ~8x | Fix first -- dominates |
| Add shared memory tiling | ~4x | |
| Fuse small kernels | ~2.5x | |
| Fix occupancy | ~1.5x | |
| Use tensor cores | ~1.5x | If applicable |
| Fast math intrinsics | ~1.2x | |
| Reduce divergence | ~1.1x | |

Always fix coalescing first. Then add tiling if there’s data reuse. Then fuse small kernels. Only after these should you worry about occupancy, instruction mix, and divergence.

Conclusion

CUDA optimization is a systematic process: profile with Nsight Compute, read the Speed of Light metrics to identify compute-bound vs memory-bound, then apply fixes in priority order. Memory coalescing and tiling typically deliver 4-8x improvement. Kernel fusion eliminates launch overhead for small operations. Occupancy tuning and instruction optimization provide the final 20-50%. Never optimize without profiler data — intuition about GPU performance is almost always wrong.