You just wrote a CUDA kernel and it runs at 180 GB/s on hardware capable of 2,039 GB/s — 9% of peak memory bandwidth. Guessing at optimizations wastes time. Instead, run ncu --set full and the profiler tells you exactly what is wrong: 23% memory bandwidth utilization, 89% L2 hit rate (good), but only 12% warp occupancy (bad — you are latency-bound). Fix occupancy by reducing register usage, rerun, and bandwidth jumps to 920 GB/s. This post walks through the systematic optimization workflow: profile, identify the bottleneck, apply the targeted fix, and reprofile to verify.

The Optimization Workflow

The process is always the same: profile -> identify bottleneck -> fix -> reprofile. Never optimize blind.
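Before even opening the profiler, you can sanity-check achieved bandwidth yourself from bytes moved and elapsed time (e.g. from `cudaEvent` timers). A minimal host-side sketch — the element count and timing below are hypothetical, not from a real run:

```cpp
// Achieved bandwidth in GB/s for a kernel that reads and writes `n` floats,
// given its measured elapsed time in milliseconds.
double achieved_gbps(long long n, double elapsed_ms) {
    double bytes = 2.0 * n * sizeof(float);    // one read + one write per element
    return bytes / (elapsed_ms * 1e-3) / 1e9;  // bytes per second -> GB/s
}
```

With, say, 2^28 floats moved in 12 ms, this works out to about 179 GB/s — which against a 2,039 GB/s peak is the ~9% figure from the intro, and your cue to profile.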

# Step 1: Profile the kernel
ncu --set full -o report ./my_application

# Step 2: Look at the Speed of Light (SOL) section
# It tells you: % of peak compute, % of peak memory bandwidth
# Whichever is higher is your bottleneck

Speed of Light Interpretation

| SOL Compute | SOL Memory | Bottleneck | Optimization Path |
| --- | --- | --- | --- |
| over 60% | under 40% | Compute-bound | Reduce instructions, use specialized units (tensor cores) |
| under 40% | over 60% | Memory-bound | Improve coalescing, add caching/tiling, reduce traffic |
| under 40% | under 40% | Latency-bound | Increase occupancy, overlap compute and memory |
| over 60% | over 60% | Well-balanced | Both resources utilized -- minor tuning only |
Note: Most AI kernels are memory-bound or latency-bound. Compute-bound is the goal but rare outside of large GEMM.
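The decision table above is mechanical enough to express as a tiny classifier. The 60%/40% cutoffs are the rules of thumb from the table, not official Nsight Compute thresholds:

```cpp
#include <string>

// Classify a kernel's bottleneck from its Speed of Light percentages.
// Thresholds follow the rule-of-thumb table above.
std::string classify_sol(double compute_pct, double memory_pct) {
    if (compute_pct > 60.0 && memory_pct > 60.0) return "well-balanced";
    if (compute_pct > 60.0) return "compute-bound";
    if (memory_pct > 60.0) return "memory-bound";
    if (compute_pct < 40.0 && memory_pct < 40.0) return "latency-bound";
    return "mixed -- profile deeper";
}
```

The intro's kernel (12% compute SOL, 23% memory SOL) lands squarely in "latency-bound", which is why the fix there was occupancy, not coalescing.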

Fix #1: Memory Coalescing (Biggest Impact)

The single most common problem: non-coalesced global memory access can waste up to 32x of your effective bandwidth:

// BAD: Column-major access in row-major array -- stride = N per thread
__global__ void bad_coalescing(float *matrix, float *output, int N) {
    int row = threadIdx.x;  // Adjacent threads access different rows
    int col = blockIdx.x;
    output[row * N + col] = matrix[row * N + col] * 2.0f;
    // Thread 0: matrix[0], Thread 1: matrix[N], Thread 2: matrix[2N]
    // Stride = N -> non-coalesced!
}

// GOOD: Row-major access -- stride = 1 per thread
__global__ void good_coalescing(float *matrix, float *output, int N) {
    int col = threadIdx.x;  // Adjacent threads access adjacent columns
    int row = blockIdx.x;
    output[row * N + col] = matrix[row * N + col] * 2.0f;
    // Thread 0: matrix[row*N], Thread 1: matrix[row*N+1], ...
    // Stride = 1 -> perfectly coalesced!
}

Impact of Memory Coalescing Fix (achieved bandwidth)

| Access pattern | Achieved bandwidth | % of peak |
| --- | --- | --- |
| Non-coalesced (stride = N) | 45 GB/s | 5% |
| Partially coalesced (stride = 2) | 420 GB/s | -- |
| Fully coalesced (stride = 1) | 830 GB/s | 92% |
How to Detect

In Nsight Compute, check l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum vs l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum. The ideal for 4-byte loads is 4 sectors per request: 32 threads x 4 bytes = 128 bytes = four 32-byte sectors. If sectors/request is well above 4 (the worst case is 32 -- one sector per thread), you have coalescing issues.
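That sectors/request arithmetic can be modeled in a few lines. This is a deliberate simplification of the real coalescer -- one request per 32-thread warp, 32-byte sectors, each thread loading the 4-byte element at index `t * stride`:

```cpp
#include <set>

// Number of distinct 32-byte sectors one warp touches when
// thread t loads the 4-byte element at index t * stride.
int sectors_per_request(long long stride) {
    std::set<long long> sectors;
    for (long long t = 0; t < 32; ++t)
        sectors.insert(t * stride * 4 / 32);  // byte address -> sector index
    return (int)sectors.size();
}
```

Stride 1 gives the ideal 4 sectors/request, stride 2 already doubles the traffic to 8, and a large stride (one sector per thread) hits the worst case of 32.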

Fix #2: Occupancy and Latency Hiding

Low occupancy means the GPU can’t hide memory latency by switching between warps:


Common Occupancy Limiters

| Limiter | How to Detect | Fix |
| --- | --- | --- |
| Too many registers | ncu: Registers per thread over 64 | Use __launch_bounds__, reduce local variables |
| Too much shared memory | ncu: Shared memory over 48 KB/block | Reduce tile size, use multi-phase tiling |
| Block size too large | Block dim over 512 with high resource use | Try 128 or 256 threads per block |
| Block size too small | Block dim under 128 | Increase to at least 128 for latency hiding |
// Limit register usage to improve occupancy
__global__ void __launch_bounds__(256, 4)  // max 256 threads/block, at least 4 blocks/SM
optimized_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // example body
    // The compiler caps register usage (spilling to local memory if
    // needed) to maintain the 4-blocks-per-SM target
}
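The register limiter from the table reduces to simple arithmetic. A sketch assuming 65,536 registers and 2,048 resident threads per SM (typical of recent architectures -- check your GPU's actual limits, and note real occupancy is the minimum over registers, shared memory, and block-count limits):

```cpp
#include <algorithm>

// Occupancy (fraction of 2048 resident threads) when the register
// file is the only limiter.
double register_limited_occupancy(int threads_per_block, int regs_per_thread) {
    const int regs_per_sm = 65536, max_threads_per_sm = 2048;
    int blocks_by_regs = regs_per_sm / (threads_per_block * regs_per_thread);
    int resident = std::min(blocks_by_regs * threads_per_block, max_threads_per_sm);
    return (double)resident / max_threads_per_sm;
}
```

At 256 threads/block, 64 registers/thread allows only 4 blocks (50% occupancy), while trimming to 32 registers/thread doubles that to 100% -- which is exactly what `__launch_bounds__` pressures the compiler toward. In real code, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` performs this calculation across all limiters at once.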

Fix #3: Kernel Fusion (Eliminate Launch Overhead)

Each CUDA kernel launch costs ~5-10 us. For small kernels, this overhead dominates:


Kernel Launch Overhead Analysis

| Kernel | Compute Time | Launch Overhead | Overhead % | Action |
| --- | --- | --- | --- | --- |
| Large GEMM | 2.0 ms | 5 us | 0.25% | No action needed |
| LayerNorm | 50 us | 5 us | 10% | Fuse with adjacent ops |
| Bias add | 8 us | 5 us | 38% | Must fuse -- overhead dominates |
| Activation (GELU) | 12 us | 5 us | 29% | Fuse with preceding GEMM |

The solution: fuse small element-wise operations into the preceding or following GEMM kernel, or combine sequential element-wise ops into a single kernel:

// Unfused: 3 launches, 3 HBM round-trips
bias_add_kernel<<<grid, block>>>(output, bias, n);        // 8 us compute + 5 us launch
gelu_kernel<<<grid, block>>>(output, n);                  // 12 us compute + 5 us launch  
residual_add_kernel<<<grid, block>>>(output, residual, n); // 8 us compute + 5 us launch
// Total: 43 us

// Fused: 1 launch, 1 HBM round-trip
fused_bias_gelu_residual<<<grid, block>>>(output, bias, residual, n);
// Total: 16 us (2.7x faster)
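Per element, the fused kernel chains the three operations in registers -- bias add, GELU, residual add -- touching HBM only once on the way in and once on the way out. A host-side sketch of that per-element expression (the device kernel body applies the same math; the tanh-based GELU approximation is one common choice, not necessarily what any given library uses):

```cpp
#include <cmath>

// One element of fused bias + GELU + residual, using the common
// tanh approximation of GELU. On the GPU this entire chain runs
// in registers between a single load and a single store.
float fused_bias_gelu_residual(float x, float bias, float residual) {
    float v = x + bias;
    float gelu = 0.5f * v
        * (1.0f + tanhf(0.7978845608f * (v + 0.044715f * v * v * v)));
    return gelu + residual;
}
```

The unfused version computes the identical result but writes `v`, then `gelu`, then the sum back to HBM between kernels -- three round-trips for one expression's worth of arithmetic.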

Fix #4: Instruction Mix Optimization

For compute-bound kernels, the instruction mix matters. Special function units (SFU) for transcendentals are limited:


Instruction Throughput (per SM per clock, V100)

| Operation | Unit | Throughput | Notes |
| --- | --- | --- | --- |
| FP32 FMA | CUDA cores | 64 FMA/clock (128 FLOPs) | Baseline throughput |
| FP16 FMA (tensor) | Tensor cores | ~1024 FLOPs/clock | ~8x FP32 FLOP throughput |
| INT32 | INT units | 64 ops/clock | Same as FP32 |
| exp/log/sin/cos | SFU (special function) | 16 ops/clock | 4x slower than FMA |
| Division | SFU | 16 ops/clock | Use fast division or reciprocal + multiply |
// Slow: IEEE-compliant division expands to a multi-instruction sequence
float result = a / b;

// Fast: approximate SFU division (max error ~2 ULP)
float result = __fdividef(a, b);

// Slow: expf is IEEE-accurate, multi-instruction
float result = expf(x);

// Faster approximation: __expf maps directly to the SFU
float result = __expf(x);  // Fast math intrinsic, less accurate
⚠️ Fast Math Trade-offs

__expf, __logf, __sinf, etc. are faster but less accurate than the standard library versions -- and for __expf the error bound grows with the magnitude of the argument. For softmax denominators and loss computation, use the standard versions. For intermediate activations where a small relative error is acceptable, fast math intrinsics are nearly free performance.

Fix #5: Avoid Warp Divergence

When threads within a warp take different branches, both paths execute serially:

// BAD: 50% of threads idle in each branch (warp divergence)
if (threadIdx.x % 2 == 0) {
    expensive_path_A(data);  // Half the warp
} else {
    expensive_path_B(data);  // Other half
}

// BETTER: Reorganize so entire warps take the same branch
int warp_id = threadIdx.x / 32;
if (warp_id % 2 == 0) {
    expensive_path_A(data);  // Entire warp A
} else {
    expensive_path_B(data);  // Entire warp B -- no divergence!
}
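The cost model behind this is simple: a warp needs one serialized pass per distinct branch taken within it. A sketch with a hypothetical `passes_per_warp` helper comparing the two patterns above:

```cpp
#include <set>

// Serialized passes a 32-thread warp needs when the thread with
// global id `tid` takes branch chooser(tid). One distinct branch
// per warp = no divergence; two = both paths execute serially.
int passes_per_warp(int warp_id, int (*chooser)(int)) {
    std::set<int> branches;
    for (int lane = 0; lane < 32; ++lane)
        branches.insert(chooser(warp_id * 32 + lane));
    return (int)branches.size();
}

int by_parity(int tid) { return tid % 2; }        // the BAD pattern above
int by_warp(int tid)   { return (tid / 32) % 2; } // the GOOD pattern
```

Branching on thread parity forces 2 passes in every warp; branching on warp id keeps every warp at 1 pass, because all 32 lanes agree.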

The Optimization Priority Checklist

Typical Impact by Optimization (ordered by expected speedup)

| Optimization | Typical improvement | Notes |
| --- | --- | --- |
| Fix memory coalescing | ~8x | Fix first -- dominates |
| Add shared memory tiling | ~4x | |
| Fuse small kernels | ~2.5x | |
| Fix occupancy | ~1.5x | |
| Use tensor cores | ~1.5x | If applicable |
| Fast math intrinsics | ~1.2x | |
| Reduce divergence | ~1.1x | |

Always fix coalescing first. Then add tiling if there’s data reuse. Then fuse small kernels. Only after these should you worry about occupancy, instruction mix, and divergence.

Conclusion

CUDA optimization is a systematic process: profile with Nsight Compute, read the Speed of Light metrics to identify compute-bound vs memory-bound, then apply fixes in priority order. Memory coalescing and tiling typically deliver 4-8x improvement. Kernel fusion eliminates launch overhead for small operations. Occupancy tuning and instruction optimization provide the final 20-50%. Never optimize without profiler data — intuition about GPU performance is almost always wrong.