An A100 can perform 312 TFLOPS (FP16 with Tensor Cores) but only delivers 2,039 GB/s of memory bandwidth. That is a ratio of roughly 153 FLOP/byte. Any kernel with arithmetic intensity below that ridge point is memory-bound: the compute units sit idle waiting for data. A typical element-wise kernel (add, ReLU, exp) has arithmetic intensity of roughly 1 FLOP/byte or less and reaches only about 0.6% of peak compute. The optimization challenge is not doing more compute, it is moving data faster: coalesced accesses, vectorized loads, L2 cache exploitation, and async copy. Get these right and you hit 1,800+ GB/s, about 88% of hardware peak.
Measuring Your Bandwidth
Before optimizing, measure achieved bandwidth:
ncu --metrics \
dram__bytes_read.sum,dram__bytes_write.sum,\
l2__read_throughput.avg.pct_of_peak_sustained_active,\
sm__throughput.avg.pct_of_peak_sustained_active \
./my_kernel
Bandwidth Efficiency Targets
| Achieved % of Peak | Assessment | Action |
|---|---|---|
| over 85% | Excellent -- near hardware limit | Done. Optimize compute instead. |
| 60-85% | Good -- some room for improvement | Check coalescing and vectorized loads |
| 30-60% | Mediocre -- significant waste | Likely coalescing issues or L2 thrashing |
| under 30% | Poor -- major access pattern problems | Fix coalescing, check for strided/random access |
Technique 1: Vectorized Loads (float4, int4)
Loading 4 floats at once (float4 = 128 bits) instead of 1 float (32 bits) reduces the number of memory transactions and improves coalescing:
// Scalar: 32 threads x 4 bytes = 128 bytes per transaction
__global__ void kernel_scalar(float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * 2.0f; // 1 float per thread
}
// Vectorized: 32 threads x 16 bytes = 512 bytes per transaction
__global__ void kernel_vec4(float4 *in, float4 *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n / 4) {                 // n/4 float4 elements
        float4 val = in[idx];          // 4 floats per thread
        val.x *= 2.0f; val.y *= 2.0f; val.z *= 2.0f; val.w *= 2.0f;
        out[idx] = val;
    }
}
Bandwidth by Load Width (A100, element-wise kernel) [chart: GB/s by load width]
Use float4 loads whenever your data is 16-byte aligned and you process 4 contiguous elements per thread. This matches the GPU's widest per-thread load (a single 128-bit instruction) and minimizes transaction count. Most element-wise kernels (activation functions, normalization, residual add) benefit from vectorization.
Technique 2: L2 Cache Exploitation
The A100 has 40 MB of L2 cache. Kernels that access the same data multiple times can benefit from L2 residency:
L2 Cache Impact by Access Pattern
| Pattern | Working Set | L2 Hit Rate | Effective BW | Speedup vs No Cache |
|---|---|---|---|---|
| Streaming (no reuse) | over 40 MB | under 5% | ~1,800 GB/s | 1x (baseline) |
| Temporal reuse (fits in L2) | under 40 MB | over 90% | ~4,500 GB/s | 2.5x |
| Reduction (multiple passes) | ~20 MB | ~85% | ~4,000 GB/s | 2.2x |
| Random access (large) | over 40 MB | under 10% | ~300 GB/s | 0.17x (worse than streaming) |
// Hint to L2 cache: persist this data
// Ampere+ supports L2 cache persistence via an access policy window,
// attached to a stream with cudaStreamSetAttribute
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, kv_cache_size);
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr = kv_cache_ptr;
attr.accessPolicyWindow.num_bytes = kv_cache_size;
attr.accessPolicyWindow.hitRatio = 1.0f; // Try to keep it all in L2
attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
Technique 3: Async Copy (cp.async)
Ampere introduced cp.async for direct global->shared memory copy without going through registers, enabling overlap of copy and compute:
// Traditional: Global -> Registers -> Shared (serialized)
__shared__ float smem[BLOCK_SIZE];
smem[threadIdx.x] = global[idx]; // Load to register, then store to smem
__syncthreads();
// Async: Global -> Shared directly (can overlap with compute)
#include <cuda/pipeline>
__shared__ float smem[BLOCK_SIZE];
auto pipe = cuda::make_pipeline(); // per-thread pipeline
pipe.producer_acquire();
cuda::memcpy_async(&smem[threadIdx.x], &global[idx], sizeof(float), pipe);
pipe.producer_commit();
// ... do other work while copy happens ...
pipe.consumer_wait(); // Wait for the staged copy to complete
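The payoff from cp.async comes from double buffering: prefetch tile i+1 while computing on tile i. A device-side sketch using the same libcudacxx API, with a hypothetical element-wise tile kernel standing in for real compute (launch with TILE threads per block):

```cuda
#include <cuda/pipeline>

#define TILE 256

// Double buffering: stage tile t+1 with cp.async while computing on tile t.
__global__ void process_tiles(const float *in, float *out, int num_tiles) {
    __shared__ float smem[2][TILE];        // two buffers: ping-pong
    auto pipe = cuda::make_pipeline();     // per-thread pipeline

    // Prefetch tile 0 into buffer 0
    pipe.producer_acquire();
    cuda::memcpy_async(&smem[0][threadIdx.x], &in[threadIdx.x],
                       sizeof(float), pipe);
    pipe.producer_commit();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, next = (t + 1) & 1;
        if (t + 1 < num_tiles) {           // kick off the next tile's copy
            pipe.producer_acquire();
            cuda::memcpy_async(&smem[next][threadIdx.x],
                               &in[(t + 1) * TILE + threadIdx.x],
                               sizeof(float), pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();              // oldest staged copy (tile t) done
        out[t * TILE + threadIdx.x] = smem[cur][threadIdx.x] * 2.0f;
        pipe.consumer_release();           // free the stage for reuse
    }
}
```

This is a sketch, not a tuned kernel: real tiled kernels use block-scope pipelines with `__pipeline_memcpy_async`-style staging and per-tile `__syncthreads()` when threads read each other's shared-memory elements.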
Async Copy Impact on Tiled Kernels (A100)
| Kernel | Without cp.async | With cp.async | Speedup |
|---|---|---|---|
| GEMM (CUTLASS) | 280 TFLOPS | 295 TFLOPS | 1.05x |
| FlashAttention-2 | 145 TFLOPS | 156 TFLOPS | 1.07x |
| Custom reduction | 1,600 GB/s | 1,750 GB/s | 1.09x |
Technique 4: Memory Access Pattern Optimization
Common Anti-Patterns and Fixes
| Anti-Pattern | Achieved BW | Fix | Fixed BW |
|---|---|---|---|
| AoS (Array of Structures) | ~400 GB/s | Convert to SoA | ~1,600 GB/s |
| Column-major access in row-major data | ~200 GB/s | Transpose or tile through SMEM | ~1,500 GB/s |
| Misaligned base address | ~1,400 GB/s | Align to 128 bytes | ~1,750 GB/s |
| Small irregular accesses (gather) | ~100 GB/s | Sort + batch into coalesced runs | ~800 GB/s |
In an Array-of-Structures (AoS) layout, consecutive threads read the same field of adjacent structs, so each warp's accesses are strided by the struct size. A Structure-of-Arrays (SoA) layout, where each field is a contiguous array, enables perfect coalescing. For GPU code, always default to SoA. The 4x bandwidth difference is real.
Optimization Checklist
Cumulative Impact of Bandwidth Optimizations (element-wise kernel) [chart: GB/s after each optimization]
Conclusion
GPU memory bandwidth optimization follows a clear priority: fix coalescing first (4x impact), then use vectorized loads (1.2x), then ensure alignment (1.05x). For tiled kernels, async copy and L2 cache exploitation provide additional gains. The target is over 85% of peak HBM bandwidth for streaming kernels and over 60% for complex access patterns. Always measure with ncu before and after: the achieved bandwidth metric directly tells you how close you are to the hardware limit.