An A100 can perform 312 TFLOPS (FP16 with Tensor Cores) but delivers only 2,039 GB/s of memory bandwidth, a ratio of roughly 153 FLOP/byte. Any kernel with arithmetic intensity below that ridge point is memory-bound: the compute units sit idle waiting for data. A typical element-wise kernel (add, ReLU, exp) has arithmetic intensity under 1 FLOP/byte and achieves only 0.6% of peak compute. The optimization challenge is not doing more compute, it is moving data faster: coalesced accesses, vectorized loads, L2 cache exploitation, and async copy. Get these right and you hit 1,800+ GB/s, about 88% of hardware peak.
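
For intuition, take a vector add out[i] = a[i] + b[i]: each element reads 8 bytes, writes 4 bytes, and performs 1 FLOP, so its arithmetic intensity is 1 / 12 ≈ 0.08 FLOP/byte, far below the ~153 FLOP/byte ridge point. No amount of instruction tuning makes that kernel compute-bound; only moving the bytes faster helps.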

Measuring Your Bandwidth

Before optimizing, measure achieved bandwidth:

ncu --metrics \
  dram__bytes_read.sum,dram__bytes_write.sum,\
  l2__read_throughput.avg.pct_of_peak_sustained_active,\
  sm__throughput.avg.pct_of_peak_sustained_active \
  ./my_kernel
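
The dram__bytes metrics give total bytes moved; divide by kernel duration to get achieved GB/s. You can get the same number from inside a benchmark harness with CUDA events. A minimal host-side sketch, where my_kernel, grid, block, in, out, and the "reads and writes n floats" traffic model are placeholders for your actual kernel:

// Rough achieved-bandwidth estimate: bytes the kernel is known to move / elapsed time
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
my_kernel<<<grid, block>>>(in, out, n);      // placeholder launch
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
double bytes = 2.0 * n * sizeof(float);      // e.g. n floats read + n floats written
printf("achieved bandwidth: %.0f GB/s\n", bytes / (ms * 1e-3) / 1e9);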

Bandwidth Efficiency Targets

Achieved % of Peak | Assessment | Action
>85%   | Excellent -- near hardware limit      | Done. Optimize compute instead.
60-85% | Good -- some room for improvement     | Check coalescing and vectorized loads.
30-60% | Mediocre -- significant waste         | Likely coalescing issues or L2 thrashing.
<30%   | Poor -- major access pattern problems | Fix coalescing; check for strided/random access.
Note: A100 HBM peak is 2,039 GB/s; well-optimized streaming kernels achieve ~1,800 GB/s.

Technique 1: Vectorized Loads (float4, int4)

Loading 4 floats at once (float4 = 128 bits) instead of 1 float (32 bits) cuts the number of load/store instructions and memory requests per byte moved, which helps the SM keep the memory pipeline full:

// Scalar: each thread loads 1 float; a warp requests 32 x 4 = 128 bytes
__global__ void kernel_scalar(float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * 2.0f;  // 1 float per thread
}

// Vectorized: each thread loads 1 float4; a warp requests 32 x 16 = 512 bytes
// Here n is the number of float4 elements (the float count divided by 4)
__global__ void kernel_vec4(float4 *in, float4 *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 val = in[idx];  // 4 floats per thread
        val.x *= 2.0f; val.y *= 2.0f; val.z *= 2.0f; val.w *= 2.0f;
        out[idx] = val;
    }
}
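
Calling the vectorized kernel means reinterpreting the pointers and handling any tail elements that do not fill a whole float4. A minimal host-side sketch, assuming in and out were allocated with cudaMalloc (which guarantees 16-byte alignment):

// Process n floats as n/4 float4 elements, then the remainder with the scalar kernel
int n_vec4 = n / 4;   // full float4 elements
int n_tail = n % 4;   // leftover floats (0-3)

int block = 256;
int grid  = (n_vec4 + block - 1) / block;
kernel_vec4<<<grid, block>>>(reinterpret_cast<float4 *>(in),
                             reinterpret_cast<float4 *>(out), n_vec4);

if (n_tail > 0) {
    // Clean up the last few elements with the scalar kernel
    kernel_scalar<<<1, n_tail>>>(in + n_vec4 * 4, out + n_vec4 * 4, n_tail);
}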

Bandwidth by Load Width (A100, element-wise kernel)

float  (32-bit):  1,420 GB/s
float2 (64-bit):  1,650 GB/s
float4 (128-bit): 1,780 GB/s (best -- the widest load a single thread can issue)
💡 When to Use float4

Use float4 loads whenever your data is 16-byte aligned and you process 4 contiguous elements per thread. 128 bits is the widest load or store a single thread can issue, so float4 minimizes instruction and transaction count. Most element-wise kernels (activation functions, normalization, residual add) benefit from vectorization.

Technique 2: L2 Cache Exploitation

The A100 has 40 MB of L2 cache. Kernels that access the same data multiple times can benefit from L2 residency:


L2 Cache Impact by Access Pattern

Pattern | Working Set | L2 Hit Rate | Effective BW | Speedup vs No Cache
Streaming (no reuse)        | >40 MB | <5%  | ~1,800 GB/s | 1x (baseline)
Temporal reuse (fits in L2) | <40 MB | >90% | ~4,500 GB/s | 2.5x
Reduction (multiple passes) | ~20 MB | ~85% | ~4,000 GB/s | 2.2x
Random access (large)       | >40 MB | <10% | ~300 GB/s   | 0.17x (worse than streaming)
// Hint to L2 cache: persist this data
// Ampere+ supports L2 cache persistence control; the window is attached to a stream,
// and kernels launched on that stream inherit the policy.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = kv_cache_ptr;
attr.accessPolicyWindow.num_bytes = kv_cache_size;
attr.accessPolicyWindow.hitRatio  = 1.0f;  // Try to keep all of it in L2
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
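
The window only does useful work if a slice of L2 has been set aside for persisting accesses. A sketch of the surrounding setup and teardown, reusing the hypothetical kv_cache_size and stream from above:

// Reserve part of L2 for persisting accesses before setting the window (Ampere+)
// In real code, clamp the request to cudaDeviceProp::persistingL2CacheMaxSize
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, kv_cache_size);

// ... set the access policy window (above) and launch kernels on `stream` ...

// When finished, evict persisting lines so later kernels get the full 40 MB of L2 back
cudaCtxResetPersistingL2Cache();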

Technique 3: Async Copy (cp.async)

Ampere introduced cp.async for direct global->shared memory copy without going through registers, enabling overlap of copy and compute:

// Traditional: Global -> Registers -> Shared (load and store are serialized)
__shared__ float smem[BLOCK_SIZE];
smem[threadIdx.x] = global[idx];  // Load to register, then store to smem
__syncthreads();

// Async: Global -> Shared directly (can overlap with compute)
#include <cuda/pipeline>
__shared__ float smem[BLOCK_SIZE];
auto pipeline = cuda::make_pipeline();  // per-thread pipeline
pipeline.producer_acquire();
cuda::memcpy_async(&smem[threadIdx.x], &global[idx], sizeof(float), pipeline);
pipeline.producer_commit();
// ... do other work while the copy is in flight ...
pipeline.consumer_wait();  // Wait for the copy to complete
__syncthreads();

Async Copy Impact on Tiled Kernels (A100)

Kernel | Without cp.async | With cp.async | Speedup
GEMM (CUTLASS)    | 280 TFLOPS | 295 TFLOPS | 1.05x
FlashAttention-2  | 145 TFLOPS | 156 TFLOPS | 1.07x
Custom reduction  | 1,600 GB/s | 1,750 GB/s | 1.09x
Note: Async copy helps most when the kernel alternates between loading tiles and computing on them.
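
The structure that benefits is a double-buffered tile loop: kick off the async copy of tile t+1, compute on tile t, then swap buffers. A simplified sketch using the same cuda::pipeline API as above; TILE_SIZE, compute_on_tile, and the one-float-per-thread copy are illustrative (a real kernel copies a full tile per stage):

#include <cuda/pipeline>

#define TILE_SIZE 256   // assumes blockDim.x == TILE_SIZE

__global__ void tiled_kernel(const float *global, int num_tiles) {
    __shared__ float tiles[2][TILE_SIZE];   // two buffers: one loading, one in use
    auto pipe = cuda::make_pipeline();

    // Prefetch tile 0 into buffer 0
    pipe.producer_acquire();
    cuda::memcpy_async(&tiles[0][threadIdx.x], &global[threadIdx.x], sizeof(float), pipe);
    pipe.producer_commit();

    for (int t = 0; t < num_tiles; ++t) {
        // Start fetching tile t+1 into the other buffer while tile t completes
        if (t + 1 < num_tiles) {
            pipe.producer_acquire();
            cuda::memcpy_async(&tiles[(t + 1) & 1][threadIdx.x],
                               &global[(t + 1) * TILE_SIZE + threadIdx.x],
                               sizeof(float), pipe);
            pipe.producer_commit();
        }

        pipe.consumer_wait();             // oldest committed copy (tile t) has landed
        __syncthreads();
        compute_on_tile(tiles[t & 1]);    // hypothetical per-tile compute
        __syncthreads();
        pipe.consumer_release();
    }
}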

Technique 4: Memory Access Pattern Optimization


Common Anti-Patterns and Fixes

Anti-Pattern | Achieved BW | Fix | Fixed BW
AoS (Array of Structures)             | ~400 GB/s   | Convert to SoA                   | ~1,600 GB/s
Column-major access in row-major data | ~200 GB/s   | Transpose or tile through SMEM   | ~1,500 GB/s
Misaligned base address               | ~1,400 GB/s | Align to 128 bytes               | ~1,750 GB/s
Small irregular accesses (gather)     | ~100 GB/s   | Sort + batch into coalesced runs | ~800 GB/s
โš ๏ธ AoS Is the Silent Killer

An Array-of-Structures (AoS) layout, where adjacent threads read the same field of adjacent structs, produces accesses strided by the struct size. A Structure-of-Arrays (SoA) layout, where each field is a contiguous array, enables perfect coalescing (sketched below). For GPU code, always default to SoA. The 4x bandwidth difference is real.
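
A minimal sketch of the two layouts; the Particle/Particles types and field names are illustrative:

// AoS: thread i touches p[i].x, so consecutive threads are 16 bytes apart.
// A warp spans 512 bytes of DRAM but uses only 128 of them.
struct Particle { float x, y, z, w; };
__global__ void scale_aos(Particle *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= 2.0f;
}

// SoA: thread i touches p.x[i], so consecutive threads read adjacent floats.
// A warp maps to fully utilized 128-byte transactions.
struct Particles { float *x, *y, *z, *w; };
__global__ void scale_soa(Particles p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= 2.0f;
}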

Optimization Checklist

Cumulative Impact of Bandwidth Optimizations (element-wise kernel)

Naive (scalar, misaligned):       600 GB/s
+ Fix coalescing:               1,400 GB/s
+ Vectorized loads (float4):    1,700 GB/s
+ Alignment (128B):             1,780 GB/s
A100 HBM peak (hardware limit): 2,039 GB/s

Conclusion

GPU memory bandwidth optimization follows a clear priority: fix coalescing first (4x impact), then use vectorized loads (1.2x), then ensure alignment (1.05x). For tiled kernels, async copy and L2 cache exploitation provide additional gains. The target is over 85% of peak HBM bandwidth for streaming kernels and over 60% for complex access patterns. Always measure with ncu before and after; the achieved bandwidth metric tells you directly how close you are to the hardware limit.