You spend a week hand-tuning tensor core usage in a LayerNorm kernel, improving FLOPS by 40%. Throughput increases by 2%. The kernel was memory-bound — it achieves 85% of peak bandwidth and 12% of peak compute. Every optimization hour spent on compute was wasted. The roofline model would have told you this in five minutes by plotting operational intensity against hardware ceilings. A kernel below the ridge point is memory-bound; optimize memory. A kernel above the ridge is compute-bound; optimize compute. For LLM decode, 90% of kernels are memory-bound. For LLM prefill, 60% are compute-bound. Know which you are optimizing before you start.
For LLM inference, most operations in the decode phase are memory-bound (small batch sizes, large weight matrices), while most operations in the prefill phase are compute-bound (large batch sizes, matrix multiplies dominate). This post develops the methodology for determining the bound, diagnosing the bottleneck, and choosing the optimization strategy.
The Roofline Model
Definition
The roofline model plots achievable performance (FLOPS) as a function of arithmetic intensity (FLOPS per byte of memory traffic):
Attainable FLOPS = min(Peak FLOPS, AI x Peak Bandwidth)
The arithmetic intensity (AI) of a kernel is:
AI = total FLOPs executed / total bytes moved to and from memory
Roofline for H100 SXM:
Peak FP16 Tensor Core: 990 TFLOPS
Peak INT8 Tensor Core: 1979 TFLOPS
HBM3 Bandwidth: 3.35 TB/s
Ridge point (FP16): 990e12 / 3.35e12 = 295 FLOPS/byte
Ridge point (INT8): 1979e12 / 3.35e12 = 591 FLOPS/byte
If AI < 295: kernel is memory-bound (FP16)
If AI > 295: kernel is compute-bound (FP16)
// The ridge point is EXTREMELY high on modern GPUs.
// This means most kernels are memory-bound unless they have
// very high arithmetic intensity (like large GEMMs).
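The ridge-point logic can be written down directly. A minimal sketch in Python, using the H100 FP16 numbers above:

```python
def roofline_attainable_tflops(ai, peak_tflops=990.0, bandwidth_tbs=3.35):
    """Attainable TFLOPS = min(peak compute, AI * memory bandwidth)."""
    return min(peak_tflops, ai * bandwidth_tbs)

ridge = 990.0 / 3.35  # ~295.5 FLOPS/byte

print(roofline_attainable_tflops(1.0))    # decode GEMV: bandwidth-limited -> 3.35 TFLOPS
print(roofline_attainable_tflops(410.0))  # prefill GEMM: compute-limited -> 990 TFLOPS
```

Any kernel left of the ridge can only climb the roofline by moving fewer bytes or moving them faster; more FLOPS capacity is unreachable.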
Arithmetic Intensity of Common LLM Operations
# Calculating arithmetic intensity for LLM operations
def gemm_arithmetic_intensity(M, N, K, dtype_bytes):
    """
    GEMM: C[M,N] = A[M,K] * B[K,N]
    FLOPs: 2 * M * N * K (multiply-add)
    Bytes: (M*K + K*N + M*N) * dtype_bytes (read A, read B, write C)
    """
    flops = 2 * M * N * K
    bytes_transferred = (M * K + K * N + M * N) * dtype_bytes
    return flops / bytes_transferred
# LLM decode (batch_size=1, hidden_dim=4096):
# Linear projection: M=1, N=4096, K=4096, FP16
ai_decode = gemm_arithmetic_intensity(1, 4096, 4096, 2)
# FLOPs: 2 * 1 * 4096 * 4096 = 33.6M
# Bytes: (1*4096 + 4096*4096 + 1*4096) * 2 = 33.6M bytes
# AI = 33.6M / 33.6M = 1.0 FLOPS/byte -> DEEPLY memory-bound
# LLM prefill (batch_size=512, hidden_dim=4096):
# Linear projection: M=512, N=4096, K=4096, FP16
ai_prefill = gemm_arithmetic_intensity(512, 4096, 4096, 2)
# FLOPs: 2 * 512 * 4096 * 4096 = 17.2G
# Bytes: (512*4096 + 4096*4096 + 512*4096) * 2 = 41.9M bytes
# AI = 17.2G / 41.9M = 410 FLOPS/byte -> Compute-bound!
# Where does the transition happen? For M << N, K the weight read dominates:
#   bytes ~= K * N * dtype_bytes
#   AI ~= 2*M*N*K / (K*N*dtype_bytes) = 2*M / dtype_bytes
# For FP16 (2 bytes): AI ~= M, so the ridge point (295 FLOPS/byte) is
# crossed around M ~= 295 rows (tokens) per GEMM
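As a numeric check, the crossover batch size can be found by sweeping M with the intensity formula (redefined here so the snippet runs standalone). The exact crossover sits somewhat above 295 because the activation read and output write are not quite negligible:

```python
def gemm_ai(M, N, K, dtype_bytes):
    # AI = FLOPs / bytes for C[M,N] = A[M,K] * B[K,N]
    return (2 * M * N * K) / ((M * K + K * N + M * N) * dtype_bytes)

RIDGE = 990e12 / 3.35e12  # H100 FP16 ridge point, ~295 FLOPS/byte

# Smallest M at which a 4096x4096 FP16 GEMM crosses the ridge
M = 1
while gemm_ai(M, 4096, 4096, 2) < RIDGE:
    M += 1
print(M)  # -> 346
```

Below roughly 350 rows, adding tensor-core throughput to this GEMM cannot help; above it, adding bandwidth cannot.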
def softmax_arithmetic_intensity(batch_size, seq_len, dtype_bytes):
    """
    Softmax over seq_len dimension:
    FLOPs: ~5 * batch_size * seq_len (max, sub, exp, sum, div)
    Bytes: 2 * batch_size * seq_len * dtype_bytes (read + write)
    """
    flops = 5 * batch_size * seq_len
    bytes_transferred = 2 * batch_size * seq_len * dtype_bytes
    return flops / bytes_transferred
# AI = 5 / (2 * 2) = 1.25 FLOPS/byte for FP16
# Softmax is ALWAYS memory-bound
def layernorm_arithmetic_intensity(batch_size, hidden_dim, dtype_bytes):
    """
    LayerNorm:
    FLOPs: ~8 * batch_size * hidden_dim (mean, variance, normalize, affine)
    Bytes: ~3 * batch_size * hidden_dim * dtype_bytes (read input, read gamma/beta, write output)
    """
    flops = 8 * batch_size * hidden_dim
    bytes_transferred = 3 * batch_size * hidden_dim * dtype_bytes
    return flops / bytes_transferred
# AI = 8 / (3 * 2) = 1.33 FLOPS/byte for FP16
# LayerNorm is ALWAYS memory-bound
def attention_arithmetic_intensity(batch_size, n_heads, seq_len, head_dim, dtype_bytes):
    """
    Self-attention (Q*K^T, softmax, attn*V):
    FLOPs: 2 * 2 * batch_size * n_heads * seq_len^2 * head_dim (QK^T and AV GEMMs)
    Bytes: batch_size * n_heads * (4*seq_len*head_dim + 2*seq_len^2) * dtype_bytes
           (read Q, K, V; write + read back attention scores; write output)
    """
    flops = 4 * batch_size * n_heads * seq_len * seq_len * head_dim
    bytes_transferred = batch_size * n_heads * (
        3 * seq_len * head_dim +   # Q, K, V
        2 * seq_len * seq_len +    # attention scores (write + read back)
        seq_len * head_dim         # output
    ) * dtype_bytes
    return flops / bytes_transferred
Arithmetic Intensity of LLM Operations (FP16, d=4096)
| Operation | M (batch*seq) | AI (FLOPS/byte) | Bound | H100 Utilization |
|---|---|---|---|---|
| Linear (decode) | 1 | 1.0 | Memory | 0.3% of peak FLOPS |
| Linear (decode BS=8) | 8 | 7.9 | Memory | 2.7% |
| Linear (decode BS=32) | 32 | 30.8 | Memory | 10.4% |
| Linear (prefill 512) | 512 | 410 | Compute | 98% (above ridge) |
| LayerNorm | any | 1.33 | Memory | 0.5% |
| Softmax | any | 1.25 | Memory | 0.4% |
| RoPE | any | 2.5 | Memory | 0.8% |
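The utilization column is just the roofline bound restated: at full bandwidth, a memory-bound kernel can reach at most AI x bandwidth FLOPS. A quick check against the table using the H100 FP16 numbers from above:

```python
PEAK_TFLOPS = 990.0   # H100 FP16 tensor core peak
BW_TBS = 3.35         # HBM3 bandwidth

def max_flops_utilization(ai):
    """Upper bound on compute utilization for a kernel at full bandwidth."""
    return min(1.0, ai * BW_TBS / PEAK_TFLOPS)

for op, ai in [("Linear (decode)", 1.0), ("Linear (decode BS=32)", 30.8),
               ("LayerNorm", 1.33), ("Softmax", 1.25)]:
    print(f"{op}: {max_flops_utilization(ai):.1%}")
# Linear (decode): 0.3%, Linear (decode BS=32): 10.4% -- matching the table
```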
Nsight Compute Metrics for Memory Analysis
Key Metrics to Examine
Nsight Compute (ncu) provides detailed metrics for diagnosing memory bottlenecks:
# Profile a kernel with memory throughput metrics
# ncu --set full --kernel-name "gemv_kernel" ./my_program
# (--set full collects the full metric set; specific counters can be
# requested individually with --metrics <comma-separated list>)
# Key metrics and what they mean:
# dram__bytes_read.sum / dram__bytes_write.sum
# Total bytes read/written from HBM
# Compare with theoretical minimum to find wasted bandwidth
# lts__throughput.avg.pct_of_peak_sustained_elapsed
# L2 cache throughput as percentage of peak
# High value (>80%) = memory subsystem is saturated
# sm__throughput.avg.pct_of_peak_sustained_elapsed
# SM compute throughput as percentage of peak
# High value (>80%) = compute is saturated
# Memory bound indicator:
# If l2__throughput > sm__throughput: memory-bound
# If sm__throughput > l2__throughput: compute-bound
# dram__sectors_read.sum / dram__sectors_write.sum
# Actual DRAM sectors accessed (32 bytes each)
# Divide by theoretical minimum to get memory efficiency
# l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum
# Shared memory load wavefronts - indicates bank conflicts
# smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg
# Average bytes utilized per sector for global loads
# 32 = perfect coalescing, <32 = wasted bandwidth
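The sector counters translate directly into a memory-efficiency number. A small sketch of the arithmetic (the 32-byte sector size matches the ncu metrics above; the byte counts are illustrative):

```python
SECTOR_BYTES = 32  # DRAM sectors are 32 bytes each

def memory_efficiency(sectors_accessed, useful_bytes):
    """Fraction of fetched DRAM bytes the kernel actually consumed."""
    fetched_bytes = sectors_accessed * SECTOR_BYTES
    return useful_bytes / fetched_bytes

# Example: a kernel that logically needs 64 MB but touches 4M sectors
# (128 MB fetched) is wasting half its bandwidth
eff = memory_efficiency(4 * 1024 * 1024, 64 * 1024 * 1024)
print(f"{eff:.0%}")  # -> 50%
```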
Diagnosing Uncoalesced Access
// Example: Diagnosing memory access patterns with ncu output
// Kernel with coalesced access (good):
__global__ void coalesced_load(float* data, float* output, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        output[idx] = data[idx] * 2.0f;
    }
}
// ncu metric: smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg = 32
// All 32 bytes of each sector are utilized
// Kernel with strided access (bad):
__global__ void strided_load(float* data, float* output, int N, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int src_idx = idx * stride;  // Stride > 1 causes uncoalesced access
    if (src_idx < N) {
        output[idx] = data[src_idx] * 2.0f;
    }
}
// For stride=2:
// ncu metric: smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg = 16
// Only 50% of fetched bytes are used (2x bandwidth waste)
// For stride=32:
// ncu metric: smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg = 4
// Only 12.5% of fetched bytes are used (8x bandwidth waste)
// Kernel with random access (worst):
__global__ void random_load(float* data, float* output, int* indices, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        output[idx] = data[indices[idx]] * 2.0f;  // Random indices
    }
}
// ncu metric: smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg = 4-8
// Each thread fetches a full 32-byte sector but uses only 4 bytes
The single most impactful optimization for memory-bound kernels is improving coalescing. Going from 4 bytes/sector to 32 bytes/sector is an 8x improvement in effective memory bandwidth, which directly translates to an 8x speedup for a purely memory-bound kernel. Always check the bytes-per-sector metric first when profiling a memory-bound kernel.
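The bytes-per-sector figures quoted above can be reproduced with a small model: map each thread's 4-byte load in a warp to its 32-byte sector and divide useful bytes by unique sectors touched (a sketch, assuming one float per thread):

```python
def bytes_per_sector(stride, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Average useful bytes per 32-byte DRAM sector for a strided warp load."""
    addrs = [t * stride * elem_bytes for t in range(warp_size)]
    sectors = {a // sector_bytes for a in addrs}  # unique sectors fetched
    useful_bytes = warp_size * elem_bytes
    return useful_bytes / len(sectors)

print(bytes_per_sector(1))   # 32.0 -> perfectly coalesced
print(bytes_per_sector(2))   # 16.0 -> 2x bandwidth waste
print(bytes_per_sector(32))  # 4.0  -> 8x bandwidth waste
```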
L2 Cache Behavior and Optimization
L2 Cache Residency Control
The H100 has 50 MB of L2 cache. For LLM inference, if the weight matrix fits in L2, subsequent accesses are served from cache at much higher bandwidth:
// L2 cache residency hints (CUDA 11.x+, Ampere+)
// Persist data in L2 (higher priority for caching)
// Note: reserve a persisting carve-out first with
// cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size)
void set_l2_persistence(cudaStream_t stream, void* ptr, size_t size) {
    cudaStreamAttrValue stream_attr = {};
    stream_attr.accessPolicyWindow.base_ptr = ptr;
    stream_attr.accessPolicyWindow.num_bytes = size;
    stream_attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
    stream_attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
    stream_attr.accessPolicyWindow.hitRatio = 1.0f;  // 100% of accesses should persist
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                           &stream_attr);
}
// Streaming access (do not cache, save L2 for other data)
void set_l2_streaming(cudaStream_t stream, void* ptr, size_t size) {
    cudaStreamAttrValue stream_attr = {};
    stream_attr.accessPolicyWindow.base_ptr = ptr;
    stream_attr.accessPolicyWindow.num_bytes = size;
    stream_attr.accessPolicyWindow.hitProp = cudaAccessPropertyStreaming;
    stream_attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
    stream_attr.accessPolicyWindow.hitRatio = 0.0f;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                           &stream_attr);
}
// Application: LLM decode GEMV
// Weight matrix: 4096 x 4096 x 2 bytes = 32 MB
// H100 L2: 50 MB -> weight matrix fits!
// If we persist the weight in L2, repeated accesses (across batch)
// are served at L2 bandwidth (~12 TB/s) instead of HBM (3.35 TB/s)
// But: the weight changes every layer, so L2 persistence only helps
// if the same weight is accessed multiple times (e.g., batch>1)
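The payoff can be estimated with a simple weighted-bandwidth model (a sketch; the ~12 TB/s L2 figure is the estimate used above, not a published spec, and latency effects are ignored):

```python
HBM_TBS = 3.35
L2_TBS = 12.0  # approximate H100 L2 bandwidth (assumption from the text)

def effective_bandwidth(l2_hit_rate):
    """Blend L2 and HBM bandwidth by hit rate."""
    return l2_hit_rate * L2_TBS + (1.0 - l2_hit_rate) * HBM_TBS

# Decode with a 32 MB weight pinned in L2 and reused across a batch of 8:
# 1 compulsory miss + 7 hits -> hit rate 7/8
print(f"{effective_bandwidth(7 / 8):.2f} TB/s")  # vs 3.35 TB/s uncached
```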
L2 Cache Hit Rate Analysis
# Estimating L2 cache effectiveness for different workloads
def l2_analysis(weight_size_mb, activation_size_mb, l2_size_mb=50):
    """
    Estimate L2 hit rate for GEMM operations.
    """
    total_data = weight_size_mb + activation_size_mb
    if total_data <= l2_size_mb:
        # Everything fits in L2
        # First access: compulsory miss (HBM bandwidth)
        # Subsequent accesses: L2 hit (12 TB/s)
        return "full_cache", 1.0
    elif weight_size_mb <= l2_size_mb:
        # Weights fit, activations stream through
        # Weight reuse across batch dimension hits L2
        weight_hit_rate = min(1.0, l2_size_mb / weight_size_mb)
        return "weight_cached", weight_hit_rate
    else:
        # Nothing fits: streaming access
        # L2 only helps with spatial locality within tiles
        tile_hit_rate = min(1.0, l2_size_mb / total_data)
        return "streaming", tile_hit_rate
# Llama-7B linear projection (4096 x 4096, FP16):
# Weight: 32 MB, Activation (BS=1): 8 KB
# Total: 32 MB < 50 MB -> full cache!
# Effective bandwidth: HBM on first access, L2 on reuse
# Llama-70B linear projection (8192 x 8192, FP16):
# Weight: 128 MB >> 50 MB -> streaming
# L2 can only cache tiles, not the full weight
# Implication: for 7B models at BS=1, pinning weights in L2 can provide a
# 2-3x effective bandwidth improvement whenever the same weight matrix is
# read more than once -- e.g., across the rows of a batched activation, or
# across decode steps while the persistence window keeps it resident
Vectorized Memory Access
128-bit Loads for Maximum Bandwidth
Each L1/L2 cache line is 128 bytes, fetched from DRAM as four 32-byte sectors. With 128-bit (16-byte) vector loads, each thread covers 16 of the 128 bytes, so a full warp of 32 threads covers 32*16 = 512 bytes = 4 cache lines, perfectly aligned and fully utilized:
// Vectorized load: float4 = 128 bits = 16 bytes per load
__global__ void vectorized_elementwise(
    const float4* __restrict__ input,
    float4* __restrict__ output,
    int N_float4  // N/4
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N_float4) {
        float4 val = input[idx];
        // Process 4 floats at once
        val.x = val.x * 2.0f + 1.0f;
        val.y = val.y * 2.0f + 1.0f;
        val.z = val.z * 2.0f + 1.0f;
        val.w = val.w * 2.0f + 1.0f;
        output[idx] = val;
    }
}
// For FP16: use half2 (32-bit) or load as uint4 (128-bit = 8 FP16 values)
__global__ void vectorized_fp16(
    const uint4* __restrict__ input,  // 128-bit = 8 half values
    uint4* __restrict__ output,
    int N_uint4
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N_uint4) {
        uint4 raw = input[idx];
        half2* vals = reinterpret_cast<half2*>(&raw);
        // Process 4 half2 pairs (8 FP16 values)
        for (int i = 0; i < 4; i++) {
            vals[i] = __hmul2(vals[i], __float2half2_rn(2.0f));
        }
        output[idx] = raw;
    }
}
// Bandwidth comparison (H100, 1 billion FP16 elements):
// Scalar load (2 bytes per thread): achieved 2.1 TB/s (63% of peak)
// half2 load (4 bytes per thread): achieved 2.8 TB/s (84% of peak)
// uint4 load (16 bytes per thread): achieved 3.2 TB/s (96% of peak)
// Vectorized loads are essential for saturating HBM bandwidth
HBM Bandwidth Utilization by Access Pattern (H100, elementwise kernel)
[Figure: achieved bandwidth in TB/s by access pattern; peak = 3.35 TB/s]
The Optimization Strategy Decision Tree
Based on the roofline analysis, follow this decision tree:
def optimization_strategy(kernel_metrics):
    """
    Given Nsight Compute metrics, determine the optimization strategy.
    """
    ai = kernel_metrics["arithmetic_intensity"]
    ridge_point = kernel_metrics["peak_flops"] / kernel_metrics["peak_bandwidth"]
    if ai < ridge_point * 0.5:
        # Clearly memory-bound
        return memory_bound_strategy(kernel_metrics)
    elif ai > ridge_point * 1.5:
        # Clearly compute-bound
        return compute_bound_strategy(kernel_metrics)
    else:
        # Near ridge point: both matter (balanced_strategy not shown)
        return balanced_strategy(kernel_metrics)
def memory_bound_strategy(metrics):
    """Optimization priorities for memory-bound kernels."""
    optimizations = []
    # Priority 1: Check coalescing
    bytes_per_sector = metrics["bytes_per_sector_global_load"]
    if bytes_per_sector < 28:  # Less than 87.5% utilization
        optimizations.append("FIX COALESCING: current {:.0f}/32 bytes per sector".format(
            bytes_per_sector))
    # Priority 2: Check vectorization
    if not metrics["uses_vector_loads"]:
        optimizations.append("ADD VECTORIZED LOADS: use float4/uint4 for 128-bit access")
    # Priority 3: Check L2 hit rate
    l2_hit_rate = metrics["l2_hit_rate"]
    if l2_hit_rate < 0.3 and metrics["data_fits_l2"]:
        optimizations.append("IMPROVE L2 RESIDENCY: data fits in L2 but hit rate is low")
    # Priority 4: Reduce memory traffic
    optimizations.append("CONSIDER DATA COMPRESSION: quantize activations, use INT8/FP8")
    optimizations.append("CONSIDER KERNEL FUSION: eliminate intermediate writes")
    # Priority 5: Overlap
    optimizations.append("OVERLAP COMPUTE AND MEMORY: software pipelining in shared memory")
    return optimizations
def compute_bound_strategy(metrics):
    """Optimization priorities for compute-bound kernels."""
    optimizations = []
    # Priority 1: Tensor core utilization
    tc_utilization = metrics.get("tensor_core_utilization", 0)
    if tc_utilization < 0.8:
        optimizations.append("IMPROVE TENSOR CORE UTILIZATION: check tile alignment")
    # Priority 2: Occupancy
    occupancy = metrics["achieved_occupancy"]
    if occupancy < 0.5:
        optimizations.append("INCREASE OCCUPANCY: reduce register/shared memory pressure")
    # Priority 3: Instruction mix
    optimizations.append("REDUCE NON-MATH INSTRUCTIONS: minimize address computation, branches")
    return optimizations
Optimization Strategy by LLM Operation
| Operation | Typical AI | Bound | Primary Optimization |
|---|---|---|---|
| Decode GEMV (BS=1) | 1.0 | Memory | Vectorized loads, weight compression (INT4) |
| Decode GEMV (BS=32) | 30 | Memory | L2 caching, batch coalescing |
| Prefill GEMM (BS=512) | 410 | Compute | Tensor core tile tuning, occupancy |
| LayerNorm | 1.3 | Memory | Fuse with adjacent ops, vectorize |
| Softmax | 1.25 | Memory | Online softmax, fuse with attention |
| RoPE | 2.5 | Memory | Fuse into QKV projection epilogue |
| Flash Attention (long seq) | ~50-200 | Balanced | Tile size tuning, shared mem staging |
Worked Example: Optimizing a GEMV Kernel
Starting from a naive GEMV and applying the roofline-guided optimization strategy:
// Step 0: Naive GEMV - y[M] = W[M,K] * x[K]
__global__ void gemv_naive(
    const half* W, const half* x, half* y, int M, int K
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M) return;
    float sum = 0.0f;
    for (int k = 0; k < K; k++) {
        sum += __half2float(W[row * K + k]) * __half2float(x[k]);
    }
    y[row] = __float2half(sum);
}
// Problem: each thread reads one row of W sequentially
// Access pattern: coalesced along K (good for row-major W)
// But: no vectorization, no shared memory for x reuse
// Measured: 1.2 TB/s (36% of peak) for M=4096, K=4096
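For a memory-bound GEMV, the roofline gives a floor on runtime directly: total bytes moved divided by achieved bandwidth. A quick sanity check for the M=4096, K=4096 FP16 case:

```python
def gemv_min_time_us(M, K, dtype_bytes, bandwidth_tbs):
    """Lower-bound GEMV time in microseconds: bytes(W + x + y) / bandwidth."""
    total_bytes = (M * K + K + M) * dtype_bytes
    return total_bytes / (bandwidth_tbs * 1e12) * 1e6

# The 32 MB weight matrix dominates the traffic
print(f"{gemv_min_time_us(4096, 4096, 2, 3.35):.1f} us")  # at full HBM bandwidth
print(f"{gemv_min_time_us(4096, 4096, 2, 1.2):.1f} us")   # at the naive kernel's 1.2 TB/s
```

Closing the gap between those two numbers is the entire optimization budget; no amount of compute tuning moves either one.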
// Step 1: Vectorized loads (from roofline strategy priority 2)
__global__ void gemv_vectorized(
    const uint4* W, const half* x, half* y, int M, int K
) {
    int row = blockIdx.x;
    int tid = threadIdx.x;  // 256 threads per block
    int K_vec = K / 8;      // 8 FP16 per uint4
    float sum = 0.0f;
    for (int k = tid; k < K_vec; k += blockDim.x) {
        uint4 w_vec = W[row * K_vec + k];
        half2* w = reinterpret_cast<half2*>(&w_vec);
        // Load corresponding x values
        // x is small (4096 * 2 = 8 KB), fits in L1 cache after first access
        half2 x_vals[4];
        for (int i = 0; i < 4; i++) {
            x_vals[i] = *reinterpret_cast<const half2*>(&x[k * 8 + i * 2]);
        }
        // Dot product of 8 elements
        for (int i = 0; i < 4; i++) {
            half2 prod = __hmul2(w[i], x_vals[i]);
            sum += __half2float(prod.x) + __half2float(prod.y);
        }
    }
    // Warp reduction
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    // Block reduction
    __shared__ float warp_sums[8];  // 256/32 = 8 warps
    if (threadIdx.x % 32 == 0) {
        warp_sums[threadIdx.x / 32] = sum;
    }
    __syncthreads();
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < 8; i++) total += warp_sums[i];
        y[row] = __float2half(total);
    }
}
// Measured: 2.9 TB/s (87% of peak) for M=4096, K=4096
// Vectorization alone provided 2.4x speedup
// Step 2: Shared memory caching of x (priority 3: L2/L1 optimization)
// x is broadcast to all rows -> cache it in shared memory
__global__ void gemv_smem_x(
    const uint4* W, const half* x, half* y, int M, int K
) {
    extern __shared__ half x_shared[];
    int row = blockIdx.x;
    int tid = threadIdx.x;
    // Cooperatively load x into shared memory
    for (int i = tid; i < K; i += blockDim.x) {
        x_shared[i] = x[i];
    }
    __syncthreads();
    // Now access x from shared memory: latency is comparable to an L1 hit,
    // but residency is guaranteed (no risk of cache eviction)
    int K_vec = K / 8;
    float sum = 0.0f;
    for (int k = tid; k < K_vec; k += blockDim.x) {
        uint4 w_vec = W[row * K_vec + k];
        half2* w = reinterpret_cast<half2*>(&w_vec);
        half2* xs = reinterpret_cast<half2*>(&x_shared[k * 8]);
        for (int i = 0; i < 4; i++) {
            half2 prod = __hmul2(w[i], xs[i]);
            sum += __half2float(prod.x) + __half2float(prod.y);
        }
    }
    // ... reduction same as above ...
}
// Measured: 3.1 TB/s (93% of peak) for M=4096, K=4096
// Marginal improvement over vectorized (x was already L1-cached)
// But helps when x is larger or accessed with poor spatial locality
GEMV Optimization Steps (M=4096, K=4096, FP16, H100)
[Figure: effective bandwidth in TB/s for each optimization step]
Summary
The roofline model determines whether a kernel is compute-bound or memory-bound by comparing its arithmetic intensity to the hardware’s ridge point (peak FLOPS / peak bandwidth). For H100 FP16, the ridge point is 295 FLOPS/byte, meaning most LLM decode operations (AI of 1-30) are deeply memory-bound, while prefill GEMMs (AI of 200-500) are compute-bound. The optimization strategy follows directly: for memory-bound kernels, optimize coalescing, use vectorized 128-bit loads, improve L2 cache residency, and reduce data volume through quantization or fusion. For compute-bound kernels, maximize tensor core utilization, tune tile sizes for occupancy, and minimize non-math instructions. Nsight Compute provides the metrics (bytes-per-sector, L2 throughput, SM throughput) needed to diagnose the specific bottleneck and track optimization progress.