Part of Series: CUDA Kernel Engineering (32 of 32)

The Roofline Model for GPU Kernel Optimization: From First Principles to LLM Workload Analysis

Every GPU kernel you write or deploy sits somewhere on a two-dimensional spectrum: limited by how fast the hardware can compute, or limited by how fast it can move data. Most engineers optimize by instinct — trying loop unrolling when the kernel is memory-bound, or fiddling with memory access patterns when the kernel is compute-bound. The roofline model eliminates guesswork. It gives you a single diagram that tells you exactly which resource limits your kernel, how far you are from the hardware ceiling, and which class of optimizations can close the gap.

This post builds the roofline model from scratch, populates it with real GPU hardware numbers, places common LLM operations on the chart, walks through operational intensity calculations step by step, shows how to generate roofline plots with Nsight Compute, and lays out optimization strategies keyed to roofline position. It also covers the cases where the roofline model misleads you — because every model has blind spots.

What the Roofline Model Tells You

The core insight of the roofline model is that every kernel has a measurable property called operational intensity (also called arithmetic intensity), and the GPU has two hardware limits. The roofline diagram shows which limit constrains your kernel.

Operational intensity is the ratio of compute work to data movement:

Operational Intensity (OI) = FLOPs performed / Bytes transferred to/from DRAM

The units are FLOP/byte. A kernel that performs 100 FLOPs and reads 50 bytes from DRAM has an operational intensity of 2 FLOP/byte. A kernel that performs 100,000 FLOPs and reads 50 bytes has an OI of 2,000 FLOP/byte.

The GPU provides two ceilings:

  1. Peak compute throughput (π): the maximum number of floating-point operations per second the hardware can sustain. For example, the A100 can deliver 312 TFLOP/s at FP16 with tensor cores.

  2. Peak memory bandwidth (β): the maximum number of bytes per second the hardware can transfer between DRAM and the compute units. The A100 delivers about 2,039 GB/s from HBM2e.

The roofline model says that the maximum achievable performance for a kernel with operational intensity OI is:

P_max = min(π, OI × β)

This equation defines two regimes. When OI × β < π, the kernel is memory-bound — performance scales linearly with operational intensity because you are limited by how fast data arrives. When OI × β > π, the kernel is compute-bound — performance is flat at π because you are limited by how fast the ALUs can crunch numbers.

Why This Matters

If your kernel is memory-bound, no amount of instruction-level optimization will help. You need to reduce bytes moved or increase data reuse. If your kernel is compute-bound, optimizing memory access patterns is wasted effort. You need faster math (tensor cores, reduced precision, better instruction mix). The roofline model prevents you from wasting time on the wrong class of optimization.

The ridge point is the operational intensity where the two ceilings meet:

OI_ridge = π / β

Kernels with OI < OI_ridge are memory-bound. Kernels with OI > OI_ridge are compute-bound. The ridge point is a property of the hardware, not of your kernel.
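Both formulas are easy to sanity-check in code. Here is a minimal sketch in Python, using the A100 FP16 tensor-core numbers quoted above (312 TFLOP/s, 2,039 GB/s):

```python
# Roofline model: attainable performance and ridge point.
# Hardware constants are the A100 SXM FP16 tensor-core figures from the text.
PI = 312e12    # peak compute, FLOP/s
BETA = 2039e9  # peak HBM bandwidth, bytes/s

def attainable_flops(oi, pi=PI, beta=BETA):
    """Maximum achievable FLOP/s for a kernel with operational intensity oi (FLOP/byte)."""
    return min(pi, oi * beta)

def ridge_point(pi=PI, beta=BETA):
    """Operational intensity where the memory slope meets the compute ceiling."""
    return pi / beta

print(attainable_flops(1) / 1e12)     # 2.039 TFLOP/s: memory-bound
print(attainable_flops(1000) / 1e12)  # 312.0 TFLOP/s: compute-bound
print(round(ridge_point()))           # 153 FLOP/byte
```

Any kernel's ceiling on this GPU follows from these two constants alone; only the kernel's OI varies.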

Building a Roofline Diagram

A roofline diagram is a log-log plot with two axes:

  • X-axis: operational intensity (FLOP/byte), logarithmic scale
  • Y-axis: achievable performance (TFLOP/s or GFLOP/s), logarithmic scale

The diagram has two lines:

  1. Memory-bound slope: a diagonal line where P = β × OI. On a log-log plot, this is a straight line with slope 1. It starts from the lower left and rises to the right.

  2. Compute ceiling: a horizontal line at P = π. This is the maximum compute throughput regardless of how much data reuse you achieve.

These two lines form the “roofline” shape — a sloped line that hits a flat ceiling. Every kernel you profile becomes a dot on this chart. The vertical distance between your dot and the roofline is the gap between your achieved performance and the hardware limit. The horizontal position tells you which limit applies.

Constructing the Lines

For the memory-bound slope, pick a few OI values and compute P = β × OI:

  • At OI = 1 FLOP/byte: P = 2,039 × 10^9 × 1 = 2.039 TFLOP/s (for A100)
  • At OI = 10 FLOP/byte: P = 20.39 TFLOP/s
  • At OI = 100 FLOP/byte: P = 203.9 TFLOP/s

The compute ceiling is a horizontal line at π = 312 TFLOP/s (A100 FP16 tensor core).

The ridge point is where these meet:

OI_ridge = (312 × 10^12) / (2,039 × 10^9) ≈ 153 FLOP/byte

On the log-log plot, the two lines intersect at (153, 312). Everything to the left of OI = 153 is in the memory-bound region; everything to the right is compute-bound.

A100 Roofline Diagram (FP16 Tensor Core)

| OI (FLOP/byte) | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 153 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Memory ceiling, capped at π (TFLOP/s) | 1 | 2 | 4.1 | 8.2 | 16.3 | 32.6 | 65.2 | 130.5 | 261 | 312 | 312 | 312 |
| Compute ceiling (TFLOP/s) | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 |

The blue line represents the memory bandwidth ceiling: performance grows linearly with operational intensity because the bottleneck is data delivery. The red line is the flat compute ceiling. The roofline is the minimum of both — the blue line until it meets the red line, then the red line from there on.

Adding Sub-Ceilings

The basic roofline has two lines, but real hardware has additional constraints that create sub-ceilings below the main roofline:

  • L2 cache bandwidth ceiling: if your kernel’s working set fits in L2, the effective bandwidth is higher (often 4-8x HBM bandwidth), shifting the memory slope upward.
  • Shared memory bandwidth ceiling: even higher bandwidth from SRAM, but limited capacity.
  • Non-tensor-core compute ceiling: if your kernel cannot use tensor cores (e.g., transcendental functions, integer ops), the compute ceiling drops significantly.
  • FP32 vs FP16 ceilings: different data types have different peak throughputs.

Each sub-ceiling adds another line to the diagram, creating a “staircase” of rooflines. Your kernel is bounded by the lowest applicable ceiling.

Let us build roofline parameters for the GPUs that dominate LLM serving today. I will show both theoretical peak and practical numbers (approximately 80% of peak compute and 88% of peak bandwidth), because no kernel achieves 100% of theoretical throughput.

GPU Roofline Parameters (FP16 Tensor Core)

| GPU | Peak Compute (TFLOP/s) | Practical Compute (~80%) | Peak BW (GB/s) | Practical BW (~88%) | Ridge Point (Theoretical) | Ridge Point (Practical) |
|---|---|---|---|---|---|---|
| V100 (SXM2) | 125 | 100 | 900 | 790 | 139 FLOP/byte | 127 FLOP/byte |
| A100 (SXM) | 312 | 250 | 2,039 | 1,794 | 153 FLOP/byte | 139 FLOP/byte |
| H100 (SXM) | 990 | 792 | 3,350 | 2,948 | 296 FLOP/byte | 269 FLOP/byte |
| H200 (SXM) | 990 | 792 | 4,800 | 4,224 | 206 FLOP/byte | 188 FLOP/byte |
Note: Practical values assume ~80% compute efficiency and ~88% bandwidth efficiency, typical of well-optimized kernels. H200 shares H100's compute but has higher-bandwidth HBM3e.
💡 The Trend Across Generations

Each GPU generation increases compute throughput faster than memory bandwidth. The V100 ridge point is 139 FLOP/byte; the H100 ridge point is 296 FLOP/byte. This means more operations become memory-bound on newer hardware. An operation that was compute-bound on V100 may be memory-bound on H100. This trend makes bandwidth optimization increasingly critical.
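The trend is easy to verify from the table's peak numbers: a quick Python check of π / β per generation:

```python
# Ridge point = peak compute / peak bandwidth, per GPU generation.
# Peak numbers are the FP16 tensor-core / HBM figures from the table above.
gpus = {
    "V100": (125e12, 900e9),
    "A100": (312e12, 2039e9),
    "H100": (990e12, 3350e9),
    "H200": (990e12, 4800e9),
}

ridges = {name: pi / beta for name, (pi, beta) in gpus.items()}
for name, ridge in ridges.items():
    print(f"{name}: ridge ≈ {ridge:.0f} FLOP/byte")  # 139, 153, 296, 206
```

Note how H200 moves the ridge back left relative to H100: same compute, more bandwidth, so more kernels become compute-bound again.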

Why Practical Efficiency Matters

Theoretical peak numbers assume perfect utilization of every cycle, every functional unit, every memory channel. In practice:

  • Compute efficiency of 80% is good for large GEMMs using tensor cores. Smaller kernels, non-GEMM operations, and kernels with divergence typically achieve 40-70%.
  • Bandwidth efficiency of 85-91% is achievable with coalesced access patterns. Strided or random access can drop to 20-50%.

The practical roofline is the one you should use for target-setting. If your kernel achieves 75% of the practical ceiling, that is generally excellent. If it achieves 30%, there is significant room for optimization.

Memory Bandwidth Across Platforms

| Platform | Memory Type | Theoretical BW | Achieved BW (STREAM) | Efficiency |
|---|---|---|---|---|
| Intel Xeon 8280 (2S) | DDR4-2933 | 281 GB/s | ~210 GB/s | 75% |
| AMD EPYC 7742 (2S) | DDR4-3200 | 410 GB/s | ~340 GB/s | 83% |
| NVIDIA V100 | HBM2 | 900 GB/s | ~820 GB/s | 91% |
| NVIDIA A100 | HBM2e | 2,039 GB/s | ~1,800 GB/s | 88% |
| NVIDIA H100 (SXM) | HBM3 | 3,350 GB/s | ~2,900 GB/s | 87% |
Note: Achieved BW measured with STREAM triad (CPU) or equivalent bandwidth test (GPU). HBM consistently achieves 85-91% of theoretical.

The GPU advantage over CPUs is stark: HBM delivers 5-15x more bandwidth than DDR. This is the primary reason GPUs dominate AI workloads — not raw compute (where CPUs have narrowed the gap with AVX-512 and AMX), but memory bandwidth.

LLM Operations on the Roofline

Large language model inference involves a handful of distinct operation types, each with very different operational intensities. Placing them on the roofline reveals why LLM serving is so challenging to optimize.

Decode Attention (Autoregressive Token Generation)

During decode, each new token attends to all previous tokens. For a single query token attending to S key-value pairs with head dimension d:

  • FLOPs: approximately 4 × S × d (two matmuls, Q × K^T and attn × V, each 2Sd FLOPs)
  • Bytes: you must load the entire KV cache: 2 × S × d × dtype_size (K and V matrices)
  • Operational intensity: 4Sd / (2Sd × dtype_size) = 2 / dtype_size

For FP16 (2 bytes per element): OI = 2/2 = 1 FLOP/byte. For an INT8 KV cache: OI = 2/1 = 2 FLOP/byte.

This is extremely low — deep in the memory-bound zone on every GPU. Decode attention is fundamentally a memory-bandwidth problem. The kernel spends almost all its time loading KV cache from HBM, with minimal compute per byte loaded.

With batching, you amortize the KV cache reads across B queries, provided the queries share a KV cache (as with the query heads of GQA/MQA or beam-search candidates; independent requests each bring their own KV cache). The FLOPs scale as B × 4Sd while the bytes stay roughly 2Sd × dtype_size, so OI ≈ 2B / dtype_size. At batch size 32 with FP16: OI ≈ 32 FLOP/byte. Still memory-bound on H100 (ridge at 296), but much better than batch = 1.
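The decode OI formula collapses to something almost trivial in code; a small Python sketch (the sequence length and head dimension are illustrative and cancel out of the ratio):

```python
# Decode attention OI: FLOPs ≈ B * 4*S*d, bytes ≈ 2*S*d*dtype_size
# (assumes the B queries share one KV cache, as discussed in the text).
def decode_attention_oi(batch=1, seq_len=4096, head_dim=128, dtype_size=2):
    flops = batch * 4 * seq_len * head_dim          # Q @ K^T plus attn @ V
    kv_bytes = 2 * seq_len * head_dim * dtype_size  # K and V, loaded once
    return flops / kv_bytes

print(decode_attention_oi(batch=1))                # 1.0 FLOP/byte (FP16)
print(decode_attention_oi(batch=32))               # 32.0
print(decode_attention_oi(batch=1, dtype_size=1))  # 2.0 (INT8 KV cache)
```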

Prefill Attention (Processing the Input Prompt)

During prefill, you process N query tokens against N key-value pairs simultaneously. The attention computation involves:

  • FLOPs: approximately 4 × N² × d per head
  • Bytes: 3 × N × d × dtype_size (loading Q, K, V matrices) plus output writes

The operational intensity scales linearly with sequence length N:

OI ≈ 4N²d / (4Nd × dtype_size) = N / dtype_size

For N = 2048 at FP16: OI ≈ 1024 FLOP/byte — well into compute-bound territory. For N = 128 at FP16: OI ≈ 64 FLOP/byte — near the ridge point on A100, still memory-bound on H100.
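The same linear-in-N relationship, as a Python helper (shapes illustrative; the head dimension cancels):

```python
# Prefill attention OI ≈ N / dtype_size: FLOPs ≈ 4*N^2*d,
# bytes ≈ 4*N*d*dtype_size (Q, K, V loads plus the output write).
def prefill_attention_oi(seq_len, head_dim=128, dtype_size=2):
    flops = 4 * seq_len ** 2 * head_dim
    bytes_moved = 4 * seq_len * head_dim * dtype_size
    return flops / bytes_moved

print(prefill_attention_oi(2048))  # 1024.0: compute-bound on A100 and H100
print(prefill_attention_oi(128))   # 64.0: still memory-bound
```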

This explains why FlashAttention’s primary benefit for prefill is reducing memory footprint (avoiding materializing the N × N attention matrix), while for short sequences the bandwidth savings from tiling into SRAM dominate.

Large GEMM (Feed-Forward Network Layers)

The FFN in a transformer typically involves two large matrix multiplications. For a weight matrix of shape (K, N) applied to an activation matrix of shape (M, K):

  • FLOPs: 2MNK
  • Bytes: (MK + KN + MN) × dtype_size

For a typical LLM FFN with K = 4096, N = 11008 (Llama-2 7B intermediate size), batch M = 256, FP16:

  • FLOPs: 2 × 256 × 11008 × 4096 = 23.1 × 10^9
  • Bytes: (256 × 4096 + 4096 × 11008 + 256 × 11008) × 2 = 97.9 × 10^6
  • OI: 23.1 × 10^9 / 97.9 × 10^6 ≈ 236 FLOP/byte

This is above the ridge point on A100 (153) and solidly compute-bound on V100 (ridge 139). On H100 (ridge 296), it is still slightly memory-bound. Increasing batch size pushes the OI higher because the weight matrix bytes are amortized.
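The arithmetic above generalizes into a small GEMM OI helper; a Python sketch:

```python
# GEMM operational intensity: OI = 2MNK / ((MK + KN + MN) * dtype_size).
def gemm_oi(m, n, k, dtype_size=2):
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * dtype_size
    return flops, bytes_moved, flops / bytes_moved

# Llama-2 7B FFN shape at batch 256, FP16
flops, bytes_moved, oi = gemm_oi(m=256, n=11008, k=4096)
print(f"{flops / 1e9:.1f} GFLOP, {bytes_moved / 1e6:.1f} MB moved, OI ≈ {oi:.0f}")
# 23.1 GFLOP, 97.9 MB moved, OI ≈ 236
```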

LayerNorm and Softmax

These element-wise or reduction operations have very low operational intensity:

  • LayerNorm: reads N elements, computes mean and variance (about 5N FLOPs), writes N elements. OI ≈ 5N / (2N × dtype_size) = 2.5 / dtype_size ≈ 1.25 FLOP/byte at FP16.
  • Softmax: reads N elements, computes max, exponentiation, sum, and division (about 5N FLOPs), writes N elements. Similar OI of approximately 1-2 FLOP/byte.

These are solidly memory-bound on every GPU. The optimization strategy is to fuse them into adjacent compute-bound kernels (like fusing LayerNorm into the subsequent GEMM) so they never launch as standalone memory-bound kernels.

LLM Operations on the Roofline

| Operation | Typical OI (FLOP/byte) | Region on A100 | Region on H100 | Primary Bottleneck |
|---|---|---|---|---|
| Decode attention (B=1) | 1-2 | Deep memory-bound | Deep memory-bound | KV cache loading |
| Decode attention (B=32) | 16-32 | Memory-bound | Memory-bound | KV cache loading |
| Prefill attention (N=2048) | 500-1000 | Compute-bound | Compute-bound | Matmul throughput |
| Prefill attention (N=128) | 32-64 | Memory-bound | Memory-bound | Data loading |
| FFN GEMM (B=256) | 100-250 | Near ridge / compute | Memory-bound | Depends on shape |
| FFN GEMM (B=1) | 1-2 | Deep memory-bound | Deep memory-bound | Weight loading |
| LayerNorm | 1-2 | Deep memory-bound | Deep memory-bound | Elementwise I/O |
| Softmax | 1-2 | Deep memory-bound | Deep memory-bound | Elementwise I/O |
| RoPE embedding | 2-4 | Memory-bound | Memory-bound | Elementwise I/O |
Note: OI values assume FP16 data types. Batch size B and sequence length N dramatically affect operational intensity.

Operational Intensity Spectrum of LLM Operations

| Operation | OI (FLOP/byte) | Region |
|---|---|---|
| LayerNorm | 1.25 | Deep memory-bound |
| Softmax | 1.5 | Deep memory-bound |
| Decode attn (B=1) | 1 | Deep memory-bound |
| Decode attn (B=32) | 32 | Memory-bound |
| Prefill attn (N=128) | 64 | Near ridge (A100) |
| FFN GEMM (B=256) | 236 | Near ridge (H100) |
| Prefill attn (N=2048) | 1,024 | Compute-bound |
| Large GEMM (4096×4096) | 1,365 | Compute-bound |

The picture is clear: most LLM serving time is spent on memory-bound operations, especially during the decode (token generation) phase. This is the fundamental reason why LLM inference optimization focuses so heavily on reducing memory traffic — quantization, KV cache compression, operator fusion, and batching.

How to Calculate Operational Intensity

Let us walk through the calculation in detail for the most important operation in deep learning: matrix multiplication.

Matrix Multiply: C = A × B

Given A of shape (M, K) and B of shape (K, N), producing C of shape (M, N):

FLOPs: Each element of C requires K multiplications and K − 1 additions, which we approximate as 2K FLOPs. There are M × N output elements. Total: 2MNK FLOPs.

Bytes: We must load A (M × K elements), load B (K × N elements), and write C (M × N elements). Total bytes:

Bytes = (MK + KN + MN) × dtype_size

Operational intensity:

OI = 2MNK / ((MK + KN + MN) × dtype_size)

How Shape Affects Intensity

The critical insight is that operational intensity depends on the matrix dimensions. Let us examine several cases:

Square matrices (M = N = K = n):

OI = 2n³ / (3n² × dtype_size) = 2n / (3 × dtype_size)

At FP16 (dtype_size = 2): OI = n/3. For n = 4096: OI = 1365 FLOP/byte. Solidly compute-bound on any GPU.

Tall-skinny matrix times fat matrix (batch = 1 inference: M = 1, weight matrix K × N):

OI = 2KN / ((K + KN + N) × dtype_size) ≈ 2KN / (KN × dtype_size) = 2 / dtype_size

At FP16: OI ≈ 1 FLOP/byte. This is the worst case — you load the entire weight matrix to produce a single output vector. Catastrophically memory-bound.

How batch size rescues intensity (M = B, large weight matrix K × N):

OI = 2BKN / ((BK + KN + BN) × dtype_size)

When B ≪ K and B ≪ N (typical for inference), the KN term dominates the denominator:

OI ≈ 2BKN / (KN × dtype_size) = 2B / dtype_size

At FP16: OI ≈ B. So batch size 1 gives OI of 1, batch size 32 gives OI of 32, batch size 256 gives OI of 256. This linear relationship between batch size and operational intensity is why batching is the single most effective optimization for inference throughput.

The Batch Size Sweet Spot

On A100 (ridge ~153 FLOP/byte FP16), you need batch size of roughly 153 to reach the ridge point for weight-dominated GEMMs. On H100 (ridge ~296), you need batch ~296. Below these thresholds, your GEMMs are memory-bound and you are paying for compute you cannot use. This is why continuous batching and dynamic batching are essential for efficient LLM serving.
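Inverting OI ≈ 2B / dtype_size gives the threshold batch size directly; a Python sketch:

```python
# Minimum batch for a weight-dominated GEMM (OI ≈ 2B / dtype_size)
# to reach a given ridge point: B_min = ridge * dtype_size / 2.
def min_batch_for_ridge(ridge, dtype_size=2):
    return ridge * dtype_size / 2

print(min_batch_for_ridge(153))                # 153.0 (A100, FP16)
print(min_batch_for_ridge(296))                # 296.0 (H100, FP16)
print(min_batch_for_ridge(296, dtype_size=1))  # 148.0 (H100, INT8 weights)
```

Note the last line: quantizing weights to INT8 halves the batch size needed to reach the ridge, which is one way to see why quantization and batching compose.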

Worked Example: Llama-2 7B FFN Layer

The Llama-2 7B FFN uses a gated architecture with three weight matrices:

  • Gate projection: (4096, 11008)
  • Up projection: (4096, 11008)
  • Down projection: (11008, 4096)

For a batch of B = 64 tokens at FP16:

Gate projection (M = 64, K = 4096, N = 11008):

  • FLOPs: 2 × 64 × 4096 × 11008 = 5.77 × 10^9
  • Bytes: (64 × 4096 + 4096 × 11008 + 64 × 11008) × 2 = 92.1 × 10^6
  • OI: 5.77 × 10^9 / 92.1 × 10^6 ≈ 62.7 FLOP/byte

At OI ≈ 63, this operation is memory-bound on both A100 (ridge 153) and H100 (ridge 296). The weight matrix (4096 × 11008 × 2 bytes = 90.2 MB) dominates the data transfer, and we only get 64 uses out of it.

To make this compute-bound on A100, we need B ≥ 153. On H100, we need B ≥ 296.
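The worked example, replayed in Python so the arithmetic is checkable:

```python
# Worked example: Llama-2 7B gate projection at batch 64, FP16.
M, K, N = 64, 4096, 11008
flops = 2 * M * K * N
bytes_moved = (M * K + K * N + M * N) * 2  # FP16 = 2 bytes/element
weight_mb = K * N * 2 / 1e6

print(f"FLOPs:   {flops / 1e9:.2f}e9")        # 5.77e9
print(f"Bytes:   {bytes_moved / 1e6:.1f}e6")  # 92.1e6
print(f"OI:      {flops / bytes_moved:.1f}")  # 62.7 FLOP/byte
print(f"Weights: {weight_mb:.1f} MB")         # 90.2 MB
```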

Using Nsight Compute Roofline

NVIDIA’s Nsight Compute (ncu) includes a built-in roofline analysis mode that automates much of what we have discussed. Here is how to use it effectively.

Generating the Roofline Plot

To collect roofline data for a specific kernel:

# Collect roofline metrics for all kernels
ncu --set roofline -o profile_output ./my_application

# Collect for a specific kernel by name
ncu --set roofline --kernel-name "my_kernel_name" -o profile_output ./my_application

# Collect with additional metrics for deeper analysis
ncu --set roofline --metrics \
  dram__bytes_read.sum,dram__bytes_write.sum,\
  sm__sass_thread_inst_executed_op_fadd_pred_on.sum,\
  sm__sass_thread_inst_executed_op_fmul_pred_on.sum,\
  sm__sass_thread_inst_executed_op_ffma_pred_on.sum \
  -o profile_output ./my_application

Open the resulting .ncu-rep file in Nsight Compute’s GUI. Navigate to the “Roofline” section of any kernel’s detailed view. You will see your kernel plotted as a dot on the roofline chart.

Reading the Roofline Plot

Nsight Compute’s roofline plot shows:

  1. Multiple ceilings: separate lines for FP64, FP32, FP16, INT8 compute, plus lines for HBM, L2, and L1/shared memory bandwidth.
  2. Your kernel as a dot: the horizontal position is the measured operational intensity, and the vertical position is the measured achieved performance.
  3. Distance to ceiling: the vertical gap between your dot and the nearest ceiling line.
Interpreting Nsight Compute Roofline Position

| Dot Position | Interpretation | Action |
|---|---|---|
| Close to memory ceiling, far left | Memory-bound, near peak BW utilization | Good BW usage; reduce bytes or increase reuse |
| Below memory ceiling, far left | Memory-bound, poor BW utilization | Fix access patterns: coalescing, alignment, caching |
| Close to compute ceiling, far right | Compute-bound, near peak compute utilization | Good; consider tensor cores if not already used |
| Below compute ceiling, far right | Compute-bound, poor compute utilization | Improve ILP, use tensor cores, reduce divergence |
| Below both ceilings, moderate OI | Latency-bound or under-occupied | Fix occupancy, reduce stalls, improve launch config |

Key Metrics from Nsight Compute

Beyond the roofline plot, these metrics help you understand your kernel’s position:

# Memory metrics
dram__bytes_read.sum          -- Total bytes read from HBM
dram__bytes_write.sum         -- Total bytes written to HBM
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum  -- Global load bytes through L1
lts__t_bytes.sum              -- L2 cache total bytes

# Compute metrics
sm__inst_executed_pipe_tensor.sum   -- Tensor core instructions
sm__sass_thread_inst_executed.sum   -- Total SASS instructions
sm__inst_executed_pipe_fma.sum      -- FMA pipe instructions

# Utilization
sm__warps_active.avg_pct_of_peak_sustained_active  -- Occupancy
gpu__time_duration.sum        -- Kernel duration

Calculating achieved metrics from counters:

Achieved BW = (dram_bytes_read + dram_bytes_write) / kernel_duration

Achieved FLOP/s = total_FLOPs / kernel_duration

OI = total_FLOPs / (dram_bytes_read + dram_bytes_write)
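In script form, with made-up counter values purely for illustration (substitute the sums from your own ncu report):

```python
# Roofline coordinates from profiler counters. The counter values below
# are invented for illustration, not from a real profile.
dram_bytes_read = 6.0e9    # dram__bytes_read.sum
dram_bytes_write = 2.0e9   # dram__bytes_write.sum
total_flops = 1.0e12       # from FADD/FMUL/FFMA counters (FFMA = 2 FLOPs)
kernel_seconds = 5.0e-3    # gpu__time_duration.sum

bytes_moved = dram_bytes_read + dram_bytes_write
achieved_bw = bytes_moved / kernel_seconds     # 1.6e12 B/s = 1,600 GB/s
achieved_flops = total_flops / kernel_seconds  # 2e14 FLOP/s = 200 TFLOP/s
oi = total_flops / bytes_moved                 # 125 FLOP/byte

print(f"BW {achieved_bw / 1e9:.0f} GB/s, {achieved_flops / 1e12:.0f} TFLOP/s, OI {oi:.0f}")
```

With these numbers on an A100 (ridge ≈ 153), the kernel sits just left of the ridge, running near peak bandwidth.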

What It Means When a Kernel Is Far from the Ceiling

If your kernel’s dot is far below the applicable ceiling, the gap represents inefficiency that can be recovered. The causes depend on which ceiling applies:

Far below memory ceiling (memory-bound kernel with poor bandwidth utilization):

  • Non-coalesced memory accesses: threads in a warp access non-contiguous addresses, wasting bus transactions
  • Partition camping: uneven distribution of accesses across memory channels
  • Excessive L2 cache misses from random access patterns
  • Small transfers that do not saturate the memory bus

Far below compute ceiling (compute-bound kernel with poor compute utilization):

  • Not using tensor cores when the operation supports them
  • High register pressure causing low occupancy and pipeline stalls
  • Branch divergence within warps
  • Integer or special function unit bottlenecks not captured by FLOP counting
  • Memory latency stalls that are hidden by the OI calculation but not by actual execution

Far below both ceilings (the worst case):

  • The kernel is latency-bound, not throughput-bound
  • Insufficient parallelism (too few warps, too small grid)
  • Synchronization barriers causing idle time
  • Kernel launch overhead dominating short kernels

Optimization Strategy Based on Roofline Position

The roofline model does not just diagnose — it prescribes. Each position on the diagram maps to a specific class of optimizations.

Below the Memory Ceiling: Improve Memory Access Patterns

Your kernel is memory-bound but not achieving peak bandwidth. The goal is to move the dot upward toward the memory ceiling.

Coalescing: Ensure threads in a warp access contiguous 128-byte aligned memory segments. A single non-coalesced load can cost 32x the bandwidth of a coalesced one.

// Bad: strided access, each thread reads from a different cache line
float val = data[threadIdx.x * stride];

// Good: coalesced access, adjacent threads read adjacent addresses
float val = data[blockIdx.x * blockDim.x + threadIdx.x];

L2 cache residency control (A100+): Use cudaAccessPropertyPersisting to pin frequently-accessed data in L2, effectively increasing bandwidth for that data.

cudaStreamAttrValue attr;
attr.accessPolicyWindow.base_ptr = kv_cache_ptr;
attr.accessPolicyWindow.num_bytes = cache_size;
attr.accessPolicyWindow.hitRatio = 1.0f;
attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

Vectorized loads: Use float4 or int4 loads to read 16 bytes per thread per transaction instead of 4.

// 4x bandwidth utilization per instruction
float4 vec = reinterpret_cast<float4*>(data)[idx];

Prefetching: Use __pipeline_memcpy_async or cp.async to overlap data movement with compute.

On the Memory Ceiling: Reduce Bytes or Increase Reuse

Your kernel achieves good bandwidth utilization but is limited by the total bytes it must move. The goal is to move the dot to the right (increase OI) or reduce total bytes.

Operator fusion: Combine multiple memory-bound kernels into one. If LayerNorm, bias-add, and activation are three separate kernels, each reads and writes the full tensor. Fusing them into one kernel reduces memory traffic by 3x.
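The 3x figure follows from simple traffic accounting; a back-of-envelope sketch (assuming each standalone kernel reads and writes the full tensor exactly once):

```python
# DRAM traffic: k separate elementwise kernels each read + write the full
# tensor; one fused kernel reads once and writes once.
def elementwise_traffic(n_elements, n_kernels, dtype_size=2):
    return n_kernels * 2 * n_elements * dtype_size  # read + write per kernel

n = 64 * 1024 * 1024                  # a 64M-element FP16 activation tensor
separate = elementwise_traffic(n, 3)  # LayerNorm, bias-add, activation
fused = elementwise_traffic(n, 1)
print(separate // fused)              # 3: fusion cuts traffic 3x
```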

Quantization: Reduce from FP16 to INT8 or INT4. This halves or quarters the bytes transferred, directly increasing OI by 2-4x.

Shared memory tiling: Load data into shared memory (SRAM) once, then reuse it multiple times from fast on-chip storage.

KV cache compression: For decode attention, compressing the KV cache (quantization, eviction of unimportant tokens, sliding window) directly reduces the dominant memory cost.

Bandwidth Optimization Techniques

| Technique | Bytes Reduction | OI Improvement | Applicable Operations |
|---|---|---|---|
| FP16 to INT8 quantization | 2x | 2x | Weight loading, KV cache |
| FP16 to INT4 quantization | 4x | 4x | Weight loading |
| Operator fusion (3 ops) | ~3x | ~3x | LayerNorm + bias + activation |
| Shared memory tiling | 10-50x reuse | 10-50x | GEMM, convolution |
| Continuous batching | B x amortization | ~B x | All weight-dominated ops |
| KV cache sliding window | window/total reduction | Proportional | Decode attention |

Below the Compute Ceiling: Improve Instruction Throughput

Your kernel is compute-bound but not achieving peak FLOP/s. The goal is to move the dot upward toward the compute ceiling.

Use tensor cores: The gap between tensor core throughput and CUDA core throughput is enormous (8-16x on modern GPUs). If your operation can be expressed as matrix multiply-accumulate, use wmma or mma intrinsics, or rely on cuBLAS/CUTLASS.

// Using wmma for a 16x16x16 FP16 matrix multiply (requires sm_70+ and #include <mma.h>)
using namespace nvcuda;

wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

wmma::fill_fragment(c_frag, 0.0f);
wmma::load_matrix_sync(a_frag, a_ptr, lda);
wmma::load_matrix_sync(b_frag, b_ptr, ldb);
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
wmma::store_matrix_sync(c_ptr, c_frag, ldc, wmma::mem_row_major);

Increase instruction-level parallelism (ILP): Unroll loops so the compiler can overlap independent instructions. Each thread should have multiple independent operations in flight.

Reduce branch divergence: Ensure threads within a warp take the same execution path. Divergent branches serialize execution, halving (or worse) throughput.

Epilogue fusion: Fuse pointwise operations (bias add, activation, residual add) into the GEMM epilogue so they execute while the tensor cores are finishing the last tile, hiding their cost entirely.

Near Both Ceilings: The Kernel Is Well-Optimized

If your kernel’s dot is close to the roofline (within 80% of the ceiling), the kernel is well-optimized for the current algorithm. Further gains require algorithmic changes:

  • Change the mathematical formulation (e.g., FlashAttention replaces standard attention with a tiled, fused algorithm)
  • Use a different precision (FP8 instead of FP16)
  • Use a different hardware feature (sparse tensor cores for 2:4 structured sparsity)
  • Redesign the data layout for the next kernel in the pipeline
Optimization Decision Matrix

| Roofline Position | Symptom (Nsight Compute) | Primary Optimization | Secondary Optimization |
|---|---|---|---|
| Below memory ceiling | High DRAM BW %, achieved BW far from peak | Fix coalescing and alignment | Vectorize loads (float4) |
| On memory ceiling | DRAM BW near peak, low compute % | Fuse operators to reduce traffic | Quantize to lower precision |
| Below compute ceiling | High SM %, achieved FLOP/s far from peak | Enable tensor cores | Increase ILP, reduce divergence |
| On compute ceiling | SM % and FLOP/s near peak | Algorithmic change required | Try lower precision (FP8) |
| Below both ceilings | Both utilizations low | Fix occupancy and launch config | Check for sync stalls |

Multi-Level Memory and the Extended Roofline

The basic two-line roofline model considers only HBM bandwidth. But modern GPUs have a deep memory hierarchy, and understanding which level your kernel actually operates at can change the analysis.

GPU Memory Hierarchy Bandwidth (A100)

| Level | Approximate Bandwidth |
|---|---|
| HBM2e | 2,039 GB/s (~2 TB/s) |
| L2 cache (40 MB) | ~6,000 GB/s |
| Shared memory / L1 | ~19,000 GB/s (aggregate across SMs) |
| Register file | ~80,000 GB/s (aggregate) |

Each memory level has a different bandwidth ceiling, creating a “staircase roofline.” If your kernel’s working set fits in L2 cache, the effective bandwidth ceiling is 3-4x higher than HBM, meaning the ridge point shifts right and your kernel may actually be more compute-bound than the HBM-only analysis suggests.

This matters for operations like:

  • Small matrix multiplies: if the weights fit in L2, an OI estimate that charges every weight access to HBM undercounts the effective intensity
  • Fused kernels: intermediate results stay in shared memory, never touching HBM
  • Repeated access to the same data: L2 hit rates of 80%+ effectively multiply bandwidth

The extended roofline approach is to measure bytes at each level separately and determine which level is the true bottleneck. Nsight Compute provides this breakdown through L1, L2, and DRAM traffic counters.
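The staircase can be made concrete by computing a ridge point per memory level. A sketch using the A100 numbers above (the function name and the assumed ~312 TFLOP/s FP16 tensor peak are illustrative):

```python
def ridge_point(peak_flops, bandwidth):
    """OI (FLOP/byte) at which a bandwidth ceiling meets the compute ceiling."""
    return peak_flops / bandwidth

PEAK_FP16 = 312e12  # assumed A100 FP16 tensor-core peak, FLOP/s
BANDWIDTH = {"HBM2e": 2039e9, "L2": 6000e9, "SMEM/L1": 19000e9}

for level, bw in BANDWIDTH.items():
    print(f"{level}: ridge at {ridge_point(PEAK_FP16, bw):.0f} FLOP/byte")

# A kernel at OI ~60 is memory-bound against HBM (ridge ~153) but
# compute-bound when its working set lives in L2 (ridge ~52).
```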

HBM Traffic Reduction Through Memory Hierarchy

| Optimization stage | HBM bandwidth required |
| --- | --- |
| Naive (all from HBM) | 2,000 GB/s (exceeds HBM BW) |
| + L2 cache hits (40%) | 1,200 GB/s |
| + SMEM tiling | 400 GB/s |
| + Operator fusion | 200 GB/s (within HBM budget) |
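The staircase is cumulative: each technique removes a fraction of the HBM traffic the previous stage still needed. A sketch of the arithmetic, where the reuse factors are assumptions chosen to match the figures above:

```python
naive_hbm = 2000.0                   # GB/s demanded if every byte came from HBM

after_l2 = naive_hbm * (1 - 0.40)    # 40% L2 hit rate removes 40% of traffic
after_smem = after_l2 / 3            # assumed ~3x reuse from shared-memory tiling
after_fusion = after_smem / 2        # assumed fusion halves intermediate traffic

print(after_l2, after_smem, after_fusion)  # 1200.0 400.0 200.0
```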

A Complete Roofline Workflow

Here is the step-by-step workflow for using the roofline model to optimize any GPU kernel:

Step 1: Profile the kernel

ncu --set roofline --metrics \
  dram__bytes_read.sum,dram__bytes_write.sum,\
  gpu__time_duration.sum,\
  sm__sass_thread_inst_executed_op_ffma_pred_on.sum \
  --kernel-name "target_kernel" \
  -o roofline_profile ./my_app

Step 2: Compute operational intensity from counters

Do not estimate OI from the algorithm — measure it from hardware counters. The actual bytes moved may differ from the theoretical minimum due to cache behavior, alignment, and redundant loads.

measured_bytes = dram__bytes_read.sum + dram__bytes_write.sum
measured_flops = (from instruction counters or known algorithm)
OI = measured_flops / measured_bytes

Step 3: Compute achieved performance

kernel_time = gpu__time_duration.sum
achieved_flops = measured_flops / kernel_time
achieved_bw = measured_bytes / kernel_time
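Steps 2 and 3 amount to three divisions over the counters from Step 1; a sketch that packages them (the counter values below are hypothetical):

```python
def roofline_metrics(bytes_read, bytes_write, flops, time_s):
    """Derive OI, achieved FLOP/s, and achieved bandwidth from profiler counters."""
    measured_bytes = bytes_read + bytes_write
    return {
        "oi": flops / measured_bytes,            # FLOP/byte
        "achieved_flops": flops / time_s,        # FLOP/s
        "achieved_bw": measured_bytes / time_s,  # byte/s
    }

# Hypothetical counter values for one kernel invocation.
m = roofline_metrics(bytes_read=8e9, bytes_write=2e9, flops=20e9, time_s=5e-3)
print(m["oi"], m["achieved_flops"], m["achieved_bw"])  # 2.0 FLOP/byte, 4 TFLOP/s, 2 TB/s
```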

Step 4: Classify the kernel

Compare OI to the ridge point. Compare achieved_bw to peak bandwidth. Compare achieved_flops to peak compute.

📊 Kernel Classification from Profiler Data

| Symptom | OI vs Ridge | Likely Bound | Optimization Direction |
| --- | --- | --- | --- |
| High DRAM BW utilization, low SM utilization | OI far below ridge | Memory-bound | Reduce bytes: fuse, quantize, tile |
| High SM utilization, low DRAM BW utilization | OI far above ridge | Compute-bound | Improve math: tensor cores, ILP, precision |
| Both utilizations low | Any | Latency-bound / under-occupied | Fix occupancy, launch config, stalls |
| Both utilizations high | OI near ridge | Balanced (well-optimized) | Algorithmic change needed for further gains |
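The table maps directly onto a small classification routine. A sketch with illustrative utilization thresholds (the 0.4/0.7 cutoffs are assumptions, not Nsight Compute defaults):

```python
def classify(bw_util, flop_util, low=0.4, high=0.7):
    """Map DRAM-BW and SM utilization fractions to a roofline diagnosis."""
    if bw_util < low and flop_util < low:
        return "latency-bound / under-occupied"
    if bw_util >= high and flop_util >= high:
        return "balanced (well-optimized)"
    return "memory-bound" if bw_util > flop_util else "compute-bound"

print(classify(0.92, 0.15))  # memory-bound: reduce bytes
print(classify(0.20, 0.88))  # compute-bound: improve math
print(classify(0.15, 0.10))  # latency-bound / under-occupied
```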

Step 5: Apply 1-2 targeted optimizations

Based on the classification, choose optimizations from the appropriate column. Do not apply random optimizations — the roofline tells you exactly which class of improvement is possible.

Step 6: Re-profile and verify the dot moved

After optimization, re-profile and check:

  • Did OI change? (Should increase if you reduced bytes)
  • Did achieved performance increase? (The dot should move up)
  • Did the classification change? (Moving from memory-bound to compute-bound means the memory optimization succeeded)

If the dot moved in the wrong direction, the optimization had unintended side effects (e.g., reducing bytes but also reducing cache hit rates).

When the Roofline Model Misleads

The roofline model is powerful, but it has blind spots. Understanding when it fails is as important as understanding when it works.

Latency-Bound Operations

The roofline model assumes throughput-limited execution: enough parallel work to keep the hardware busy. When a kernel has insufficient parallelism, it is latency-bound — waiting for individual operations to complete rather than being limited by peak throughput.

Examples:

  • Very small kernels (fewer than a few thousand threads): the GPU cannot fill its pipeline, and the kernel finishes before the hardware reaches steady state. Launch overhead may dominate.
  • Sequential dependencies: if each operation depends on the previous result (long reduction chains, recurrent computations), the kernel cannot exploit parallelism even if each individual operation is fast.
  • Synchronization-heavy kernels: frequent __syncthreads() or atomic operations create bubbles where the hardware sits idle.

The roofline model will show these kernels as far below both ceilings, but the solution is not “improve bandwidth” or “improve compute” — it is to increase parallelism or restructure the algorithm.

Irregular Access Patterns

The roofline model assumes that memory bandwidth is a single number. In reality, bandwidth depends heavily on access patterns:

  • Random scatter/gather: achievable bandwidth may be 10-50x lower than peak due to cache line waste and bank conflicts
  • Sparse matrix operations: the irregular structure means some memory transactions carry useful data and others are padding
  • Graph neural network traversals: neighbor access patterns are data-dependent and unpredictable

For these operations, the measured OI may look fine, but the achieved bandwidth is far below peak. The roofline model correctly identifies them as memory-bound, but the solution is not standard coalescing — it requires algorithmic changes to the access pattern (sorting, binning, padding, blocked sparse formats).

Multi-Kernel Interactions

The roofline model analyzes one kernel at a time. But real workloads involve sequences of kernels, and interactions between them matter:

  • Kernel launch overhead: if you have 100 tiny kernels, the time is dominated by launch latency, not by any individual kernel’s compute or memory performance. The roofline for each kernel looks fine, but end-to-end throughput is poor.
  • Cache pollution: a preceding kernel may evict data that the next kernel needs, causing unexpected cache misses. The roofline model for each kernel in isolation does not capture this.
  • PCIe/NVLink transfers: data movement between host and device, or between GPUs, creates bottlenecks that are invisible to the per-kernel roofline.
  • CUDA graph overhead vs. individual launches: batching kernels into a CUDA graph reduces launch overhead but does not change individual kernel rooflines.

The solution for multi-kernel problems is typically kernel fusion (combine multiple kernels into one) or CUDA graphs (batch launches to reduce overhead).

Misleading FLOP Counts

The roofline model requires accurate FLOP counts, which can be tricky:

  • Wasted FLOPs from masking: attention masking computes values that are immediately zeroed out. The hardware executes FLOPs, but they produce no useful output. The “useful OI” is lower than measured OI.
  • Predicated instructions: divergent warps execute both branches, but only one produces useful results. The FLOP count includes wasted work.
  • Type conversion overhead: converting between FP16 and FP32 consumes ALU cycles but is not counted as “FLOPs” in the traditional sense.
💡 Roofline Is a Model, Not Reality

The roofline model is a first-order approximation. It assumes perfect overlap of compute and memory, no latency effects, uniform bandwidth, and accurate FLOP accounting. When your kernel’s dot is far below both ceilings, the roofline model is telling you that something it does not model is the bottleneck. That is when you need to look at warp stall reasons, occupancy analysis, and instruction-level profiling — tools that complement rather than replace the roofline.

The Roofline Cannot Tell You About Tail Latency

For serving workloads, the p99 latency matters as much as throughput. A kernel that achieves 95% of peak throughput on average but occasionally stalls for 10x longer due to memory controller contention, thermal throttling, or OS interrupts will have a good roofline position but poor tail latency. The roofline model is strictly a throughput model and says nothing about latency distributions.

Putting It All Together: An LLM Serving Example

Consider optimizing a Llama-2 7B decode step on H100. The decode step involves, for each token:

  1. Embedding lookup: negligible compute, pure memory read
  2. 32 transformer layers, each containing:
    • RMSNorm (LayerNorm variant): OI ~1.25 FLOP/byte
    • Q/K/V projection (GEMM, B=1): OI ~1 FLOP/byte
    • Decode attention: OI ~1 FLOP/byte
    • Output projection (GEMM, B=1): OI ~1 FLOP/byte
    • RMSNorm: OI ~1.25 FLOP/byte
    • Gate + Up projection (GEMM, B=1): OI ~1 FLOP/byte
    • SiLU activation: OI ~0.5 FLOP/byte
    • Down projection (GEMM, B=1): OI ~1 FLOP/byte
  3. Final RMSNorm + LM head projection

At batch=1, every single operation is deep in the memory-bound zone. The roofline prescription is clear: reduce bytes. The optimization stack follows directly:

  1. Quantize weights to INT4: reduces GEMM bytes by 4x, pushing OI from ~1 to ~4 FLOP/byte. Still memory-bound, but 4x faster.
  2. Fuse RMSNorm + Q/K/V projection: eliminates the standalone RMSNorm memory traffic.
  3. Fuse SiLU + elementwise multiply: eliminates intermediate writes.
  4. Quantize KV cache to INT8: reduces decode attention bytes by 2x.
  5. Increase batch size to 32: pushes GEMM OI to ~32 FLOP/byte. Still memory-bound on H100 (ridge 296), but 32x better utilization.
  6. Increase batch size to 256+: pushes GEMM OI to ~256 FLOP/byte. Approaching the ridge point. GEMMs are now nearly compute-bound, and the investment in H100 tensor cores is finally justified.

This is the roofline model in action: it converts the abstract question “how do I make my LLM faster?” into a concrete, ordered list of optimizations with predictable impact.
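The batch-size arithmetic above can be checked with a back-of-envelope model: for a weight-dominated GEMM (B×K activations against a K×N weight matrix), FLOPs are 2·B·K·N while traffic is roughly K·N weight bytes, so OI ≈ 2·B / bytes_per_weight. A sketch, assuming H100 peaks of ~989 TFLOP/s FP16 and ~3.35 TB/s HBM3:

```python
PEAK_FLOPS = 989e12           # assumed H100 FP16 tensor-core peak
PEAK_BW = 3.35e12             # assumed H100 HBM3 bandwidth
RIDGE = PEAK_FLOPS / PEAK_BW  # ~295 FLOP/byte

def gemm_oi(batch, bytes_per_weight=2.0):
    """Approximate OI of a B x K @ K x N GEMM when weight traffic dominates:
    2*B*K*N FLOPs over ~K*N*bytes_per_weight bytes."""
    return 2.0 * batch / bytes_per_weight

for b in (1, 32, 256):
    oi = gemm_oi(b)  # FP16 weights (2 bytes): OI ~ batch
    side = "compute" if oi >= RIDGE else "memory"
    print(f"batch={b}: OI ~{oi:.0f} FLOP/byte -> {side}-bound")

# INT4 weights (0.5 byte each) quadruple OI at the same batch size:
print(gemm_oi(1, bytes_per_weight=0.5))  # 4.0
```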

Summary

The roofline model is the most useful single tool for GPU kernel optimization. It tells you:

  • Whether your kernel is limited by compute or memory bandwidth
  • How far your kernel is from the hardware ceiling
  • Which class of optimizations can close the gap
  • When you have reached the hardware limit and need algorithmic changes

The key numbers to remember: operational intensity is FLOPs divided by bytes. The ridge point is peak compute divided by peak bandwidth. Below the ridge, optimize for bandwidth. Above it, optimize for compute. Far below both ceilings, look for latency, occupancy, or parallelism problems.

For LLM workloads specifically, the roofline explains why batch size, quantization, and operator fusion are the three most impactful optimizations — they all increase operational intensity, moving operations rightward on the roofline toward the region where the hardware’s massive compute throughput can actually be utilized.

Measure, classify, optimize, re-measure. That is the roofline workflow, and it eliminates the guesswork that wastes engineering time on optimizations that cannot possibly help.