Every GPU kernel you write or deploy is limited by one of two things: how fast the hardware can compute, or how fast it can move data. Most engineers optimize by instinct, trying loop unrolling when the kernel is memory-bound, or fiddling with memory access patterns when the kernel is compute-bound. The roofline model eliminates the guesswork. It gives you a single diagram that tells you exactly which resource limits your kernel, how far you are from the hardware ceiling, and which class of optimizations can close the gap.
This post builds the roofline model from scratch, populates it with real GPU hardware numbers, places common LLM operations on the chart, walks through operational intensity calculations step by step, shows how to generate roofline plots with Nsight Compute, and lays out optimization strategies keyed to roofline position. It also covers the cases where the roofline model misleads you — because every model has blind spots.
What the Roofline Model Tells You
The core insight of the roofline model is that every kernel has a measurable property called operational intensity (also called arithmetic intensity), and the GPU has two hardware limits. The roofline diagram shows which limit constrains your kernel.
Operational intensity is the ratio of compute work to data movement:

$$\mathrm{OI} = \frac{\text{FLOPs performed}}{\text{bytes moved to and from DRAM}}$$
The units are FLOP/byte. A kernel that performs 100 FLOPs and reads 50 bytes from DRAM has an operational intensity of 2 FLOP/byte. A kernel that performs 100,000 FLOPs and reads 50 bytes has an OI of 2,000 FLOP/byte.
The GPU provides two ceilings:
- Peak compute throughput ($\pi$): the maximum number of floating-point operations per second the hardware can sustain. For example, the A100 can deliver 312 TFLOP/s at FP16 with tensor cores.
- Peak memory bandwidth ($\beta$): the maximum number of bytes per second the hardware can transfer between DRAM and the compute units. The A100 delivers about 2,039 GB/s from HBM2e.
The roofline model says that the maximum achievable performance for a kernel with operational intensity $I$ is:

$$P = \min(\pi,\ \beta \times I)$$
This equation defines two regimes. When $\beta \times I < \pi$, the kernel is memory-bound: performance scales linearly with operational intensity because you are limited by how fast data arrives. When $\beta \times I \geq \pi$, the kernel is compute-bound: performance is flat at $\pi$ because you are limited by how fast the ALUs can crunch numbers.
If your kernel is memory-bound, no amount of instruction-level optimization will help. You need to reduce bytes moved or increase data reuse. If your kernel is compute-bound, optimizing memory access patterns is wasted effort. You need faster math (tensor cores, reduced precision, better instruction mix). The roofline model prevents you from wasting time on the wrong class of optimization.
The ridge point is the operational intensity where the two ceilings meet:

$$I_{\text{ridge}} = \frac{\pi}{\beta}$$
Kernels with $I < I_{\text{ridge}}$ are memory-bound. Kernels with $I > I_{\text{ridge}}$ are compute-bound. The ridge point is a property of the hardware, not of your kernel.
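This minimum-of-two-ceilings rule is short enough to sketch directly. A minimal illustration in plain Python, using the A100 FP16 numbers quoted above (the function names are mine, not from any library):

```python
def attainable_tflops(oi, peak_tflops=312.0, peak_bw_gbs=2039.0):
    """Roofline: attainable performance is min(compute ceiling, bandwidth * OI).

    oi is in FLOP/byte; GB/s times FLOP/byte yields GFLOP/s, so divide
    by 1000 to express the memory-bound term in TFLOP/s.
    """
    return min(peak_tflops, peak_bw_gbs * oi / 1000.0)

def ridge_point(peak_tflops=312.0, peak_bw_gbs=2039.0):
    """Operational intensity where the two ceilings meet, in FLOP/byte."""
    return peak_tflops * 1000.0 / peak_bw_gbs

# A100 FP16 tensor core: ridge at ~153 FLOP/byte.
# attainable_tflops(1)    -> ~2 TFLOP/s   (memory-bound regime)
# attainable_tflops(1000) -> 312 TFLOP/s  (compute-bound regime)
```

The `min` is the entire model: below the ridge the bandwidth term wins, above it the compute ceiling wins.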
Building a Roofline Diagram
A roofline diagram is a log-log plot with two axes:
- X-axis: operational intensity (FLOP/byte), logarithmic scale
- Y-axis: achievable performance (TFLOP/s or GFLOP/s), logarithmic scale
The diagram has two lines:
- Memory-bound slope: a diagonal line where $P = \beta \times I$. On a log-log plot, this is a straight line with slope 1. It starts from the lower left and rises to the right.
- Compute ceiling: a horizontal line at $P = \pi$. This is the maximum compute throughput regardless of how much data reuse you achieve.
These two lines form the “roofline” shape — a sloped line that hits a flat ceiling. Every kernel you profile becomes a dot on this chart. The vertical distance between your dot and the roofline is the gap between your achieved performance and the hardware limit. The horizontal position tells you which limit applies.
Constructing the Lines
For the memory-bound slope, pick a few OI values and compute $\beta \times I$:
- At $I = 1$ FLOP/byte: $2{,}039 \text{ GB/s} \times 1 \approx 2$ TFLOP/s (for A100)
- At $I = 10$ FLOP/byte: $\approx 20$ TFLOP/s
- At $I = 100$ FLOP/byte: $\approx 204$ TFLOP/s
The compute ceiling is a horizontal line at $\pi = 312$ TFLOP/s (A100 FP16 tensor core).
The ridge point is where these meet:

$$I_{\text{ridge}} = \frac{312 \text{ TFLOP/s}}{2{,}039 \text{ GB/s}} \approx 153 \text{ FLOP/byte}$$

On the log-log plot, the two lines intersect at $I = 153$ FLOP/byte. Everything to the left of the ridge is in the memory-bound region; everything to the right is compute-bound.
A100 Roofline Diagram (FP16 Tensor Core)
| OI (FLOP/byte) | 0.5 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 153 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Memory ceiling (BW × OI, TFLOP/s) | 1.0 | 2.0 | 4.1 | 8.2 | 16.3 | 32.6 | 65.2 | 130 | 261 | 312 | 522 | 1,044 |
| Compute ceiling (TFLOP/s) | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 |
The blue line represents the memory bandwidth ceiling: performance grows linearly with operational intensity because the bottleneck is data delivery. The red line is the flat compute ceiling. The roofline is the minimum of both — the blue line until it meets the red line, then the red line from there on.
Adding Sub-Ceilings
The basic roofline has two lines, but real hardware has additional constraints that create sub-ceilings below the main roofline:
- L2 cache bandwidth ceiling: if your kernel’s working set fits in L2, the effective bandwidth is higher (often 4-8x HBM bandwidth), shifting the memory slope upward.
- Shared memory bandwidth ceiling: even higher bandwidth from SRAM, but limited capacity.
- Non-tensor-core compute ceiling: if your kernel cannot use tensor cores (e.g., transcendental functions, integer ops), the compute ceiling drops significantly.
- FP32 vs FP16 ceilings: different data types have different peak throughputs.
Each sub-ceiling adds another line to the diagram, creating a “staircase” of rooflines. Your kernel is bounded by the lowest applicable ceiling.
Hardware Roofline for Popular GPUs
Let us build roofline parameters for the GPUs that dominate LLM serving today. I will show both theoretical peak and practical numbers (roughly 80% of peak compute and ~88% of peak bandwidth), because no kernel achieves 100% of theoretical throughput.
GPU Roofline Parameters (FP16 Tensor Core)
| GPU | Peak Compute (TFLOP/s) | Practical Compute (~80%) | Peak BW (GB/s) | Practical BW (~88%) | Ridge Point (Theoretical) | Ridge Point (Practical) |
|---|---|---|---|---|---|---|
| V100 (SXM2) | 125 | 100 | 900 | 790 | 139 FLOP/byte | 127 FLOP/byte |
| A100 (SXM) | 312 | 250 | 2,039 | 1,794 | 153 FLOP/byte | 139 FLOP/byte |
| H100 (SXM) | 990 | 792 | 3,350 | 2,948 | 296 FLOP/byte | 269 FLOP/byte |
| H200 (SXM) | 990 | 792 | 4,800 | 4,224 | 206 FLOP/byte | 188 FLOP/byte |
Each GPU generation increases compute throughput faster than memory bandwidth. The V100 ridge point is 139 FLOP/byte; the H100 ridge point is 296 FLOP/byte. This means more operations become memory-bound on newer hardware. An operation that was compute-bound on V100 may be memory-bound on H100. This trend makes bandwidth optimization increasingly critical.
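The ridge-point columns in the table can be reproduced directly from the peak columns. A quick check in plain Python (the dictionary layout is mine; the numbers are the theoretical FP16 peaks from the table above):

```python
# Ridge point = peak compute / peak bandwidth, per GPU.
# Values: (peak FLOP/s, peak bytes/s), theoretical FP16 tensor core numbers.
gpus = {
    "V100": (125e12, 900e9),
    "A100": (312e12, 2039e9),
    "H100": (990e12, 3350e9),
    "H200": (990e12, 4800e9),
}

ridge = {name: flops / bw for name, (flops, bw) in gpus.items()}
# V100 ~139, A100 ~153, H100 ~296, H200 ~206 FLOP/byte
```

Note how H200 keeps H100's compute but adds bandwidth, pulling its ridge point back down and making fewer operations memory-bound.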
Why Practical Efficiency Matters
Theoretical peak numbers assume perfect utilization of every cycle, every functional unit, every memory channel. In practice:
- Compute efficiency of 80% is good for large GEMMs using tensor cores. Smaller kernels, non-GEMM operations, and kernels with divergence typically achieve 40-70%.
- Bandwidth efficiency of 85-91% is achievable with coalesced access patterns. Strided or random access can drop to 20-50%.
The practical roofline is the one you should use for target-setting. If your kernel achieves 75% of the practical ceiling, that is generally excellent. If it achieves 30%, there is significant room for optimization.
Memory Bandwidth Across Platforms
| Platform | Memory Type | Theoretical BW | Achieved BW (STREAM) | Efficiency |
|---|---|---|---|---|
| Intel Xeon 8280 (2S) | DDR4-2933 | 281 GB/s | ~210 GB/s | 75% |
| AMD EPYC 7742 (2S) | DDR4-3200 | 410 GB/s | ~340 GB/s | 83% |
| NVIDIA V100 | HBM2 | 900 GB/s | ~820 GB/s | 91% |
| NVIDIA A100 | HBM2e | 2,039 GB/s | ~1,800 GB/s | 88% |
| NVIDIA H100 (SXM) | HBM3 | 3,350 GB/s | ~2,900 GB/s | 87% |
The GPU advantage over CPUs is stark: HBM delivers 5-15x more bandwidth than DDR. This is the primary reason GPUs dominate AI workloads — not raw compute (where CPUs have narrowed the gap with AVX-512 and AMX), but memory bandwidth.
LLM Operations on the Roofline
Large language model inference involves a handful of distinct operation types, each with very different operational intensities. Placing them on the roofline reveals why LLM serving is so challenging to optimize.
Decode Attention (Autoregressive Token Generation)
During decode, each new token attends to all previous tokens. For a single query token attending to $N$ key-value pairs with head dimension $d$:
- FLOPs: approximately $4Nd$ (two matmuls, $QK^T$ and $\text{attn} \times V$, each $2Nd$ FLOPs)
- Bytes: you must load the entire KV cache: $2Nd \cdot b$ bytes, where $b$ is bytes per element (K and V matrices)
- Operational intensity: $\mathrm{OI} = \frac{4Nd}{2Nd \cdot b} = \frac{2}{b}$

For FP16 ($b = 2$ bytes per element): $\mathrm{OI} = 1$ FLOP/byte. For an INT8 KV cache ($b = 1$): $\mathrm{OI} = 2$ FLOP/byte.
This is extremely low — deep in the memory-bound zone on every GPU. Decode attention is fundamentally a memory-bandwidth problem. The kernel spends almost all its time loading KV cache from HBM, with minimal compute per byte loaded.
With batching, you amortize the KV cache reads across queries. The FLOPs scale as $4BNd$ while the bytes stay roughly $2Nd \cdot b$ (shared KV cache), so $\mathrm{OI} \approx \frac{2B}{b}$. At batch size 32 with FP16: $\mathrm{OI} = 32$ FLOP/byte. Still memory-bound on H100 (ridge at 296), but much better than batch=1.
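The convenient cancellation of $N$ and $d$ makes this a one-liner. A sketch of the calculation (plain Python; the function name is illustrative):

```python
def decode_attention_oi(batch=1, bytes_per_elem=2):
    """OI of decode attention with a KV cache shared across the batch.

    FLOPs ~ 4*B*N*d (QK^T plus attn x V per query); bytes ~ 2*N*d*b for
    K and V. N and d cancel, leaving OI = 2*B / bytes_per_elem.
    """
    return 2 * batch / bytes_per_elem

# FP16, batch 1:  OI = 1 FLOP/byte
# FP16, batch 32: OI = 32 FLOP/byte
# INT8 KV cache, batch 1: OI = 2 FLOP/byte
```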
Prefill Attention (Processing the Input Prompt)
During prefill, you process $N$ query tokens against $N$ key-value pairs simultaneously. The attention computation involves:
- FLOPs: approximately $4N^2 d$ per head
- Bytes: $3Nd \cdot b$ (loading Q, K, V matrices) plus output writes

The operational intensity scales linearly with sequence length $N$:

$$\mathrm{OI} \approx \frac{4N^2 d}{3Nd \cdot b} = \frac{4N}{3b}$$

For $N = 2048$ at FP16: $\mathrm{OI} \approx 1{,}365$ FLOP/byte, well into compute-bound territory. For $N = 128$ at FP16: $\mathrm{OI} \approx 85$ FLOP/byte, near the ridge point on A100 and still memory-bound on H100.
This explains why FlashAttention’s primary benefit for prefill is reducing memory footprint (avoiding materializing the attention matrix), while for short sequences the bandwidth savings from tiling into SRAM dominate.
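The linear scaling with sequence length can be sketched as follows (plain Python; the function name is mine, and output writes are ignored, as in the approximation above):

```python
def prefill_attention_oi(seq_len, bytes_per_elem=2):
    """Approximate prefill attention OI: 4*N^2*d FLOPs over 3*N*d*b bytes
    (Q, K, V loads only), which simplifies to 4*N / (3*b)."""
    return 4 * seq_len / (3 * bytes_per_elem)

# N=2048, FP16: ~1365 FLOP/byte (compute-bound on any current GPU)
# N=128,  FP16: ~85 FLOP/byte   (memory-bound on A100 and H100)
```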
Large GEMM (Feed-Forward Network Layers)
The FFN in a transformer typically involves two large matrix multiplications. For a weight matrix of shape $[d_{\text{model}}, d_{\text{ff}}]$ applied to an activation matrix of shape $[B, d_{\text{model}}]$:
- FLOPs: $2 \cdot B \cdot d_{\text{model}} \cdot d_{\text{ff}}$
- Bytes: $(d_{\text{model}} \cdot d_{\text{ff}} + B \cdot d_{\text{model}} + B \cdot d_{\text{ff}}) \cdot b$

For a typical LLM FFN with $d_{\text{model}} = 4096$, $d_{\text{ff}} = 11008$ (Llama-2 7B intermediate size), batch $B = 256$, FP16:
- FLOPs: $2 \cdot 256 \cdot 4096 \cdot 11008 \approx 23.1$ GFLOP
- Bytes: $(4096 \cdot 11008 + 256 \cdot 4096 + 256 \cdot 11008) \cdot 2 \approx 97.9$ MB
- OI: $\approx 236$ FLOP/byte
This is comfortably above the ridge point on A100 (153) and solidly compute-bound on V100 (139). On H100 (ridge 296), it is still slightly memory-bound. Increasing batch size pushes the OI higher because the weight matrix bytes are amortized.
LayerNorm and Softmax
These element-wise or reduction operations have very low operational intensity:
- LayerNorm: reads $N$ elements, computes mean and variance (about $5N$ FLOPs), writes $N$ elements. $\mathrm{OI} \approx \frac{5N}{4N} \approx 1.25$ FLOP/byte at FP16.
- Softmax: reads $N$ elements, computes max, exponentiation, sum, and division (about $4N$ to $5N$ FLOPs), writes $N$ elements. Similar OI of approximately 1-2 FLOP/byte.
These are solidly memory-bound on every GPU. The optimization strategy is to fuse them into adjacent compute-bound kernels (like fusing LayerNorm into the subsequent GEMM) so they never launch as standalone memory-bound kernels.
LLM Operations on the Roofline
| Operation | Typical OI (FLOP/byte) | Region on A100 | Region on H100 | Primary Bottleneck |
|---|---|---|---|---|
| Decode attention (B=1) | 1-2 | Deep memory-bound | Deep memory-bound | KV cache loading |
| Decode attention (B=32) | 16-32 | Memory-bound | Memory-bound | KV cache loading |
| Prefill attention (N=2048) | 500-1000 | Compute-bound | Compute-bound | Matmul throughput |
| Prefill attention (N=128) | 32-64 | Memory-bound | Memory-bound | Data loading |
| FFN GEMM (B=256) | 100-250 | Near ridge / compute | Memory-bound | Depends on shape |
| FFN GEMM (B=1) | 1-2 | Deep memory-bound | Deep memory-bound | Weight loading |
| LayerNorm | 1-2 | Deep memory-bound | Deep memory-bound | Elementwise I/O |
| Softmax | 1-2 | Deep memory-bound | Deep memory-bound | Elementwise I/O |
| RoPE embedding | 2-4 | Memory-bound | Memory-bound | Elementwise I/O |
Operational Intensity Spectrum of LLM Operations
The picture is clear: most LLM serving time is spent on memory-bound operations, especially during the decode (token generation) phase. This is the fundamental reason why LLM inference optimization focuses so heavily on reducing memory traffic: quantization, KV cache compression, operator fusion, and batching.
How to Calculate Operational Intensity
Let us walk through the calculation in detail for the most important operation in deep learning: matrix multiplication.
Matrix Multiply: $C = A \times B$

Given $A$ of shape $[M, K]$ and $B$ of shape $[K, N]$, producing $C$ of shape $[M, N]$:

FLOPs: Each element of $C$ requires $K$ multiplications and $K$ additions, which we approximate as $2K$ FLOPs. There are $M \times N$ output elements. Total: $2MNK$ FLOPs.

Bytes: We must load $A$ ($MK$ elements), load $B$ ($KN$ elements), and write $C$ ($MN$ elements). With $b$ bytes per element, total bytes:

$$\text{bytes} = (MK + KN + MN) \cdot b$$

Operational intensity:

$$\mathrm{OI} = \frac{2MNK}{(MK + KN + MN) \cdot b}$$
How Shape Affects Intensity
The critical insight is that operational intensity depends on the matrix dimensions. Let us examine several cases:
Square matrices ($M = N = K$):

$$\mathrm{OI} = \frac{2N^3}{3N^2 \cdot b} = \frac{2N}{3b}$$

At FP16 ($b = 2$): $\mathrm{OI} = N/3$. For $N = 4096$: $\mathrm{OI} \approx 1{,}365$ FLOP/byte. Solidly compute-bound on any GPU.
Tall-skinny matrix times fat matrix (batch=1 inference: $A$ of shape $[1, K]$, weight matrix $B$ of shape $[K, N]$):

$$\mathrm{OI} = \frac{2KN}{(K + KN + N) \cdot b} \approx \frac{2}{b}$$

At FP16: $\mathrm{OI} \approx 1$ FLOP/byte. This is the worst case: you load the entire weight matrix to produce a single output vector. Catastrophically memory-bound.
How batch size rescues intensity ($A$ of shape $[B, K]$, large weight matrix $[K, N]$):

When $B \ll K$ and $B \ll N$ (typical for inference), the $KN$ term dominates the denominator:

$$\mathrm{OI} \approx \frac{2BKN}{KN \cdot b} = \frac{2B}{b}$$

At FP16: $\mathrm{OI} \approx B$. So batch size 1 gives OI of 1, batch size 32 gives OI of 32, batch size 256 gives OI of 256. This linear relationship between batch size and operational intensity is why batching is the single most effective optimization for inference throughput.
On A100 (ridge ~153 FLOP/byte at FP16), you need a batch size of roughly 153 to reach the ridge point for weight-dominated GEMMs. On H100 (ridge ~296), you need a batch of ~296. Below these thresholds, your GEMMs are memory-bound and you are paying for compute you cannot use. This is why continuous batching and dynamic batching are essential for efficient LLM serving.
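All of the shape cases above fall out of one function. A sketch in plain Python (the function name is mine):

```python
def gemm_oi(m, n, k, bytes_per_elem=2):
    """OI of C[M,N] = A[M,K] @ B[K,N]:
    2*M*N*K FLOPs over (M*K + K*N + M*N) * b bytes."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

# Square 4096^3, FP16:              ~1365 FLOP/byte (compute-bound)
# Batch-1 row vs. 4096x11008, FP16: ~1 FLOP/byte    (memory-bound)
# Batch B vs. a large weight matrix: OI ~ B at FP16
```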
Worked Example: Llama-2 7B FFN Layer
The Llama-2 7B FFN uses a gated architecture with three weight matrices:
- Gate projection: $W_{\text{gate}} \in [4096, 11008]$
- Up projection: $W_{\text{up}} \in [4096, 11008]$
- Down projection: $W_{\text{down}} \in [11008, 4096]$
For a batch of $B = 64$ tokens at FP16:
Gate projection ($[64, 4096] \times [4096, 11008]$):
- FLOPs: $2 \cdot 64 \cdot 4096 \cdot 11008 \approx 5.77$ GFLOP
- Bytes: $(4096 \cdot 11008 + 64 \cdot 4096 + 64 \cdot 11008) \cdot 2 \approx 92.1$ MB
- OI: $\approx 62$ FLOP/byte
At OI ≈ 62, this operation is memory-bound on both A100 (ridge 153) and H100 (ridge 296). The weight matrix ($\approx 90$ MB at FP16) dominates the data transfer, and we only get 64 uses out of it.
To make this compute-bound on A100, we need $B \gtrsim 153$. On H100, we need $B \gtrsim 296$.
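The gate-projection arithmetic can be checked mechanically (plain Python; variable names are mine):

```python
# Verify the gate-projection numbers above (B=64, FP16, Llama-2 7B shapes).
d_model, d_ff, B, b = 4096, 11008, 64, 2

flops = 2 * B * d_model * d_ff                              # ~5.77e9 FLOPs
bytes_moved = (d_model * d_ff + B * d_model + B * d_ff) * b  # ~92.1e6 bytes
oi = flops / bytes_moved                                     # ~62.7 FLOP/byte
weight_mb = d_model * d_ff * b / 1e6                         # ~90 MB of weights
```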
Using Nsight Compute Roofline
NVIDIA’s Nsight Compute (ncu) includes a built-in roofline analysis mode that automates much of what we have discussed. Here is how to use it effectively.
Generating the Roofline Plot
To collect roofline data for a specific kernel:
# Collect roofline metrics for all kernels
ncu --set roofline -o profile_output ./my_application
# Collect for a specific kernel by name
ncu --set roofline --kernel-name "my_kernel_name" -o profile_output ./my_application
# Collect with additional metrics for deeper analysis
ncu --set roofline --metrics \
dram__bytes_read.sum,dram__bytes_write.sum,\
sm__sass_thread_inst_executed_op_fadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_fmul_pred_on.sum,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum \
-o profile_output ./my_application
Open the resulting .ncu-rep file in Nsight Compute’s GUI. Navigate to the “Roofline” section of any kernel’s detailed view. You will see your kernel plotted as a dot on the roofline chart.
Reading the Roofline Plot
Nsight Compute’s roofline plot shows:
- Multiple ceilings: separate lines for FP64, FP32, FP16, INT8 compute, plus lines for HBM, L2, and L1/shared memory bandwidth.
- Your kernel as a dot: the horizontal position is the measured operational intensity, and the vertical position is the measured achieved performance.
- Distance to ceiling: the vertical gap between your dot and the nearest ceiling line.
Interpreting Nsight Compute Roofline Position
| Dot Position | Interpretation | Action |
|---|---|---|
| Close to memory ceiling, far left | Memory-bound, near peak BW utilization | Good BW usage; reduce bytes or increase reuse |
| Below memory ceiling, far left | Memory-bound, poor BW utilization | Fix access patterns: coalescing, alignment, caching |
| Close to compute ceiling, far right | Compute-bound, near peak compute utilization | Good; consider tensor cores if not already used |
| Below compute ceiling, far right | Compute-bound, poor compute utilization | Improve ILP, use tensor cores, reduce divergence |
| Below both ceilings, moderate OI | Latency-bound or under-occupied | Fix occupancy, reduce stalls, improve launch config |
Key Metrics from Nsight Compute
Beyond the roofline plot, these metrics help you understand your kernel’s position:
# Memory metrics
dram__bytes_read.sum -- Total bytes read from HBM
dram__bytes_write.sum -- Total bytes written to HBM
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum -- Global load bytes through L1
lts__t_bytes.sum -- L2 cache total bytes
# Compute metrics
sm__inst_executed_pipe_tensor.sum -- Tensor core instructions
sm__sass_thread_inst_executed.sum -- Total SASS instructions
sm__inst_executed_pipe_fma.sum -- FMA pipe instructions
# Utilization
sm__warps_active.avg_pct_of_peak_sustained_active -- Occupancy
gpu__time_duration.sum -- Kernel duration
Calculating achieved metrics from counters (FADD and FMUL count as one FLOP each, FFMA as two):

measured_bytes = dram__bytes_read.sum + dram__bytes_write.sum
measured_flops = fadd.sum + fmul.sum + 2 * ffma.sum
achieved_bw    = measured_bytes / gpu__time_duration.sum
achieved_flops = measured_flops / gpu__time_duration.sum
OI             = measured_flops / measured_bytes
What It Means When a Kernel Is Far from the Ceiling
If your kernel’s dot is far below the applicable ceiling, the gap represents inefficiency that can be recovered. The causes depend on which ceiling applies:
Far below memory ceiling (memory-bound kernel with poor bandwidth utilization):
- Non-coalesced memory accesses: threads in a warp access non-contiguous addresses, wasting bus transactions
- Partition camping: uneven distribution of accesses across memory channels
- Excessive L2 cache misses from random access patterns
- Small transfers that do not saturate the memory bus
Far below compute ceiling (compute-bound kernel with poor compute utilization):
- Not using tensor cores when the operation supports them
- High register pressure causing low occupancy and pipeline stalls
- Branch divergence within warps
- Integer or special function unit bottlenecks not captured by FLOP counting
- Memory latency stalls that are hidden by the OI calculation but not by actual execution
Far below both ceilings (the worst case):
- The kernel is latency-bound, not throughput-bound
- Insufficient parallelism (too few warps, too small grid)
- Synchronization barriers causing idle time
- Kernel launch overhead dominating short kernels
Optimization Strategy Based on Roofline Position
The roofline model does not just diagnose — it prescribes. Each position on the diagram maps to a specific class of optimizations.
Below the Memory Ceiling: Improve Memory Access Patterns
Your kernel is memory-bound but not achieving peak bandwidth. The goal is to move the dot upward toward the memory ceiling.
Coalescing: Ensure threads in a warp access contiguous 128-byte aligned memory segments. A fully strided warp load can touch 32 separate cache lines, consuming up to 32x the bandwidth of a coalesced one.
// Bad: strided access, each thread reads from a different cache line
float val = data[threadIdx.x * stride];
// Good: coalesced access, adjacent threads read adjacent addresses
float val = data[blockIdx.x * blockDim.x + threadIdx.x];
L2 cache residency control (A100+): Use cudaAccessPropertyPersisting to pin frequently-accessed data in L2, effectively increasing bandwidth for that data.
cudaStreamAttrValue attr;
attr.accessPolicyWindow.base_ptr = kv_cache_ptr;
attr.accessPolicyWindow.num_bytes = cache_size;
attr.accessPolicyWindow.hitRatio = 1.0f;
attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
Vectorized loads: Use float4 or int4 loads to read 16 bytes per thread per transaction instead of 4.
// 4x bandwidth utilization per instruction
float4 vec = reinterpret_cast<float4*>(data)[idx];
Prefetching: Use __pipeline_memcpy_async or cp.async to overlap data movement with compute.
On the Memory Ceiling: Reduce Bytes or Increase Reuse
Your kernel achieves good bandwidth utilization but is limited by the total bytes it must move. The goal is to move the dot to the right (increase OI) or reduce total bytes.
Operator fusion: Combine multiple memory-bound kernels into one. If LayerNorm, bias-add, and activation are three separate kernels, each reads and writes the full tensor. Fusing them into one kernel reduces memory traffic by 3x.
Quantization: Reduce from FP16 to INT8 or INT4. This halves or quarters the bytes transferred, directly increasing OI by 2-4x.
Shared memory tiling: Load data into shared memory (SRAM) once, then reuse it multiple times from fast on-chip storage.
KV cache compression: For decode attention, compressing the KV cache (quantization, eviction of unimportant tokens, sliding window) directly reduces the dominant memory cost.
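The traffic arithmetic behind operator fusion is easy to verify. A toy accounting sketch (plain Python; assumes each standalone elementwise kernel reads and writes the full tensor exactly once, and names are mine):

```python
def traffic_bytes(n_elems, n_kernels, bytes_per_elem=2):
    """DRAM traffic if n_kernels elementwise kernels each do one full
    read and one full write of an n_elems tensor."""
    return n_kernels * 2 * n_elems * bytes_per_elem

n = 1 << 20                       # a 1M-element FP16 tensor
separate = traffic_bytes(n, 3)    # LayerNorm, bias-add, activation separately
fused    = traffic_bytes(n, 1)    # one fused kernel: one read, one write
# separate / fused == 3.0 -> the ~3x traffic reduction claimed above
```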
Bandwidth Optimization Techniques
| Technique | Bytes Reduction | OI Improvement | Applicable Operations |
|---|---|---|---|
| FP16 to INT8 quantization | 2x | 2x | Weight loading, KV cache |
| FP16 to INT4 quantization | 4x | 4x | Weight loading |
| Operator fusion (3 ops) | ~3x | ~3x | LayerNorm + bias + activation |
| Shared memory tiling | 10-50x reuse | 10-50x | GEMM, convolution |
| Continuous batching | B x amortization | ~B x | All weight-dominated ops |
| KV cache sliding window | window/total reduction | Proportional | Decode attention |
Below the Compute Ceiling: Improve Instruction Throughput
Your kernel is compute-bound but not achieving peak FLOP/s. The goal is to move the dot upward toward the compute ceiling.
Use tensor cores: The gap between tensor core throughput and CUDA core throughput is enormous (8-16x on modern GPUs). If your operation can be expressed as matrix multiply-accumulate, use wmma or mma intrinsics, or rely on cuBLAS/CUTLASS.
// Using wmma for 16x16x16 FP16 matrix multiply
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
wmma::load_matrix_sync(a_frag, a_ptr, lda);
wmma::load_matrix_sync(b_frag, b_ptr, ldb);
wmma::fill_fragment(c_frag, 0.0f);
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
Increase instruction-level parallelism (ILP): Unroll loops so the compiler can overlap independent instructions. Each thread should have multiple independent operations in flight.
Reduce branch divergence: Ensure threads within a warp take the same execution path. Divergent branches serialize execution, halving (or worse) throughput.
Epilogue fusion: Fuse pointwise operations (bias add, activation, residual add) into the GEMM epilogue so they execute while the tensor cores are finishing the last tile, hiding their cost entirely.
Near Both Ceilings: The Kernel Is Well-Optimized
If your kernel’s dot is close to the roofline (within 80% of the ceiling), the kernel is well-optimized for the current algorithm. Further gains require algorithmic changes:
- Change the mathematical formulation (e.g., FlashAttention replaces standard attention with a tiled, fused algorithm)
- Use a different precision (FP8 instead of FP16)
- Use a different hardware feature (sparse tensor cores for 2:4 structured sparsity)
- Redesign the data layout for the next kernel in the pipeline
Optimization Decision Matrix
| Roofline Position | Symptom (Nsight Compute) | Primary Optimization | Secondary Optimization |
|---|---|---|---|
| Below memory ceiling | High DRAM BW %, achieved BW far from peak | Fix coalescing and alignment | Vectorize loads (float4) |
| On memory ceiling | DRAM BW near peak, low compute % | Fuse operators to reduce traffic | Quantize to lower precision |
| Below compute ceiling | High SM %, achieved FLOP/s far from peak | Enable tensor cores | Increase ILP, reduce divergence |
| On compute ceiling | SM %, FLOP/s near peak | Algorithmic change required | Try lower precision (FP8) |
| Below both ceilings | Both utilizations low | Fix occupancy and launch config | Check for sync stalls |
Multi-Level Memory and the Extended Roofline
The basic two-line roofline model considers only HBM bandwidth. But modern GPUs have a deep memory hierarchy, and understanding which level your kernel actually operates at can change the analysis.
GPU Memory Hierarchy Bandwidth (A100)
Each memory level has a different bandwidth ceiling, creating a “staircase roofline.” If your kernel’s working set fits in L2 cache, the effective bandwidth ceiling is 3-4x higher than HBM, meaning the ridge point shifts right and your kernel may actually be more compute-bound than the HBM-only analysis suggests.
This matters for operations like:
- Small matrix multiplies: if weights fit in L2, OI measured at HBM undercounts effective intensity
- Fused kernels: intermediate results stay in shared memory, never touching HBM
- Repeated access to the same data: L2 hit rates of 80%+ effectively multiply bandwidth
The extended roofline approach is to measure bytes at each level separately and determine which level is the true bottleneck. Nsight Compute provides this breakdown through L1, L2, and DRAM traffic counters.
HBM Traffic Reduction Through Memory Hierarchy

A Complete Roofline Workflow
Here is the step-by-step workflow for using the roofline model to optimize any GPU kernel:
Step 1: Profile the kernel
ncu --set roofline --metrics \
dram__bytes_read.sum,dram__bytes_write.sum,\
gpu__time_duration.sum,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum \
--kernel-name "target_kernel" \
-o roofline_profile ./my_app
Step 2: Compute operational intensity from counters
Do not estimate OI from the algorithm — measure it from hardware counters. The actual bytes moved may differ from the theoretical minimum due to cache behavior, alignment, and redundant loads.
measured_bytes = dram__bytes_read.sum + dram__bytes_write.sum
measured_flops = (from instruction counters or known algorithm)
OI = measured_flops / measured_bytes
Step 3: Compute achieved performance
kernel_time = gpu__time_duration.sum
achieved_flops = measured_flops / kernel_time
achieved_bw = measured_bytes / kernel_time
Step 4: Classify the kernel
Compare to the ridge point. Compare to peak bandwidth. Compare to peak compute.
Kernel Classification from Profiler Data
| Symptom | OI vs Ridge | Likely Bound | Optimization Direction |
|---|---|---|---|
| High DRAM BW utilization, low SM utilization | OI far below ridge | Memory-bound | Reduce bytes: fuse, quantize, tile |
| High SM utilization, low DRAM BW utilization | OI far above ridge | Compute-bound | Improve math: tensor cores, ILP, precision |
| Both utilizations low | Any | Latency-bound / under-occupied | Fix occupancy, launch config, stalls |
| Both utilizations high | OI near ridge | Balanced (well-optimized) | Algorithmic change needed for further gains |
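The classification logic in the table can be sketched as a small helper. This is a hypothetical function with illustrative thresholds, not part of Nsight Compute or any profiler:

```python
def classify_kernel(oi, achieved_tflops, peak_tflops, peak_bw_gbs,
                    bound_threshold=0.7):
    """Classify a profiled kernel from its roofline position.

    oi in FLOP/byte, performance in TFLOP/s, bandwidth in GB/s.
    The 0.7 utilization threshold is an arbitrary illustrative cutoff.
    """
    ridge = peak_tflops * 1000.0 / peak_bw_gbs
    ceiling = min(peak_tflops, peak_bw_gbs * oi / 1000.0)
    utilization = achieved_tflops / ceiling
    if utilization < bound_threshold:
        return "latency-bound or underutilized"
    return "memory-bound" if oi < ridge else "compute-bound"
```

For A100 FP16 numbers, a kernel at OI 1 achieving 1.8 TFLOP/s classifies as memory-bound (near its 2 TFLOP/s ceiling), while the same OI at 0.2 TFLOP/s classifies as latency-bound.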
Step 5: Apply 1-2 targeted optimizations
Based on the classification, choose optimizations from the appropriate column. Do not apply random optimizations — the roofline tells you exactly which class of improvement is possible.
Step 6: Re-profile and verify the dot moved
After optimization, re-profile and check:
- Did OI change? (Should increase if you reduced bytes)
- Did achieved performance increase? (The dot should move up)
- Did the classification change? (Moving from memory-bound to compute-bound means the memory optimization succeeded)
If the dot moved in the wrong direction, the optimization had unintended side effects (e.g., reducing bytes but also reducing cache hit rates).
When the Roofline Model Misleads
The roofline model is powerful, but it has blind spots. Understanding when it fails is as important as understanding when it works.
Latency-Bound Operations
The roofline model assumes throughput-limited execution: enough parallel work to keep the hardware busy. When a kernel has insufficient parallelism, it is latency-bound — waiting for individual operations to complete rather than being limited by peak throughput.
Examples:
- Very small kernels (fewer than a few thousand threads): the GPU cannot fill its pipeline, and the kernel finishes before the hardware reaches steady state. Launch overhead may dominate.
- Sequential dependencies: if each operation depends on the previous result (long reduction chains, recurrent computations), the kernel cannot exploit parallelism even if each individual operation is fast.
- Synchronization-heavy kernels: frequent __syncthreads() or atomic operations create bubbles where the hardware sits idle.
The roofline model will show these kernels as far below both ceilings, but the solution is not “improve bandwidth” or “improve compute” — it is to increase parallelism or restructure the algorithm.
Irregular Access Patterns
The roofline model assumes that memory bandwidth is a single number. In reality, bandwidth depends heavily on access patterns:
- Random scatter/gather: achievable bandwidth may be 10-50x lower than peak due to cache line waste and bank conflicts
- Sparse matrix operations: the irregular structure means some memory transactions carry useful data and others are padding
- Graph neural network traversals: neighbor access patterns are data-dependent and unpredictable
For these operations, the measured OI may look fine, but the achieved bandwidth is far below peak. The roofline model correctly identifies them as memory-bound, but the solution is not standard coalescing — it requires algorithmic changes to the access pattern (sorting, binning, padding, blocked sparse formats).
Multi-Kernel Interactions
The roofline model analyzes one kernel at a time. But real workloads involve sequences of kernels, and interactions between them matter:
- Kernel launch overhead: if you have 100 tiny kernels, the time is dominated by launch latency, not by any individual kernel’s compute or memory performance. The roofline for each kernel looks fine, but end-to-end throughput is poor.
- Cache pollution: a preceding kernel may evict data that the next kernel needs, causing unexpected cache misses. The roofline model for each kernel in isolation does not capture this.
- PCIe/NVLink transfers: data movement between host and device, or between GPUs, creates bottlenecks that are invisible to the per-kernel roofline.
- CUDA graph overhead vs. individual launches: batching kernels into a CUDA graph reduces launch overhead but does not change individual kernel rooflines.
The solution for multi-kernel problems is typically kernel fusion (combine multiple kernels into one) or CUDA graphs (batch launches to reduce overhead).
Misleading FLOP Counts
The roofline model requires accurate FLOP counts, which can be tricky:
- Wasted FLOPs from masking: attention masking computes values that are immediately zeroed out. The hardware executes FLOPs, but they produce no useful output. The “useful OI” is lower than measured OI.
- Predicated instructions: divergent warps execute both branches, but only one produces useful results. The FLOP count includes wasted work.
- Type conversion overhead: converting between FP16 and FP32 consumes ALU cycles but is not counted as “FLOPs” in the traditional sense.
The roofline model is a first-order approximation. It assumes perfect overlap of compute and memory, no latency effects, uniform bandwidth, and accurate FLOP accounting. When your kernel’s dot is far below both ceilings, the roofline model is telling you that something it does not model is the bottleneck. That is when you need to look at warp stall reasons, occupancy analysis, and instruction-level profiling — tools that complement rather than replace the roofline.
The Roofline Cannot Tell You About Tail Latency
For serving workloads, the p99 latency matters as much as throughput. A kernel that achieves 95% of peak throughput on average but occasionally stalls for 10x longer due to memory controller contention, thermal throttling, or OS interrupts will have a good roofline position but poor tail latency. The roofline model is strictly a throughput model and says nothing about latency distributions.
Putting It All Together: An LLM Serving Example
Consider optimizing a Llama-2 7B decode step on H100. The decode step involves, for each token:
- Embedding lookup: negligible compute, pure memory read
- 32 transformer layers, each containing:
- RMSNorm (LayerNorm variant): OI ~1.25 FLOP/byte
- Q/K/V projection (GEMM, B=1): OI ~1 FLOP/byte
- Decode attention: OI ~1 FLOP/byte
- Output projection (GEMM, B=1): OI ~1 FLOP/byte
- RMSNorm: OI ~1.25 FLOP/byte
- Gate + Up projection (GEMM, B=1): OI ~1 FLOP/byte
- SiLU activation: OI ~0.5 FLOP/byte
- Down projection (GEMM, B=1): OI ~1 FLOP/byte
- Final RMSNorm + LM head projection
At batch=1, every single operation is deep in the memory-bound zone. The roofline prescription is clear: reduce bytes. The optimization stack follows directly:
- Quantize weights to INT4: reduces GEMM bytes by 4x, pushing OI from ~1 to ~4 FLOP/byte. Still memory-bound, but 4x faster.
- Fuse RMSNorm + Q/K/V projection: eliminates the standalone RMSNorm memory traffic.
- Fuse SiLU + elementwise multiply: eliminates intermediate writes.
- Quantize KV cache to INT8: reduces decode attention bytes by 2x.
- Increase batch size to 32: pushes GEMM OI to ~32 FLOP/byte. Still memory-bound on H100 (ridge 296), but 32x better utilization.
- Increase batch size to 256+: pushes GEMM OI to ~256 FLOP/byte. Approaching the ridge point. GEMMs are now nearly compute-bound, and the investment in H100 tensor cores is finally justified.
This is the roofline model in action: it converts the abstract question “how do I make my LLM faster?” into a concrete, ordered list of optimizations with predictable impact.
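The OI progression through this optimization stack can be reproduced with the weight-dominated GEMM approximation from earlier, OI ≈ 2B divided by bytes per weight element (a sketch; the function name is mine):

```python
def weight_gemm_oi(batch, bytes_per_weight):
    """OI of a weight-dominated GEMM: ~2*B FLOPs per weight element loaded."""
    return 2 * batch / bytes_per_weight

baseline   = weight_gemm_oi(1, 2)      # FP16 weights, B=1:    1 FLOP/byte
int4       = weight_gemm_oi(1, 0.5)    # INT4 weights, B=1:    4 FLOP/byte
batched32  = weight_gemm_oi(32, 2)     # FP16 weights, B=32:  32 FLOP/byte
batched256 = weight_gemm_oi(256, 2)    # FP16 weights, B=256: 256 FLOP/byte
```

Each step moves the GEMM dot rightward on the roofline, exactly as the ordered list above prescribes.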
Summary
The roofline model is the most useful single tool for GPU kernel optimization. It tells you:
- Whether your kernel is limited by compute or memory bandwidth
- How far your kernel is from the hardware ceiling
- Which class of optimizations can close the gap
- When you have reached the hardware limit and need algorithmic changes
The key numbers to remember: operational intensity is FLOPs divided by bytes. The ridge point is peak compute divided by peak bandwidth. Below the ridge, optimize for bandwidth. Above it, optimize for compute. Far below both ceilings, look for latency, occupancy, or parallelism problems.
For LLM workloads specifically, the roofline explains why batch size, quantization, and operator fusion are the three most impactful optimizations — they all increase operational intensity, moving operations rightward on the roofline toward the region where the hardware’s massive compute throughput can actually be utilized.
Measure, classify, optimize, re-measure. That is the roofline workflow, and it eliminates the guesswork that wastes engineering time on optimizations that cannot possibly help.