Part of the series "CUDA Kernel Engineering" (23 of 32).

You spend a week hand-tuning tensor core usage in a LayerNorm kernel, improving FLOPS by 40%. Throughput increases by 2%. The kernel was memory-bound — it achieves 85% of peak bandwidth and 12% of peak compute. Every optimization hour spent on compute was wasted. The roofline model would have told you this in five minutes by plotting operational intensity against hardware ceilings. A kernel below the ridge point is memory-bound; optimize memory. A kernel above the ridge is compute-bound; optimize compute. For LLM decode, 90% of kernels are memory-bound. For LLM prefill, 60% are compute-bound. Know which you are optimizing before you start.

For LLM inference, most operations in the decode phase are memory-bound (small batch sizes, large weight matrices), while most operations in the prefill phase are compute-bound (large batch sizes, matrix multiplies dominate). This post develops the methodology for determining the bound, diagnosing the bottleneck, and choosing the optimization strategy.

The Roofline Model

Definition

The roofline model plots achievable performance (FLOPS) as a function of arithmetic intensity (FLOPS per byte of memory traffic):

\text{Performance} = \min\left(\text{Peak FLOPS}, \; \text{Memory Bandwidth} \times \text{Arithmetic Intensity}\right)

The arithmetic intensity I of a kernel is:

I = \frac{\text{Total floating-point operations}}{\text{Total bytes transferred between DRAM and compute units}}

Roofline for H100 SXM:

Peak FP16 Tensor Core:  990 TFLOPS
Peak INT8 Tensor Core: 1979 TFLOPS
HBM3 Bandwidth:        3.35 TB/s

Ridge point (FP16): 990e12 / 3.35e12 = 295 FLOPS/byte
Ridge point (INT8): 1979e12 / 3.35e12 = 591 FLOPS/byte

If AI < 295: kernel is memory-bound (FP16)
If AI > 295: kernel is compute-bound (FP16)

// The ridge point is EXTREMELY high on modern GPUs.
// This means most kernels are memory-bound unless they have
// very high arithmetic intensity (like large GEMMs).
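The roofline formula and the H100 ridge points above can be verified with a few lines of Python (a minimal sketch; the peak figures are the spec-sheet values quoted in the listing):

```python
def attainable_tflops(ai, peak_tflops=990.0, bandwidth_tbs=3.35):
    """Roofline: attainable performance (TFLOPS) at arithmetic intensity `ai` (FLOPS/byte)."""
    return min(peak_tflops, bandwidth_tbs * ai)

def ridge_point(peak_tflops=990.0, bandwidth_tbs=3.35):
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_tflops / bandwidth_tbs  # FLOPS/byte

print(ridge_point())            # ~295 FLOPS/byte for H100 FP16
print(attainable_tflops(1.0))   # 3.35 TFLOPS: deeply memory-bound
print(attainable_tflops(410))   # 990 TFLOPS: on the compute roof
```

Below the ridge, attainable FLOPS scale linearly with intensity; above it, the compute roof caps everything, which is exactly the two-regime diagnosis the article uses.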

Arithmetic Intensity of Common LLM Operations

# Calculating arithmetic intensity for LLM operations

def gemm_arithmetic_intensity(M, N, K, dtype_bytes):
    """
    GEMM: C[M,N] = A[M,K] * B[K,N]
    FLOPs: 2 * M * N * K (multiply-add)
    Bytes: (M*K + K*N + M*N) * dtype_bytes (read A, read B, write C)
    """
    flops = 2 * M * N * K
    bytes_transferred = (M * K + K * N + M * N) * dtype_bytes
    return flops / bytes_transferred

# LLM decode (batch_size=1, hidden_dim=4096):
# Linear projection: M=1, N=4096, K=4096, FP16
ai_decode = gemm_arithmetic_intensity(1, 4096, 4096, 2)
# FLOPs: 2 * 1 * 4096 * 4096 = 33.6M
# Bytes: (1*4096 + 4096*4096 + 1*4096) * 2 = 33.6M bytes
# AI = 33.6M / 33.6M = 1.0 FLOPS/byte  -> DEEPLY memory-bound

# LLM prefill (batch_size=512, hidden_dim=4096):
# Linear projection: M=512, N=4096, K=4096, FP16
ai_prefill = gemm_arithmetic_intensity(512, 4096, 4096, 2)
# FLOPs: 2 * 512 * 4096 * 4096 = 17.2G
# Bytes: (512*4096 + 4096*4096 + 512*4096) * 2 = 41.9M bytes
# AI = 17.2G / 41.9M = 410 FLOPS/byte  -> Compute-bound!

# For M << N, K the traffic is dominated by the weight matrix (K*N*dtype_bytes),
# so AI ≈ 2*M*N*K / (K*N*dtype_bytes) = 2*M / dtype_bytes.
# For FP16 (2 bytes): AI ≈ M, so the H100 FP16 ridge point (~295 FLOPS/byte)
# is crossed around M ≈ 300 rows.

def softmax_arithmetic_intensity(batch_size, seq_len, dtype_bytes):
    """
    Softmax over seq_len dimension:
    FLOPs: ~5 * batch_size * seq_len (max, sub, exp, sum, div)
    Bytes: 2 * batch_size * seq_len * dtype_bytes (read + write)
    """
    flops = 5 * batch_size * seq_len
    bytes_transferred = 2 * batch_size * seq_len * dtype_bytes
    return flops / bytes_transferred
    # AI = 5 / (2 * 2) = 1.25 FLOPS/byte for FP16
    # Softmax is ALWAYS memory-bound

def layernorm_arithmetic_intensity(batch_size, hidden_dim, dtype_bytes):
    """
    LayerNorm:
    FLOPs: ~8 * batch_size * hidden_dim (mean, variance, normalize, affine)
    Bytes: ~3 * batch_size * hidden_dim * dtype_bytes (read input, read gamma/beta, write output)
    """
    flops = 8 * batch_size * hidden_dim
    bytes_transferred = 3 * batch_size * hidden_dim * dtype_bytes
    return flops / bytes_transferred
    # AI = 8 / (3 * 2) = 1.33 FLOPS/byte for FP16
    # LayerNorm is ALWAYS memory-bound

def attention_arithmetic_intensity(batch_size, n_heads, seq_len, head_dim, dtype_bytes):
    """
    Self-attention (Q*K^T, softmax, attn*V):
    FLOPs: 4 * batch_size * n_heads * seq_len^2 * head_dim (2*MNK each for Q*K^T and attn*V)
    Bytes: batch_size * n_heads * (3*seq_len*head_dim + 2*seq_len^2 + seq_len*head_dim) * dtype_bytes
    (read Q, K, V; read/write attention scores; write output)
    """
    flops = 4 * batch_size * n_heads * seq_len * seq_len * head_dim
    bytes_transferred = batch_size * n_heads * (
        3 * seq_len * head_dim +  # Q, K, V
        2 * seq_len * seq_len +   # attention scores (read + write)
        seq_len * head_dim         # output
    ) * dtype_bytes
    return flops / bytes_transferred
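Plugging representative numbers into the attention formula shows why materialized attention never becomes strongly compute-bound. The function is restated below so the snippet runs standalone; the 32-head, head_dim-128 configuration is an illustrative Llama-7B-like assumption:

```python
def attention_arithmetic_intensity(batch_size, n_heads, seq_len, head_dim, dtype_bytes):
    # Same formula as in the listing above, restated for a self-contained example
    flops = 4 * batch_size * n_heads * seq_len * seq_len * head_dim
    bytes_transferred = batch_size * n_heads * (
        3 * seq_len * head_dim + 2 * seq_len * seq_len + seq_len * head_dim
    ) * dtype_bytes
    return flops / bytes_transferred

# Illustrative config: 32 heads, head_dim=128, FP16, scores materialized in HBM
for seq_len in (512, 4096, 32768):
    ai = attention_arithmetic_intensity(1, 32, seq_len, 128, 2)
    print(f"seq_len={seq_len}: AI = {ai:.1f} FLOPS/byte")
```

As seq_len grows, the seq_len² score traffic dominates both numerator and denominator, so AI saturates near head_dim (128 here for FP16) and stays below the 295 FLOPS/byte ridge. Avoiding that score traffic is precisely what FlashAttention-style tiling buys.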
Arithmetic Intensity of LLM Operations (FP16, d=4096)

| Operation             | M (batch*seq) | AI (FLOPS/byte) | Bound   | H100 Utilization   |
|-----------------------|---------------|-----------------|---------|--------------------|
| Linear (decode)       | 1             | 1.0             | Memory  | 0.3% of peak FLOPS |
| Linear (decode BS=8)  | 8             | 7.9             | Memory  | 2.7%               |
| Linear (decode BS=32) | 32            | 30.8            | Memory  | 10.4%              |
| Linear (prefill 512)  | 512           | 410             | Compute | 98% (near ridge)   |
| LayerNorm             | any           | 1.33            | Memory  | 0.5%               |
| Softmax               | any           | 1.25            | Memory  | 0.4%               |
| RoPE                  | any           | 2.5             | Memory  | 0.8%               |
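The utilization column follows directly from the roofline: a memory-bound kernel running at full bandwidth achieves bandwidth × AI FLOPS, so its best-case fraction of peak is AI divided by the ridge point. A quick check against the table (using the H100 FP16 figures from earlier):

```python
RIDGE_FP16 = 990e12 / 3.35e12  # ~295.5 FLOPS/byte

def flops_utilization(ai):
    """Best-case fraction of peak FLOPS for a kernel at intensity `ai`."""
    return min(1.0, ai / RIDGE_FP16)

print(f"{flops_utilization(1.0):.1%}")    # decode GEMV: ~0.3%
print(f"{flops_utilization(30.8):.1%}")   # decode BS=32: ~10.4%
print(f"{flops_utilization(410):.1%}")    # prefill: at the roofline ceiling
```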

Nsight Compute Metrics for Memory Analysis

Key Metrics to Examine

Nsight Compute (ncu) provides detailed metrics for diagnosing memory bottlenecks:

# Profile a kernel with the full metric set
# ncu --set full --kernel-name "gemv_kernel" ./my_program

# Key metrics and what they mean:

# dram__bytes_read.sum / dram__bytes_write.sum
#    Total bytes read/written from HBM
#    Compare with theoretical minimum to find wasted bandwidth

# lts__throughput.avg.pct_of_peak_sustained_elapsed
#    L2 cache throughput as a percentage of peak
#    High value (>80%) = memory subsystem is saturated

# sm__throughput.avg.pct_of_peak_sustained_elapsed
#    SM compute throughput as a percentage of peak
#    High value (>80%) = compute is saturated

# Memory-bound indicator:
#    If L2 throughput % > SM throughput %: memory-bound
#    If SM throughput % > L2 throughput %: compute-bound

# dram__sectors_read.sum / dram__sectors_write.sum
#    Actual DRAM sectors accessed (32 bytes each)
#    Divide by theoretical minimum to get memory efficiency

# l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum
#    Shared memory load wavefronts - indicates bank conflicts

# smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg
#    Average bytes utilized per sector for global loads
#    32 = perfect coalescing, <32 = wasted bandwidth
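As a worked interpretation of these counters, the two core heuristics (which pipe is saturated, and how much DRAM traffic was actually necessary) can be sketched in a few lines. The metric values below are hypothetical inputs, not ncu output:

```python
def classify_bound(l2_pct, sm_pct):
    """Crude Speed-of-Light-style classification from throughput percentages."""
    if l2_pct < 60 and sm_pct < 60:
        return "latency-bound"  # neither pipe saturated: check occupancy and stalls
    return "memory-bound" if l2_pct >= sm_pct else "compute-bound"

def memory_efficiency(dram_sectors, theoretical_min_bytes, sector_bytes=32):
    """Fraction of DRAM traffic that was actually necessary."""
    return theoretical_min_bytes / (dram_sectors * sector_bytes)

print(classify_bound(85, 20))                    # memory-bound
print(memory_efficiency(2_000_000, 32_000_000))  # 0.5 -> half the fetched bytes wasted
```

An efficiency well below 1.0 points at coalescing or over-fetch problems before any algorithmic change is worth considering.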

Diagnosing Uncoalesced Access

// Example: Diagnosing memory access patterns with ncu output

// Kernel with coalesced access (good):
__global__ void coalesced_load(float* data, float* output, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        output[idx] = data[idx] * 2.0f;
    }
}
// ncu metric: smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg = 32
// All 32 bytes of each sector are utilized

// Kernel with strided access (bad):
__global__ void strided_load(float* data, float* output, int N, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int src_idx = idx * stride;  // Stride > 1 causes uncoalesced access
    if (src_idx < N) {
        output[idx] = data[src_idx] * 2.0f;
    }
}
// For stride=2:
// ncu metric: smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg = 16
// Only 50% of fetched bytes are used (2x bandwidth waste)

// For stride=32:
// ncu metric: smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg = 4
// Only 12.5% of fetched bytes are used (8x bandwidth waste)

// Kernel with random access (worst):
__global__ void random_load(float* data, float* output, int* indices, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        output[idx] = data[indices[idx]] * 2.0f;  // Random indices
    }
}
// ncu metric: smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.avg = 4-8
// Each thread fetches a full 32-byte sector but uses only 4 bytes
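The strided numbers above follow from a simple model: a 32-byte sector holds eight 4-byte elements, but a strided warp touches only every stride-th one. A sketch of that model (assumes power-of-two strides and element sizes, matching the examples above):

```python
def bytes_per_sector(elem_bytes, stride):
    """Useful bytes per 32-byte DRAM sector for a power-of-two strided access."""
    elems_per_sector = 32 // elem_bytes        # 8 floats per sector
    used = max(1, elems_per_sector // stride)  # at least one element is touched
    return used * elem_bytes

print(bytes_per_sector(4, 1))   # 32: perfect coalescing
print(bytes_per_sector(4, 2))   # 16: 2x bandwidth waste
print(bytes_per_sector(4, 32))  # 4: 8x bandwidth waste
```

The output matches the ncu figures quoted for the coalesced, stride-2, and stride-32 kernels.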
Note: the single most impactful optimization for memory-bound kernels is improving coalescing. Going from 4 bytes/sector to 32 bytes/sector is an 8x improvement in effective memory bandwidth, which translates directly into an 8x speedup for a purely memory-bound kernel. Always check the bytes-per-sector metric first when profiling a memory-bound kernel.

L2 Cache Behavior and Optimization

L2 Cache Residency Control

The H100 has 50 MB of L2 cache. For LLM inference, if the weight matrix fits in L2, subsequent accesses are served from cache at much higher bandwidth:

// L2 cache residency hints (CUDA 11.x+, Ampere+)

// Persist data in L2 (higher priority for caching)
void set_l2_persistence(cudaStream_t stream, void* ptr, size_t size) {
    cudaStreamAttrValue stream_attr = {};
    stream_attr.accessPolicyWindow.base_ptr = ptr;
    stream_attr.accessPolicyWindow.num_bytes = size;
    stream_attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
    stream_attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
    stream_attr.accessPolicyWindow.hitRatio = 1.0f;  // 100% of accesses should persist

    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                           &stream_attr);
}

// Streaming access (do not cache, save L2 for other data)
void set_l2_streaming(cudaStream_t stream, void* ptr, size_t size) {
    cudaStreamAttrValue stream_attr = {};
    stream_attr.accessPolicyWindow.base_ptr = ptr;
    stream_attr.accessPolicyWindow.num_bytes = size;
    stream_attr.accessPolicyWindow.hitProp = cudaAccessPropertyStreaming;
    stream_attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
    stream_attr.accessPolicyWindow.hitRatio = 0.0f;

    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                           &stream_attr);
}

// Application: LLM decode GEMV
// Weight matrix: 4096 x 4096 x 2 bytes = 32 MB
// H100 L2: 50 MB -> weight matrix fits!
// If we persist the weight in L2, repeated accesses (across batch)
// are served at L2 bandwidth (~12 TB/s) instead of HBM (3.35 TB/s)

// But: the weight changes every layer, so L2 persistence only helps
// if the same weight is accessed multiple times (e.g., batch>1)
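Whether a given projection's weights fit in L2 is a one-line check worth running before reaching for the residency API (50 MB L2 per the H100 figure above; the matrix shapes are the examples from this section):

```python
def weight_fits_l2(rows, cols, dtype_bytes, l2_bytes=50 * 1024**2):
    """True if a rows x cols weight matrix fits entirely in L2."""
    return rows * cols * dtype_bytes <= l2_bytes

print(weight_fits_l2(4096, 4096, 2))  # True:  32 MB fits in 50 MB
print(weight_fits_l2(8192, 8192, 2))  # False: 128 MB must stream
```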

L2 Cache Hit Rate Analysis

# Estimating L2 cache effectiveness for different workloads

def l2_analysis(weight_size_mb, activation_size_mb, l2_size_mb=50):
    """
    Estimate L2 hit rate for GEMM operations.
    """
    total_data = weight_size_mb + activation_size_mb

    if total_data <= l2_size_mb:
        # Everything fits in L2
        # First access: compulsory miss (HBM bandwidth)
        # Subsequent accesses: L2 hit (12 TB/s)
        return "full_cache", 1.0

    elif weight_size_mb <= l2_size_mb:
        # Weights fit, activations stream through
        # Weight reuse across batch dimension hits L2
        weight_hit_rate = min(1.0, l2_size_mb / weight_size_mb)
        return "weight_cached", weight_hit_rate

    else:
        # Nothing fits: streaming access
        # L2 only helps with spatial locality within tiles
        tile_hit_rate = min(1.0, l2_size_mb / total_data)
        return "streaming", tile_hit_rate

# Llama-7B linear projection (4096 x 4096, FP16):
# Weight: 32 MB, Activation (BS=1): 8 KB
# Total: 32 MB < 50 MB -> full cache!
# Effective bandwidth: HBM on first access, L2 on reuse

# Llama-70B linear projection (8192 x 8192, FP16):
# Weight: 128 MB >> 50 MB -> streaming
# L2 can only cache tiles, not the full weight

# Implication: for 7B models at BS=1, L2 caching of weights
# can provide 2-3x effective bandwidth improvement if the same
# weight is accessed by multiple operations (e.g., GQA where
# K,V projections share key/value weights across heads)
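The "2-3x effective bandwidth" claim can be sanity-checked with an amortized model: the first pass reads from HBM, subsequent reuses hit L2. The bandwidth figures are the ones quoted above (~3.35 TB/s HBM, ~12 TB/s L2); the reuse count is a hypothetical example:

```python
def amortized_bandwidth_tbs(reuses, hbm_tbs=3.35, l2_tbs=12.0):
    """Average bandwidth when one HBM read is followed by (reuses - 1) L2 hits."""
    time_per_byte = 1.0 / hbm_tbs + (reuses - 1) / l2_tbs
    return reuses / time_per_byte

print(f"{amortized_bandwidth_tbs(1):.2f} TB/s")  # 3.35: no reuse, pure HBM
print(f"{amortized_bandwidth_tbs(4):.2f} TB/s")  # ~7.3: roughly 2.2x HBM
```

Even modest reuse lands in the 2-3x range, consistent with the estimate above; with no reuse, L2 persistence buys nothing.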

Vectorized Memory Access

128-bit Loads for Maximum Bandwidth

Each DRAM transaction services a 128-byte cache line, split into 32-byte sectors. With 128-bit (16-byte) vector loads, a warp of 32 threads requests 32 × 16 = 512 contiguous bytes, exactly four fully utilized cache lines, so no fetched byte is wasted:

// Vectorized load: float4 = 128 bits = 16 bytes per load
__global__ void vectorized_elementwise(
    const float4* __restrict__ input,
    float4* __restrict__ output,
    int N_float4  // N/4
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N_float4) {
        float4 val = input[idx];
        // Process 4 floats at once
        val.x = val.x * 2.0f + 1.0f;
        val.y = val.y * 2.0f + 1.0f;
        val.z = val.z * 2.0f + 1.0f;
        val.w = val.w * 2.0f + 1.0f;
        output[idx] = val;
    }
}

// For FP16: use half2 (32-bit) or load as uint4 (128-bit = 8 FP16 values)
__global__ void vectorized_fp16(
    const uint4* __restrict__ input,  // 128-bit = 8 half values
    uint4* __restrict__ output,
    int N_uint4
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N_uint4) {
        uint4 raw = input[idx];
        half2* vals = reinterpret_cast<half2*>(&raw);

        // Process 4 half2 pairs (8 FP16 values)
        for (int i = 0; i < 4; i++) {
            vals[i] = __hmul2(vals[i], __float2half2_rn(2.0f));
        }

        output[idx] = raw;
    }
}

// Bandwidth comparison (H100, 1 billion FP16 elements):
// Scalar load (2 bytes per thread):   achieved 2.1 TB/s (63% of peak)
// half2 load (4 bytes per thread):    achieved 2.8 TB/s (84% of peak)
// uint4 load (16 bytes per thread):   achieved 3.2 TB/s (96% of peak)
// Vectorized loads are essential for saturating HBM bandwidth

HBM Bandwidth Utilization by Access Pattern (H100, elementwise kernel)

| Access pattern              | Achieved bandwidth | Utilization (peak = 3.35 TB/s) |
|-----------------------------|--------------------|--------------------------------|
| Scalar FP16 (2B/thread)     | 2.1 TB/s           | 63%                            |
| half2 (4B/thread)           | 2.8 TB/s           | 84%                            |
| float4 / uint4 (16B/thread) | 3.2 TB/s           | 96%                            |
| Uncoalesced stride-2        | 1.5 TB/s           | 45%                            |
| Random access               | 0.4 TB/s           | 12%                            |
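For a purely memory-bound elementwise kernel, these utilizations translate directly into runtime. For the 1-billion-FP16-element example (read input plus write output, 4 GB of mandatory traffic):

```python
def elementwise_runtime_ms(n_elems, dtype_bytes, achieved_tbs):
    """Runtime of a memory-bound elementwise kernel at a given achieved bandwidth."""
    traffic = 2 * n_elems * dtype_bytes  # read input + write output
    return traffic / (achieved_tbs * 1e12) * 1e3

N = 1_000_000_000
print(f"{elementwise_runtime_ms(N, 2, 3.2):.2f} ms")  # uint4 loads:   ~1.25 ms
print(f"{elementwise_runtime_ms(N, 2, 0.4):.1f} ms")  # random access: ~10 ms
```

The 8x spread between the best and worst access pattern is the same 8x that shows up in end-to-end latency, with no change to the arithmetic at all.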

The Optimization Strategy Decision Tree

Based on the roofline analysis, follow this decision tree:

def optimization_strategy(kernel_metrics):
    """
    Given Nsight Compute metrics, determine the optimization strategy.
    """
    ai = kernel_metrics["arithmetic_intensity"]
    ridge_point = kernel_metrics["peak_flops"] / kernel_metrics["peak_bandwidth"]

    if ai < ridge_point * 0.5:
        # Clearly memory-bound
        return memory_bound_strategy(kernel_metrics)
    elif ai > ridge_point * 1.5:
        # Clearly compute-bound
        return compute_bound_strategy(kernel_metrics)
    else:
        # Near ridge point: both matter
        return balanced_strategy(kernel_metrics)

def memory_bound_strategy(metrics):
    """Optimization priorities for memory-bound kernels."""
    optimizations = []

    # Priority 1: Check coalescing
    bytes_per_sector = metrics["bytes_per_sector_global_load"]
    if bytes_per_sector < 28:  # Less than 87.5% utilization
        optimizations.append("FIX COALESCING: current {:.0f}/32 bytes per sector".format(
            bytes_per_sector))

    # Priority 2: Check vectorization
    if not metrics["uses_vector_loads"]:
        optimizations.append("ADD VECTORIZED LOADS: use float4/uint4 for 128-bit access")

    # Priority 3: Check L2 hit rate
    l2_hit_rate = metrics["l2_hit_rate"]
    if l2_hit_rate < 0.3 and metrics["data_fits_l2"]:
        optimizations.append("IMPROVE L2 RESIDENCY: data fits in L2 but hit rate is low")

    # Priority 4: Reduce memory traffic
    optimizations.append("CONSIDER DATA COMPRESSION: quantize activations, use INT8/FP8")
    optimizations.append("CONSIDER KERNEL FUSION: eliminate intermediate writes")

    # Priority 5: Overlap
    optimizations.append("OVERLAP COMPUTE AND MEMORY: software pipelining in shared memory")

    return optimizations

def compute_bound_strategy(metrics):
    """Optimization priorities for compute-bound kernels."""
    optimizations = []

    # Priority 1: Tensor core utilization
    tc_utilization = metrics.get("tensor_core_utilization", 0)
    if tc_utilization < 0.8:
        optimizations.append("IMPROVE TENSOR CORE UTILIZATION: check tile alignment")

    # Priority 2: Occupancy
    occupancy = metrics["achieved_occupancy"]
    if occupancy < 0.5:
        optimizations.append("INCREASE OCCUPANCY: reduce register/shared memory pressure")

    # Priority 3: Instruction mix
    optimizations.append("REDUCE NON-MATH INSTRUCTIONS: minimize address computation, branches")

    return optimizations
Optimization Strategy by LLM Operation

| Operation                  | Typical AI | Bound    | Primary Optimization                        |
|----------------------------|------------|----------|---------------------------------------------|
| Decode GEMV (BS=1)         | 1.0        | Memory   | Vectorized loads, weight compression (INT4) |
| Decode GEMV (BS=32)        | 30         | Memory   | L2 caching, batch coalescing                |
| Prefill GEMM (BS=512)      | 410        | Compute  | Tensor core tile tuning, occupancy          |
| LayerNorm                  | 1.3        | Memory   | Fuse with adjacent ops, vectorize           |
| Softmax                    | 1.25       | Memory   | Online softmax, fuse with attention         |
| RoPE                       | 2.5        | Memory   | Fuse into QKV projection epilogue           |
| Flash Attention (long seq) | ~50-200    | Balanced | Tile size tuning, shared mem staging        |

Worked Example: Optimizing a GEMV Kernel

Starting from a naive GEMV and applying the roofline-guided optimization strategy:

// Step 0: Naive GEMV - y[M] = W[M,K] * x[K]
__global__ void gemv_naive(
    const half* W, const half* x, half* y, int M, int K
) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M) return;

    float sum = 0.0f;
    for (int k = 0; k < K; k++) {
        sum += __half2float(W[row * K + k]) * __half2float(x[k]);
    }
    y[row] = __float2half(sum);
}
// Problem: each thread reads one row of W sequentially
// Access pattern: coalesced along K (good for row-major W)
// But: no vectorization, no shared memory for x reuse
// Measured: 1.2 TB/s (36% of peak) for M=4096, K=4096

// Step 1: Vectorized loads (from roofline strategy priority 2)
__global__ void gemv_vectorized(
    const uint4* W, const half* x, half* y, int M, int K
) {
    int row = blockIdx.x;
    int tid = threadIdx.x;  // 256 threads per block
    int K_vec = K / 8;      // 8 FP16 per uint4

    float sum = 0.0f;
    for (int k = tid; k < K_vec; k += blockDim.x) {
        uint4 w_vec = W[row * K_vec + k];
        half2* w = reinterpret_cast<half2*>(&w_vec);

        // Load corresponding x values
        // x is small (4096 * 2 = 8 KB), fits in L1 cache after first access
        half2 x_vals[4];
        for (int i = 0; i < 4; i++) {
            x_vals[i] = *reinterpret_cast<const half2*>(&x[k * 8 + i * 2]);
        }

        // Dot product of 8 elements
        for (int i = 0; i < 4; i++) {
            half2 prod = __hmul2(w[i], x_vals[i]);
            sum += __half2float(prod.x) + __half2float(prod.y);
        }
    }

    // Warp reduction
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }

    // Block reduction
    __shared__ float warp_sums[8];  // 256/32 = 8 warps
    if (threadIdx.x % 32 == 0) {
        warp_sums[threadIdx.x / 32] = sum;
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < 8; i++) total += warp_sums[i];
        y[row] = __float2half(total);
    }
}
// Measured: 2.9 TB/s (87% of peak) for M=4096, K=4096
// Vectorization alone provided 2.4x speedup

// Step 2: Shared memory caching of x (priority 3: L2/L1 optimization)
// x is broadcast to all rows -> cache it in shared memory
__global__ void gemv_smem_x(
    const uint4* W, const half* x, half* y, int M, int K
) {
    extern __shared__ half x_shared[];
    int row = blockIdx.x;
    int tid = threadIdx.x;

    // Cooperatively load x into shared memory
    for (int i = tid; i < K; i += blockDim.x) {
        x_shared[i] = x[i];
    }
    __syncthreads();

    // Now access x from shared memory: guaranteed on-chip, immune to cache eviction
    int K_vec = K / 8;
    float sum = 0.0f;
    for (int k = tid; k < K_vec; k += blockDim.x) {
        uint4 w_vec = W[row * K_vec + k];
        half2* w = reinterpret_cast<half2*>(&w_vec);
        half2* xs = reinterpret_cast<half2*>(&x_shared[k * 8]);

        for (int i = 0; i < 4; i++) {
            half2 prod = __hmul2(w[i], xs[i]);
            sum += __half2float(prod.x) + __half2float(prod.y);
        }
    }

    // ... reduction same as above ...
}
// Measured: 3.1 TB/s (93% of peak) for M=4096, K=4096
// Marginal improvement over vectorized (x was already L1-cached)
// But helps when x is larger or accessed with poor spatial locality
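A useful closing check on this worked example: the roofline lower bound on the GEMV's runtime comes straight from its mandatory traffic (read W and x, write y), so the measured bandwidths map to absolute times:

```python
def gemv_runtime_us(M, K, dtype_bytes, bandwidth_tbs):
    """Runtime of a bandwidth-limited GEMV y[M] = W[M,K] * x[K]."""
    traffic = (M * K + K + M) * dtype_bytes  # read W and x, write y
    return traffic / (bandwidth_tbs * 1e12) * 1e6

M = K = 4096
print(f"roofline bound: {gemv_runtime_us(M, K, 2, 3.35):.1f} us")  # ~10.0 us
print(f"naive kernel:   {gemv_runtime_us(M, K, 2, 1.2):.1f} us")   # ~28.0 us
print(f"smem variant:   {gemv_runtime_us(M, K, 2, 3.1):.1f} us")   # ~10.8 us
```

The optimized kernel sits within about 8% of the hard floor, which is the signal to stop optimizing this kernel and move on.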

GEMV Optimization Steps (M=4096, K=4096, FP16, H100)

| Variant                  | Effective bandwidth | % of 3.35 TB/s peak |
|--------------------------|---------------------|---------------------|
| Naive (scalar loads)     | 1.2 TB/s            | 36%                 |
| Vectorized (uint4 loads) | 2.9 TB/s            | 87%                 |
| Vectorized + smem x      | 3.1 TB/s            | 93%                 |
| H100 HBM3 peak           | 3.35 TB/s           | theoretical max     |

Summary

The roofline model determines whether a kernel is compute-bound or memory-bound by comparing its arithmetic intensity to the hardware’s ridge point (peak FLOPS / peak bandwidth). For H100 FP16, the ridge point is 295 FLOPS/byte, meaning most LLM decode operations (AI of 1-30) are deeply memory-bound, while prefill GEMMs (AI of 200-500) are compute-bound. The optimization strategy follows directly: for memory-bound kernels, optimize coalescing, use vectorized 128-bit loads, improve L2 cache residency, and reduce data volume through quantization or fusion. For compute-bound kernels, maximize tensor core utilization, tune tile sizes for occupancy, and minimize non-math instructions. Nsight Compute provides the metrics (bytes-per-sector, L2 throughput, SM throughput) needed to diagnose the specific bottleneck and track optimization progress.