Part 8 of 32 in the series: CUDA Kernel Engineering

You optimize a CUDA kernel for two hours, improve occupancy from 50% to 100%, and throughput stays flat. You spent two hours optimizing the wrong thing. Nsight Compute would have told you in 30 seconds that the kernel is memory-bound at 95% of peak bandwidth — occupancy is irrelevant. The difference between guessing and profiling is the difference between random changes and systematic improvement. Nsight Systems identifies which kernels consume wall-clock time. Nsight Compute identifies why those kernels are slow. Used together in the correct order, they reduce optimization time from days to hours and eliminate wasted effort on irrelevant metrics.

This post covers the complete profiling workflow: tool selection, capture methodology, metric interpretation, and a worked example optimizing a transformer layer from 40% to 85% of the speed-of-light (SOL) target.

Tool Selection: Nsight Systems vs Nsight Compute

When to Use Which

Nsight Systems vs Nsight Compute: Feature Comparison

| Capability | Nsight Systems | Nsight Compute |
|---|---|---|
| Scope | Entire application | Single kernel |
| Timeline view | Yes (CPU + GPU + NCCL) | No |
| CPU profiling | Yes (call stacks) | No |
| GPU kernel timing | Yes (wall-clock) | Yes (cycle-accurate) |
| Memory throughput | No | Yes (L1, L2, HBM breakdown) |
| Compute utilization | No | Yes (per-pipe: FMA, tensor, SFU) |
| Roofline analysis | No | Yes |
| Occupancy analysis | No | Yes (theoretical + achieved) |
| Stall reasons | No | Yes (warp stall breakdown) |
| Overhead | Low (1-5%) | High (10-100x per kernel) |
| NCCL / multi-GPU | Yes | No |
Note: start with Nsight Systems to find which kernels are slow; then run Nsight Compute on those specific kernels.

The Profiling Workflow

1. Nsight Systems: Capture full timeline
   -> Identify: Which kernels take the most time?
   -> Identify: Are there CPU/GPU gaps (GPU idle)?
   -> Identify: Is NCCL communication overlapped?

2. Nsight Compute: Profile the top-N slowest kernels
   -> For each kernel: Is it compute-bound or memory-bound?
   -> If memory-bound: What is the achieved bandwidth?
   -> If compute-bound: What is the tensor core utilization?
   -> What are the stall reasons?

3. Fix: Apply targeted optimization
   -> Memory-bound: improve coalescing, reduce bank conflicts
   -> Compute-bound: increase occupancy, improve instruction mix
   -> CPU-bound: reduce kernel launch overhead, use CUDA graphs

4. Verify: Re-profile to confirm improvement
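The triage in steps 2-3 can be sketched as a small helper. This is a rule-of-thumb sketch: the 50%/80% thresholds follow the conventions used later in this post, not hard hardware limits.

```python
def triage_kernel(sm_sol_pct: float, dram_sol_pct: float) -> str:
    """Map the two top-level Speed-of-Light percentages from Nsight
    Compute to a likely bottleneck and the matching fix category."""
    if sm_sol_pct > 80:
        return "compute-bound: raise tensor core utilization, tune tiles"
    if dram_sol_pct > 80:
        return "memory-bound: improve coalescing, reduce traffic"
    if sm_sol_pct < 50 and dram_sol_pct < 50:
        return "latency/launch-bound: check occupancy, stalls, CPU gaps"
    return "mixed: inspect the warp stall breakdown"

print(triage_kernel(sm_sol_pct=30, dram_sol_pct=95))
# memory-bound: improve coalescing, reduce traffic
```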

Nsight Systems: Capturing and Reading Timelines

Capture Commands

# Basic capture: profile entire application
nsys profile --trace=cuda,nvtx,osrt \
    --output=transformer_profile \
    python inference.py

# Capture with GPU metrics (SM activity, memory throughput)
nsys profile --trace=cuda,nvtx,osrt \
    --gpu-metrics-device=all \
    --output=transformer_profile_gpu \
    python inference.py

# Capture with NCCL communication (multi-GPU)
nsys profile --trace=cuda,nvtx,osrt,nccl \
    --output=multi_gpu_profile \
    torchrun --nproc_per_node=8 inference.py

# Delayed capture (skip warmup, capture only steady state)
nsys profile --trace=cuda,nvtx \
    --delay=5 --duration=10 \
    --output=steady_state \
    python server.py

# Export to JSON for programmatic analysis
nsys stats --format=json transformer_profile.nsys-rep \
    > profile_stats.json
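For scripted triage, the exported per-kernel summary can be ranked in Python. A sketch, assuming the `cuda_gpu_kern_sum` report and its `"Total Time (ns)"` / `"Name"` CSV columns — both vary across nsys versions, so check `nsys stats --help-reports` locally:

```python
import csv
import io

def top_kernels_from_nsys_csv(csv_text, n=5):
    """Rank kernels by total GPU time from an `nsys stats` CSV export.

    Assumes columns named "Total Time (ns)" and "Name", as produced by
    the cuda_gpu_kern_sum report -- verify against your nsys version.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: int(r["Total Time (ns)"]), reverse=True)
    return [(r["Name"], int(r["Total Time (ns)"])) for r in rows[:n]]

# Produce the input with, e.g.:
#   nsys stats --report cuda_gpu_kern_sum --format csv \
#       transformer_profile.nsys-rep > kernels.csv
```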

NVTX Annotations for Context

Add NVTX markers to your code so the Nsight Systems timeline shows meaningful labels:

import torch
import nvtx  # pip install nvtx

class ProfiledTransformerLayer:
    """Transformer layer with NVTX annotations for Nsight Systems."""

    def __init__(self, layer, layer_idx):
        self.layer = layer
        self.layer_idx = layer_idx

    @nvtx.annotate("layer_forward", color="blue")
    def forward(self, hidden_states, kv_cache=None):
        with nvtx.annotate(f"L{self.layer_idx}_rmsnorm_1", color="green"):
            residual = hidden_states
            hidden_states = self.layer.input_layernorm(hidden_states)

        with nvtx.annotate(f"L{self.layer_idx}_attention", color="red"):
            with nvtx.annotate("qkv_proj", color="orange"):
                qkv = self.layer.self_attn.qkv_proj(hidden_states)
                # Split the fused projection; v bypasses RoPE
                q, k, v = qkv.chunk(3, dim=-1)

            with nvtx.annotate("rope", color="yellow"):
                q, k = apply_rotary_pos_emb(q, k)

            with nvtx.annotate("flash_attn", color="red"):
                attn_out = flash_attention(q, k, v, kv_cache)

            with nvtx.annotate("o_proj", color="orange"):
                attn_out = self.layer.self_attn.o_proj(attn_out)

        with nvtx.annotate(f"L{self.layer_idx}_residual_1", color="gray"):
            hidden_states = residual + attn_out

        with nvtx.annotate(f"L{self.layer_idx}_rmsnorm_2", color="green"):
            residual = hidden_states
            hidden_states = self.layer.post_attention_layernorm(hidden_states)

        with nvtx.annotate(f"L{self.layer_idx}_mlp", color="purple"):
            with nvtx.annotate("gate_up_proj", color="pink"):
                gate = self.layer.mlp.gate_proj(hidden_states)
                up = self.layer.mlp.up_proj(hidden_states)
            with nvtx.annotate("silu_mul", color="pink"):
                hidden_states = torch.nn.functional.silu(gate) * up
            with nvtx.annotate("down_proj", color="pink"):
                hidden_states = self.layer.mlp.down_proj(hidden_states)

        with nvtx.annotate(f"L{self.layer_idx}_residual_2", color="gray"):
            hidden_states = residual + hidden_states

        return hidden_states
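If you would rather not add the `nvtx` package as a dependency, PyTorch ships equivalent markers under `torch.cuda.nvtx` (push/pop ranges; this API does not take colors). A minimal sketch:

```python
import torch

def annotated_norm(layer, hidden_states):
    """Wrap one op in an NVTX range using PyTorch's built-in bindings.

    range_push/range_pop nest, so nested calls produce the same
    hierarchical rows in the Nsight Systems timeline.
    """
    torch.cuda.nvtx.range_push("rmsnorm_1")
    try:
        return layer.input_layernorm(hidden_states)
    finally:
        torch.cuda.nvtx.range_pop()  # always pop, even on exception
```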

Reading the Timeline

def interpret_nsys_timeline():
    """Key things to look for in a Nsight Systems timeline."""
    checks = {
        "GPU Idle Gaps": {
            "what": "Periods where no GPU kernel is running",
            "cause": "CPU overhead: Python, framework scheduling, "
                     "kernel launch latency",
            "fix": "CUDA graphs, reduce Python overhead, "
                   "torch.compile, larger batches",
        },
        "Kernel Serialization": {
            "what": "Kernels running one after another on the same stream "
                    "when they could overlap",
            "cause": "All kernels on default stream, or unnecessary "
                     "synchronization",
            "fix": "Use multiple CUDA streams for independent operations",
        },
        "NCCL Gaps": {
            "what": "GPU idle while waiting for NCCL collective",
            "cause": "Communication not overlapped with computation",
            "fix": "Pipeline parallelism, overlap allreduce with "
                   "next layer's computation",
        },
        "Small Kernels": {
            "what": "Many tiny kernels (under 10 us each)",
            "cause": "Unfused operations, each launching a separate kernel",
            "fix": "Kernel fusion (torch.compile, custom kernels)",
        },
        "Memory Transfer": {
            "what": "cudaMemcpy operations visible in timeline",
            "cause": "CPU-GPU data transfer (model loading, KV cache swap)",
            "fix": "Pinned memory, async transfers, prefetch",
        },
    }

    for name, info in checks.items():
        print(f"\n--- {name} ---")
        for key, val in info.items():
            print(f"  {key}: {val}")

Nsight Compute: Per-Kernel Deep Analysis

Capture Commands

# Profile a specific kernel by exact name (use a regex: prefix for patterns)
ncu --kernel-name "flash_fwd_kernel" \
    --set full \
    --output flash_attn_profile \
    python inference.py

# Profile all kernels (warning: very slow for large models)
ncu --set full \
    --output all_kernels \
    python inference.py

# Profile with roofline data
ncu --set roofline \
    --kernel-name "regex:gemm" \
    --output gemm_roofline \
    python inference.py

# Limit number of kernel launches profiled
ncu --kernel-name "regex:.*" \
    --launch-count 5 \
    --set full \
    --output quick_profile \
    python inference.py

# Compare two kernel implementations
ncu --set full --output kernel_v1 python run_v1.py
ncu --set full --output kernel_v2 python run_v2.py
# Open both in Nsight Compute GUI for side-by-side comparison

Key Metrics and What They Mean

def ncu_metric_interpretation():
    """Key Nsight Compute metrics and their interpretation."""
    metrics = {
        "SOL (Speed of Light)": {
            "section": "GPU Speed of Light Throughput",
            "metrics": {
                "sm__throughput.avg.pct_of_peak_sustained_elapsed":
                    "Compute throughput as % of peak. "
                    ">80% = compute-bound, <50% = likely memory-bound",
                "gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed":
                    "DRAM (HBM) throughput as % of peak. "
                    ">80% = memory-bound, <50% = other bottleneck",
            },
        },
        "Memory Workload Analysis": {
            "section": "Memory Workload Analysis",
            "metrics": {
                "l1tex__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_ld.sum":
                    "Bytes loaded from L2 to L1 (L1 misses). "
                    "Compare to ideal (coalesced) to find inefficiency",
                "dram__bytes_read.sum":
                    "Total HBM bytes read. Compare to "
                    "theoretical minimum for the algorithm",
                "l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum":
                    "Shared memory bank conflicts. "
                    "Non-zero = shared memory access pattern issue",
            },
        },
        "Compute Workload Analysis": {
            "section": "Compute Workload Analysis",
            "metrics": {
                "sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_elapsed":
                    "Tensor core utilization. For GEMM kernels, "
                    "this should be >60%",
                "sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_elapsed":
                    "FMA pipe utilization. High = CUDA core compute-bound",
            },
        },
        "Occupancy": {
            "section": "Occupancy",
            "metrics": {
                "sm__warps_active.avg.pct_of_peak_sustained_active":
                    "Achieved occupancy. Low occupancy = "
                    "not enough warps to hide latency",
                "launch__occupancy_limit_registers":
                    "Occupancy limited by register usage. "
                    "Consider --maxrregcount",
                "launch__occupancy_limit_shared_mem":
                    "Occupancy limited by shared memory. "
                    "Consider reducing tile size",
            },
        },
        "Warp Stalls": {
            "section": "Warp State Statistics",
            "metrics": {
                "smsp__warps_issue_stalled_long_scoreboard_per_warp_active.ratio":
                    "Stalled waiting for global memory (L2/HBM). "
                    "High = memory latency issue",
                "smsp__warps_issue_stalled_math_pipe_throttle_per_warp_active.ratio":
                    "Stalled because math pipes are full. "
                    "High = compute-bound (good!)",
                "smsp__warps_issue_stalled_wait_per_warp_active.ratio":
                    "Stalled waiting for synchronization "
                    "(__syncthreads, barrier). High = load imbalance",
                "smsp__warps_issue_stalled_short_scoreboard_per_warp_active.ratio":
                    "Stalled on shared memory or L1 cache. "
                    "High = shared memory latency (bank conflicts?)",
            },
        },
    }

    for section_name, section_data in metrics.items():
        print(f"\n=== {section_name} ===")
        print(f"  Section: {section_data['section']}")
        for metric, desc in section_data['metrics'].items():
            print(f"  {metric}:")
            print(f"    {desc}")

The Roofline Model in Nsight Compute

Interpreting the Roofline Plot

The roofline model plots a kernel’s position based on its arithmetic intensity (FLOPs per byte of memory traffic) and achieved performance (FLOPS or bytes/sec).

def roofline_analysis(kernel_flops, kernel_bytes, kernel_time_s,
                       peak_flops=990e12, peak_bw=3350e9):
    """Analyze a kernel's position on the roofline model.

    Args:
        kernel_flops: Total FLOPs executed by the kernel
        kernel_bytes: Total bytes transferred (HBM) by the kernel
        kernel_time_s: Kernel execution time in seconds
        peak_flops: GPU peak FLOPS (H100: 990 TFLOPS FP16)
        peak_bw: GPU peak HBM bandwidth (H100: 3350 GB/s)
    """
    # Arithmetic intensity: FLOPs per byte
    arith_intensity = kernel_flops / kernel_bytes

    # Achieved performance
    achieved_flops = kernel_flops / kernel_time_s
    achieved_bw = kernel_bytes / kernel_time_s

    # Ridge point: where memory-bound transitions to compute-bound
    ridge_point = peak_flops / peak_bw  # FLOPs/byte

    # Roofline ceiling at this arithmetic intensity
    roofline_ceiling = min(peak_flops, peak_bw * arith_intensity)

    # SOL (speed of light) percentage
    sol_compute = (achieved_flops / peak_flops) * 100
    sol_memory = (achieved_bw / peak_bw) * 100

    # Classification
    if arith_intensity < ridge_point:
        classification = "MEMORY-BOUND"
        sol_target = sol_memory
        limiting = f"HBM bandwidth ({achieved_bw/1e9:.0f} / "
        limiting += f"{peak_bw/1e9:.0f} GB/s = {sol_memory:.1f}%)"
    else:
        classification = "COMPUTE-BOUND"
        sol_target = sol_compute
        limiting = f"Compute ({achieved_flops/1e12:.0f} / "
        limiting += f"{peak_flops/1e12:.0f} TFLOPS = {sol_compute:.1f}%)"

    print(f"=== Roofline Analysis ===")
    print(f"Arithmetic intensity: {arith_intensity:.1f} FLOPs/byte")
    print(f"Ridge point: {ridge_point:.1f} FLOPs/byte")
    print(f"Classification: {classification}")
    print(f"Limiting resource: {limiting}")
    print(f"Achieved vs ceiling: "
          f"{achieved_flops/1e12:.1f} / {roofline_ceiling/1e12:.1f} TFLOPS "
          f"= {(achieved_flops/roofline_ceiling)*100:.1f}%")

    return {
        'arith_intensity': arith_intensity,
        'classification': classification,
        'sol_compute': sol_compute,
        'sol_memory': sol_memory,
        'sol_target': sol_target,
    }
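Plugging a decode-shape GEMM into these formulas shows why the M=1 case is so firmly memory-bound (same H100 peaks as above; in FP16 the weight matrix dominates the traffic):

```python
# Decode GEMM: [1 x 4096] @ [4096 x 4096] in FP16.
M, N, K = 1, 4096, 4096
flops = 2 * M * N * K                       # one multiply + one add per term
bytes_moved = 2 * (M * K + K * N + M * N)   # 2 bytes per element in FP16

arith_intensity = flops / bytes_moved       # ~1.0 FLOP/byte
ridge_point = 990e12 / 3350e9               # H100: 990 TFLOPS / 3.35 TB/s

print(f"AI = {arith_intensity:.2f} FLOP/B, ridge = {ridge_point:.0f} FLOP/B")
print("MEMORY-BOUND" if arith_intensity < ridge_point else "COMPUTE-BOUND")
```

The intensity sits two orders of magnitude left of the ridge point, so no amount of compute tuning helps; only moving fewer bytes (weight quantization, batching decode requests) shifts this kernel.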

Typical Kernel Classifications

Roofline Classification of Transformer Operations (H100)

| Operation | Arith. Intensity | Classification | SOL Target | Typical Achieved |
|---|---|---|---|---|
| GEMM (prefill, M=2048) | 256 FLOP/B | Compute-bound | 990 TFLOPS | 700-850 TFLOPS (71-86%) |
| GEMM (decode, M=1) | 1 FLOP/B | Memory-bound | 3350 GB/s | 2500-3000 GB/s (75-90%) |
| FlashAttention (long ctx) | ~64 FLOP/B | Compute-bound | 990 TFLOPS | 500-700 TFLOPS (50-71%) |
| RMSNorm | ~4 FLOP/B | Memory-bound | 3350 GB/s | 2000-2800 GB/s (60-84%) |
| Bias + GELU | ~6 FLOP/B | Memory-bound | 3350 GB/s | 2200-2900 GB/s (66-87%) |
| Residual add | ~1 FLOP/B | Memory-bound | 3350 GB/s | 2500-3100 GB/s (75-93%) |
| Softmax | ~5 FLOP/B | Memory-bound | 3350 GB/s | 1800-2500 GB/s (54-75%) |
Note: GEMMs in prefill are compute-bound (high arithmetic intensity). Everything else is memory-bound. Optimization strategies differ: compute-bound kernels need higher tensor core utilization; memory-bound kernels need better coalescing and reduced traffic.

Worked Example: Profiling a Transformer Layer

Step 1: Nsight Systems Capture

# inference_profile.py - Script to profile
import torch
import nvtx

def profile_transformer_layer():
    device = 'cuda'
    batch = 1
    seq_len = 2048
    hidden = 4096
    heads = 32
    head_dim = hidden // heads

    # Simulate transformer layer components
    qkv_proj = torch.nn.Linear(hidden, 3 * hidden, device=device,
                                dtype=torch.float16)
    o_proj = torch.nn.Linear(hidden, hidden, device=device,
                              dtype=torch.float16)
    gate_proj = torch.nn.Linear(hidden, 11008, device=device,
                                 dtype=torch.float16)
    up_proj = torch.nn.Linear(hidden, 11008, device=device,
                               dtype=torch.float16)
    down_proj = torch.nn.Linear(11008, hidden, device=device,
                                 dtype=torch.float16)
    norm = torch.nn.LayerNorm(hidden, device=device, dtype=torch.float16)

    x = torch.randn(batch, seq_len, hidden, device=device,
                     dtype=torch.float16)

    # Warmup
    for _ in range(10):
        _ = qkv_proj(norm(x))

    # Profile region
    torch.cuda.synchronize()
    with nvtx.annotate("transformer_layer", color="blue"):
        for _ in range(100):
            with nvtx.annotate("norm_1", color="green"):
                h = norm(x)

            with nvtx.annotate("qkv_proj", color="red"):
                qkv = qkv_proj(h)

            with nvtx.annotate("o_proj", color="red"):
                attn_out = o_proj(qkv[:, :, :hidden])

            with nvtx.annotate("residual_1", color="gray"):
                h = x + attn_out

            with nvtx.annotate("norm_2", color="green"):
                h_norm = norm(h)

            with nvtx.annotate("gate_proj", color="purple"):
                gate = gate_proj(h_norm)

            with nvtx.annotate("up_proj", color="purple"):
                up = up_proj(h_norm)

            with nvtx.annotate("silu_mul", color="pink"):
                mlp_h = torch.nn.functional.silu(gate) * up

            with nvtx.annotate("down_proj", color="purple"):
                mlp_out = down_proj(mlp_h)

            with nvtx.annotate("residual_2", color="gray"):
                x = h + mlp_out

    torch.cuda.synchronize()

profile_transformer_layer()

Then capture from the shell:

nsys profile --trace=cuda,nvtx \
    --output=transformer_layer \
    python inference_profile.py

Step 2: Analyze Timeline Results

def analyze_timeline_results():
    """Typical findings from Nsight Systems timeline."""
    findings = {
        "QKV Projection GEMM": {
            "time_us": 85,
            "pct_of_layer": "22%",
            "verdict": "Expected -- large GEMM, compute-bound in prefill",
        },
        "Gate/Up Projection GEMMs": {
            "time_us": 135,
            "pct_of_layer": "35%",
            "verdict": "Largest cost -- MLP projections are 2.7x wider "
                      "than hidden dim (11008 vs 4096)",
        },
        "Down Projection GEMM": {
            "time_us": 68,
            "pct_of_layer": "18%",
            "verdict": "Expected -- GEMM from 11008 to 4096",
        },
        "O Projection GEMM": {
            "time_us": 32,
            "pct_of_layer": "8%",
            "verdict": "Expected -- smallest GEMM (4096 to 4096)",
        },
        "LayerNorm x2": {
            "time_us": 18,
            "pct_of_layer": "5%",
            "verdict": "Small but non-negligible. Memory-bound.",
        },
        "Residual adds + SiLU*up": {
            "time_us": 15,
            "pct_of_layer": "4%",
            "verdict": "Elementwise ops -- should be fused",
        },
        "CPU gaps between kernels": {
            "time_us": 30,
            "pct_of_layer": "8%",
            "verdict": "PROBLEM: 30 us of GPU idle time per layer. "
                      "At 80 layers, this is 2.4 ms total.",
        },
    }

    total = sum(f["time_us"] for f in findings.values())
    print(f"Total layer time: {total} us\n")
    for name, info in findings.items():
        print(f"{name}: {info['time_us']} us ({info['pct_of_layer']})")
        print(f"  -> {info['verdict']}\n")

analyze_timeline_results()

Step 3: Nsight Compute on Key Kernels

# Profile the MLP GEMMs specifically
ncu --kernel-name "regex:gemm" \
    --set full \
    --launch-skip 20 --launch-count 10 \
    --output mlp_gemm_analysis \
    python inference_profile.py

Step 4: Interpret and Optimize

def optimization_actions():
    """Actions based on profiling results."""
    actions = [
        {
            "finding": "CPU gaps: 30 us/layer (8% of layer time)",
            "root_cause": "Python overhead + kernel launch latency",
            "action": "Use CUDA graphs to capture and replay the layer",
            "expected_improvement": "Eliminate ~90% of CPU gaps: 27 us saved",
        },
        {
            "finding": "SiLU*up not fused with gate_proj",
            "root_cause": "Separate kernel for activation function",
            "action": "Use torch.compile or fused SiLU-mul kernel",
            "expected_improvement": "~5 us saved (eliminate HBM round-trip)",
        },
        {
            "finding": "Gate and Up projections launched sequentially",
            "root_cause": "Same CUDA stream, no parallelism",
            "action": "Launch gate_proj and up_proj on separate streams",
            "expected_improvement": "~40 us saved if they can overlap",
        },
        {
            "finding": "MLP GEMM at 75% SOL compute throughput",
            "root_cause": "Nsight Compute shows suboptimal tile size "
                         "selection by cuBLAS",
            "action": "Use CUTLASS with tuned tile sizes, or "
                     "cublasLt with heuristic search",
            "expected_improvement": "Potentially 10-15% GEMM speedup",
        },
    ]

    for a in actions:
        print(f"Finding: {a['finding']}")
        print(f"  Cause:  {a['root_cause']}")
        print(f"  Action: {a['action']}")
        print(f"  Gain:   {a['expected_improvement']}")
        print()

optimization_actions()

Transformer Layer Optimization Progress

| Stage | us per layer | % of baseline |
|---|---|---|
| Baseline | 383 | 100% |
| + CUDA graphs | 356 | 93% |
| + Fused SiLU | 351 | 92% |
| + Parallel streams | 311 | 81% |
| + Tuned GEMM | 285 | 74% |
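These numbers are just the cumulative per-step savings from optimization_actions() applied to the 383 us baseline (the 26 us GEMM step corresponds to the ~10% speedup estimate):

```python
# Reconstruct the optimization progression from per-step savings (us).
baseline = 383
steps = [("CUDA graphs", 27), ("fused SiLU*up", 5),
         ("parallel streams", 40), ("tuned GEMM", 26)]

t = baseline
for name, saved in steps:
    t -= saved
    print(f"+ {name:<17} {t} us/layer ({t / baseline:.0%} of baseline)")
```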

Common Profiling Patterns

Pattern: Memory-Bound Kernel with Low Achieved Bandwidth

def diagnose_low_bandwidth():
    """Diagnosis tree for memory-bound kernels below 70% SOL."""
    print("Achieved HBM BW < 70% of peak:")
    print("  1. Check coalescing (L1 sector efficiency)")
    print("     -> If < 100%: uncoalesced access pattern")
    print("     -> Fix: restructure data layout, use vectorized loads")
    print()
    print("  2. Check L2 hit rate")
    print("     -> If high: data fits in L2, HBM BW is not the bottleneck")
    print("     -> The kernel may be L2-bandwidth or compute limited")
    print()
    print("  3. Check occupancy")
    print("     -> If < 50%: not enough warps to saturate memory pipeline")
    print("     -> Fix: reduce registers per thread, reduce shared mem")
    print()
    print("  4. Check stall reasons")
    print("     -> long_scoreboard high: memory latency not hidden")
    print("     -> Fix: increase ILP, prefetch data, increase occupancy")

Pattern: Compute-Bound Kernel with Low Tensor Core Utilization

def diagnose_low_tensor_core_util():
    """Diagnosis for GEMM kernel with tensor cores under 60%."""
    print("Tensor core utilization < 60%:")
    print("  1. Check matrix dimensions")
    print("     -> M, N, K not multiples of 16: padding overhead")
    print("     -> Small M (decode): not enough work to fill SMs")
    print()
    print("  2. Check tile size vs problem size")
    print("     -> Large tiles + small matrix = low SM occupancy")
    print("     -> Fix: auto-tune tile sizes, use cublasLt heuristic")
    print()
    print("  3. Check data loading stalls")
    print("     -> Tensor cores idle while waiting for data from SMEM")
    print("     -> Fix: increase pipeline depth (multi-stage)")
    print()
    print("  4. Check if epilogue dominates")
    print("     -> Non-trivial epilogue (GELU, residual) adds cycles")
    print("     -> These cycles are not tensor core cycles")

Automated Profiling with Python

Extracting Metrics Programmatically

import subprocess
import json

def run_ncu_and_extract(script_path, kernel_name, metrics):
    """Run Nsight Compute and extract specific metrics."""
    metric_flags = ",".join(metrics)
    cmd = [
        "ncu",
        "--kernel-name", kernel_name,
        "--launch-count", "3",
        "--metrics", metric_flags,
        "--csv",
        "python", script_path,
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout
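The raw CSV from ncu can then be collapsed into the {metric: value} dict that performance_regression_test() expects. A sketch, assuming ncu's CSV carries "Metric Name" and "Metric Value" columns with one row per (launch, metric) — check your ncu version's output, and note that values may contain thousands separators:

```python
import csv
import io

def parse_ncu_csv(csv_text):
    """Average each metric across profiled launches.

    Assumes `ncu --csv` emits "Metric Name" and "Metric Value" columns
    -- verify locally before relying on this in CI.
    """
    sums, counts = {}, {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["Metric Name"]
        value = float(row["Metric Value"].replace(",", ""))
        sums[name] = sums.get(name, 0.0) + value
        counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}
```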

def performance_regression_test(baseline_file, current_metrics):
    """Compare current kernel performance against baseline."""
    # Load baseline
    with open(baseline_file) as f:
        baseline = json.load(f)

    regressions = []
    for metric_name, baseline_val in baseline.items():
        current_val = current_metrics.get(metric_name, 0)

        if isinstance(baseline_val, (int, float)) and baseline_val > 0:
            ratio = current_val / baseline_val
            if ratio < 0.95:  # 5% regression threshold
                regressions.append({
                    'metric': metric_name,
                    'baseline': baseline_val,
                    'current': current_val,
                    'regression': f"{(1-ratio)*100:.1f}%",
                })

    if regressions:
        print("PERFORMANCE REGRESSIONS DETECTED:")
        for r in regressions:
            print(f"  {r['metric']}: {r['baseline']} -> {r['current']} "
                  f"({r['regression']} slower)")
    else:
        print("No regressions detected.")

    return regressions
💡 CI Integration

Add Nsight Compute profiling to your CI pipeline for kernel-critical code. Capture baseline metrics for key kernels, and fail the build if any metric regresses beyond a threshold. This catches performance regressions before they reach production. Use --replay-mode application for deterministic results.

Summary

The profiling workflow is: Nsight Systems first (system-wide timeline to identify bottleneck kernels and CPU/GPU gaps), then Nsight Compute on specific kernels (SOL analysis, memory chart, roofline placement, stall reasons). The SOL metric tells you immediately whether a kernel is compute-bound or memory-bound. Memory-bound kernels below 70% SOL need coalescing improvement, occupancy tuning, or traffic reduction. Compute-bound kernels below 70% SOL need better tile sizing, higher tensor core utilization, or pipeline optimization.

The worked example shows the typical optimization progression: CUDA graphs eliminate CPU gaps (8% of layer time), kernel fusion eliminates elementwise HBM round-trips (4%), stream parallelism overlaps independent operations (10%), and GEMM tuning improves compute throughput (5-10%). Combined, these bring a transformer layer from 40% to 85% of theoretical peak, which is the practical ceiling for production code.