You optimize a CUDA kernel for two hours, improve occupancy from 50% to 100%, and throughput stays flat. You spent two hours optimizing the wrong thing. Nsight Compute would have told you in 30 seconds that the kernel is memory-bound at 95% of peak bandwidth — occupancy is irrelevant. The difference between guessing and profiling is the difference between random changes and systematic improvement. Nsight Systems identifies which kernels consume wall-clock time. Nsight Compute identifies why those kernels are slow. Used together in the correct order, they reduce optimization time from days to hours and eliminate wasted effort on irrelevant metrics.
This post covers the complete profiling workflow: tool selection, capture methodology, metric interpretation, and a worked example optimizing a transformer layer from 40% to 85% of the speed-of-light (SOL) target.
Tool Selection: Nsight Systems vs Nsight Compute
When to Use Which
Nsight Systems vs Nsight Compute: Feature Comparison
| Capability | Nsight Systems | Nsight Compute |
|---|---|---|
| Scope | Entire application | Single kernel |
| Timeline view | Yes (CPU + GPU + NCCL) | No |
| CPU profiling | Yes (call stacks) | No |
| GPU kernel timing | Yes (wall-clock) | Yes (cycle-accurate) |
| Memory throughput | No | Yes (L1, L2, HBM breakdown) |
| Compute utilization | No | Yes (per-pipe: FMA, tensor, SFU) |
| Roofline analysis | No | Yes |
| Occupancy analysis | No | Yes (theoretical + achieved) |
| Stall reasons | No | Yes (warp stall breakdown) |
| Overhead | Low (1-5%) | High (10-100x per kernel) |
| NCCL / multi-GPU | Yes | No |
The Profiling Workflow
1. Nsight Systems: Capture full timeline
-> Identify: Which kernels take the most time?
-> Identify: Are there CPU/GPU gaps (GPU idle)?
-> Identify: Is NCCL communication overlapped?
2. Nsight Compute: Profile the top-N slowest kernels
-> For each kernel: Is it compute-bound or memory-bound?
-> If memory-bound: What is the achieved bandwidth?
-> If compute-bound: What is the tensor core utilization?
-> What are the stall reasons?
3. Fix: Apply targeted optimization
-> Memory-bound: improve coalescing, reduce bank conflicts
-> Compute-bound: improve instruction mix, raise tensor core utilization
-> CPU-bound: reduce kernel launch overhead, use CUDA graphs
4. Verify: Re-profile to confirm improvement
Nsight Systems: Capturing and Reading Timelines
Capture Commands
# Basic capture: profile entire application
nsys profile --trace=cuda,nvtx,osrt \
--output=transformer_profile \
python inference.py
# Capture with GPU metrics (SM activity, memory throughput)
nsys profile --trace=cuda,nvtx,osrt \
--gpu-metrics-device=all \
--output=transformer_profile_gpu \
python inference.py
# Capture with NCCL communication (multi-GPU)
nsys profile --trace=cuda,nvtx,osrt,nccl \
--output=multi_gpu_profile \
torchrun --nproc_per_node=8 inference.py
# Delayed capture (skip warmup, capture only steady state)
nsys profile --trace=cuda,nvtx \
--delay=5 --duration=10 \
--output=steady_state \
python server.py
# Export to JSON for programmatic analysis
nsys stats --format=json transformer_profile.nsys-rep \
> profile_stats.json
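For scripted triage, the per-kernel summary can also be exported as CSV and ranked in a few lines. A sketch of the ranking step — the inline `sample` string and its column names mimic the shape of an `nsys stats --report cuda_gpu_kern_sum --format csv` export, but treat those names as an assumption and check what your nsys version actually emits:

```python
import csv
import io

# Illustrative CSV in the shape of a per-kernel summary export;
# column names are an assumption -- verify against your nsys output
sample = """Time (%),Total Time (ns),Instances,Name
35.2,13500000,100,ampere_fp16_s16816gemm_fp16_128x128_ldg8
22.1,8500000,100,ampere_fp16_s16816gemm_fp16_64x128_ldg8
4.7,1800000,200,vectorized_elementwise_kernel
"""

def top_kernels(csv_text, n=3):
    """Rank kernels by total GPU time -- the step-1 question of the workflow."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: int(r["Total Time (ns)"]), reverse=True)
    return [(r["Name"], int(r["Total Time (ns)"]) / 1e6) for r in rows[:n]]

for name, ms in top_kernels(sample):
    print(f"{name}: {ms:.2f} ms")
```

The output of this ranking is the input to step 2: the top one or two kernel names become the `--kernel-name` filters for Nsight Compute.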
NVTX Annotations for Context
Add NVTX markers to your code so the Nsight Systems timeline shows meaningful labels:
import torch
import nvtx # pip install nvtx
class ProfiledTransformerLayer:
"""Transformer layer with NVTX annotations for Nsight Systems."""
def __init__(self, layer, layer_idx):
self.layer = layer
self.layer_idx = layer_idx
    @nvtx.annotate("layer_forward", color="blue")
def forward(self, hidden_states, kv_cache=None):
with nvtx.annotate(f"L{self.layer_idx}_rmsnorm_1", color="green"):
residual = hidden_states
hidden_states = self.layer.input_layernorm(hidden_states)
        with nvtx.annotate(f"L{self.layer_idx}_attention", color="red"):
            with nvtx.annotate("qkv_proj", color="orange"):
                qkv = self.layer.self_attn.qkv_proj(hidden_states)
                q, k, v = qkv.chunk(3, dim=-1)
            with nvtx.annotate("rope", color="yellow"):
                # apply_rotary_pos_emb and flash_attention are assumed to be
                # defined elsewhere in the codebase
                q, k = apply_rotary_pos_emb(q, k)
            with nvtx.annotate("flash_attn", color="red"):
                attn_out = flash_attention(q, k, v, kv_cache)
with nvtx.annotate("o_proj", color="orange"):
attn_out = self.layer.self_attn.o_proj(attn_out)
with nvtx.annotate(f"L{self.layer_idx}_residual_1", color="gray"):
hidden_states = residual + attn_out
with nvtx.annotate(f"L{self.layer_idx}_rmsnorm_2", color="green"):
residual = hidden_states
hidden_states = self.layer.post_attention_layernorm(hidden_states)
with nvtx.annotate(f"L{self.layer_idx}_mlp", color="purple"):
with nvtx.annotate("gate_up_proj", color="pink"):
gate = self.layer.mlp.gate_proj(hidden_states)
up = self.layer.mlp.up_proj(hidden_states)
with nvtx.annotate("silu_mul", color="pink"):
hidden_states = torch.nn.functional.silu(gate) * up
with nvtx.annotate("down_proj", color="pink"):
hidden_states = self.layer.mlp.down_proj(hidden_states)
with nvtx.annotate(f"L{self.layer_idx}_residual_2", color="gray"):
hidden_states = residual + hidden_states
return hidden_states
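The `nvtx` package is only needed while profiling. A small guard keeps the same annotated code importable in environments where it isn't installed — the `profile_range` helper below is ours, not part of either library:

```python
from contextlib import nullcontext

try:
    import nvtx  # pip install nvtx

    def profile_range(name, color="blue"):
        """Real NVTX range when the package is available."""
        return nvtx.annotate(name, color=color)
except ImportError:
    def profile_range(name, color=None):
        """No-op fallback so production code runs unchanged."""
        return nullcontext()

# Works identically with or without nvtx installed
with profile_range("mlp"):
    result = sum(i * i for i in range(10))
print(result)  # 285
```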
Reading the Timeline
def interpret_nsys_timeline():
"""Key things to look for in a Nsight Systems timeline."""
checks = {
"GPU Idle Gaps": {
"what": "Periods where no GPU kernel is running",
"cause": "CPU overhead: Python, framework scheduling, "
"kernel launch latency",
"fix": "CUDA graphs, reduce Python overhead, "
"torch.compile, larger batches",
},
"Kernel Serialization": {
"what": "Kernels running one after another on the same stream "
"when they could overlap",
"cause": "All kernels on default stream, or unnecessary "
"synchronization",
"fix": "Use multiple CUDA streams for independent operations",
},
"NCCL Gaps": {
"what": "GPU idle while waiting for NCCL collective",
"cause": "Communication not overlapped with computation",
"fix": "Pipeline parallelism, overlap allreduce with "
"next layer's computation",
},
"Small Kernels": {
"what": "Many tiny kernels (under 10 us each)",
"cause": "Unfused operations, each launching a separate kernel",
"fix": "Kernel fusion (torch.compile, custom kernels)",
},
"Memory Transfer": {
"what": "cudaMemcpy operations visible in timeline",
"cause": "CPU-GPU data transfer (model loading, KV cache swap)",
"fix": "Pinned memory, async transfers, prefetch",
},
}
for name, info in checks.items():
print(f"\n--- {name} ---")
for key, val in info.items():
print(f" {key}: {val}")
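The first check, GPU idle gaps, reduces to interval arithmetic over the kernel start/end times you read off the timeline. A minimal single-stream sketch with illustrative timings:

```python
def gpu_idle_us(intervals):
    """Total idle time (us) between consecutive kernels on one stream.

    intervals: list of (start_us, end_us) kernel execution spans.
    Assumes a single stream; overlapping streams would need merging first.
    """
    intervals = sorted(intervals)
    gaps = 0.0
    for (_, end), (start, _) in zip(intervals, intervals[1:]):
        gaps += max(0.0, start - end)
    return gaps

# Three kernels with launch gaps between them (illustrative numbers)
print(gpu_idle_us([(0, 85), (115, 250), (260, 328)]))  # 40.0
```

If the idle total is a meaningful fraction of the busy total, the fix list above (CUDA graphs, fewer Python round-trips, fusion) applies before any per-kernel work.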
Nsight Compute: Per-Kernel Deep Analysis
Capture Commands
# Profile a specific kernel (the regex: prefix enables pattern matching)
ncu --kernel-name "regex:flash_fwd_kernel" \
--set full \
--output flash_attn_profile \
python inference.py
# Profile all kernels (warning: very slow for large models)
ncu --set full \
--output all_kernels \
python inference.py
# Profile with roofline data
ncu --set roofline \
    --kernel-name "regex:.*gemm.*" \
--output gemm_roofline \
python inference.py
# Limit number of kernel launches profiled
ncu --kernel-name "regex:.*" \
--launch-count 5 \
--set full \
--output quick_profile \
python inference.py
# Compare two kernel implementations
ncu --set full --output kernel_v1 python run_v1.py
ncu --set full --output kernel_v2 python run_v2.py
# Open both in Nsight Compute GUI for side-by-side comparison
Key Metrics and What They Mean
def ncu_metric_interpretation():
"""Key Nsight Compute metrics and their interpretation."""
metrics = {
"SOL (Speed of Light)": {
"section": "GPU Speed of Light Throughput",
"metrics": {
"sm__throughput.avg.pct_of_peak_sustained_elapsed":
"Compute throughput as % of peak. "
">80% = compute-bound, <50% = likely memory-bound",
"gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed":
"DRAM (HBM) throughput as % of peak. "
">80% = memory-bound, <50% = other bottleneck",
},
},
"Memory Workload Analysis": {
"section": "Memory Workload Analysis",
"metrics": {
"l1tex__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_ld.sum":
"Bytes loaded from L2 to L1 (L1 misses). "
"Compare to ideal (coalesced) to find inefficiency",
"dram__bytes_read.sum":
"Total HBM bytes read. Compare to "
"theoretical minimum for the algorithm",
"l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum":
"Shared memory bank conflicts. "
"Non-zero = shared memory access pattern issue",
},
},
"Compute Workload Analysis": {
"section": "Compute Workload Analysis",
"metrics": {
"sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_elapsed":
"Tensor core utilization. For GEMM kernels, "
"this should be >60%",
"sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_elapsed":
"FMA pipe utilization. High = CUDA core compute-bound",
},
},
"Occupancy": {
"section": "Occupancy",
"metrics": {
"sm__warps_active.avg.pct_of_peak_sustained_active":
"Achieved occupancy. Low occupancy = "
"not enough warps to hide latency",
"launch__occupancy_limit_registers":
"Occupancy limited by register usage. "
"Consider --maxrregcount",
"launch__occupancy_limit_shared_mem":
"Occupancy limited by shared memory. "
"Consider reducing tile size",
},
},
"Warp Stalls": {
"section": "Warp State Statistics",
"metrics": {
"smsp__warps_issue_stalled_long_scoreboard_per_warp_active.ratio":
"Stalled waiting for global memory (L2/HBM). "
"High = memory latency issue",
"smsp__warps_issue_stalled_math_pipe_throttle_per_warp_active.ratio":
"Stalled because math pipes are full. "
"High = compute-bound (good!)",
"smsp__warps_issue_stalled_wait_per_warp_active.ratio":
"Stalled waiting for synchronization "
"(__syncthreads, barrier). High = load imbalance",
"smsp__warps_issue_stalled_short_scoreboard_per_warp_active.ratio":
"Stalled on shared memory or L1 cache. "
"High = shared memory latency (bank conflicts?)",
},
},
}
for section_name, section_data in metrics.items():
print(f"\n=== {section_name} ===")
print(f" Section: {section_data['section']}")
for metric, desc in section_data['metrics'].items():
print(f" {metric}:")
print(f" {desc}")
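The SOL rules of thumb quoted above collapse into a small triage helper. The thresholds are the heuristics from this section, not hard limits:

```python
def classify_bottleneck(sm_sol_pct, dram_sol_pct):
    """Apply the rule-of-thumb SOL thresholds: >80% on a pipe = bound by it."""
    if sm_sol_pct > 80:
        return "compute-bound"
    if dram_sol_pct > 80:
        return "memory-bound"
    if sm_sol_pct < 50 and dram_sol_pct < 50:
        return "latency-bound: check occupancy and warp stall reasons"
    return "mixed: inspect the full SOL breakdown"

print(classify_bottleneck(85, 40))  # compute-bound
print(classify_bottleneck(30, 92))  # memory-bound
print(classify_bottleneck(35, 42))  # latency-bound: ...
```

The latency-bound branch is the one that sends you to the Warp State Statistics section: neither pipe is saturated, so something is preventing the GPU from issuing work.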
The Roofline Model in Nsight Compute
Interpreting the Roofline Plot
The roofline model plots a kernel’s position based on its arithmetic intensity (FLOPs per byte of memory traffic) and achieved performance (FLOPS or bytes/sec).
def roofline_analysis(kernel_flops, kernel_bytes, kernel_time_s,
peak_flops=990e12, peak_bw=3350e9):
"""Analyze a kernel's position on the roofline model.
Args:
kernel_flops: Total FLOPs executed by the kernel
kernel_bytes: Total bytes transferred (HBM) by the kernel
kernel_time_s: Kernel execution time in seconds
peak_flops: GPU peak FLOPS (H100: 990 TFLOPS FP16)
peak_bw: GPU peak HBM bandwidth (H100: 3350 GB/s)
"""
# Arithmetic intensity: FLOPs per byte
arith_intensity = kernel_flops / kernel_bytes
# Achieved performance
achieved_flops = kernel_flops / kernel_time_s
achieved_bw = kernel_bytes / kernel_time_s
# Ridge point: where memory-bound transitions to compute-bound
ridge_point = peak_flops / peak_bw # FLOPs/byte
# Roofline ceiling at this arithmetic intensity
roofline_ceiling = min(peak_flops, peak_bw * arith_intensity)
# SOL (speed of light) percentage
sol_compute = (achieved_flops / peak_flops) * 100
sol_memory = (achieved_bw / peak_bw) * 100
# Classification
if arith_intensity < ridge_point:
classification = "MEMORY-BOUND"
sol_target = sol_memory
limiting = f"HBM bandwidth ({achieved_bw/1e9:.0f} / "
limiting += f"{peak_bw/1e9:.0f} GB/s = {sol_memory:.1f}%)"
else:
classification = "COMPUTE-BOUND"
sol_target = sol_compute
limiting = f"Compute ({achieved_flops/1e12:.0f} / "
limiting += f"{peak_flops/1e12:.0f} TFLOPS = {sol_compute:.1f}%)"
print(f"=== Roofline Analysis ===")
print(f"Arithmetic intensity: {arith_intensity:.1f} FLOPs/byte")
print(f"Ridge point: {ridge_point:.1f} FLOPs/byte")
print(f"Classification: {classification}")
print(f"Limiting resource: {limiting}")
print(f"Achieved vs ceiling: "
f"{achieved_flops/1e12:.1f} / {roofline_ceiling/1e12:.1f} TFLOPS "
f"= {(achieved_flops/roofline_ceiling)*100:.1f}%")
return {
'arith_intensity': arith_intensity,
'classification': classification,
'sol_compute': sol_compute,
'sol_memory': sol_memory,
}
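As a sanity check of the formulas above, a decode-phase GEMM (M=1, FP16) lands far to the left of the H100 ridge point. The matrix dimensions are illustrative:

```python
peak_flops, peak_bw = 990e12, 3350e9     # H100: FP16 tensor peak, HBM peak
ridge = peak_flops / peak_bw             # ~295 FLOPs/byte

M, N, K = 1, 4096, 4096                  # decode: one token per step
flops = 2 * M * N * K                    # one multiply-add = 2 FLOPs
bytes_hbm = 2 * (M * K + K * N + M * N)  # FP16: 2 bytes per element
ai = flops / bytes_hbm                   # ~1 FLOP/byte

print(f"AI = {ai:.2f} FLOPs/byte vs ridge {ridge:.0f} -> memory-bound")
```

With arithmetic intensity near 1 against a ridge near 295, no amount of compute tuning helps this kernel; only reducing bytes moved (quantization, batching more tokens per step) changes its ceiling.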
Typical Kernel Classifications
Roofline Classification of Transformer Operations (H100)
| Operation | Arith. Intensity | Classification | SOL Target | Typical Achieved |
|---|---|---|---|---|
| GEMM (prefill, M=2048) | 256 FLOP/B | Compute-bound | 990 TFLOPS | 700-850 TFLOPS (71-86%) |
| GEMM (decode, M=1) | 1 FLOP/B | Memory-bound | 3350 GB/s | 2500-3000 GB/s (75-90%) |
| FlashAttention (long ctx) | ~64 FLOP/B | Compute-bound | 990 TFLOPS | 500-700 TFLOPS (50-71%) |
| RMSNorm | ~4 FLOP/B | Memory-bound | 3350 GB/s | 2000-2800 GB/s (60-84%) |
| Bias + GELU | ~6 FLOP/B | Memory-bound | 3350 GB/s | 2200-2900 GB/s (66-87%) |
| Residual add | ~1 FLOP/B | Memory-bound | 3350 GB/s | 2500-3100 GB/s (75-93%) |
| Softmax | ~5 FLOP/B | Memory-bound | 3350 GB/s | 1800-2500 GB/s (54-75%) |
Worked Example: Profiling a Transformer Layer
Step 1: Nsight Systems Capture
# inference_profile.py - Script to profile
import torch
import nvtx
def profile_transformer_layer():
device = 'cuda'
batch = 1
seq_len = 2048
hidden = 4096
heads = 32
head_dim = hidden // heads
# Simulate transformer layer components
qkv_proj = torch.nn.Linear(hidden, 3 * hidden, device=device,
dtype=torch.float16)
o_proj = torch.nn.Linear(hidden, hidden, device=device,
dtype=torch.float16)
gate_proj = torch.nn.Linear(hidden, 11008, device=device,
dtype=torch.float16)
up_proj = torch.nn.Linear(hidden, 11008, device=device,
dtype=torch.float16)
down_proj = torch.nn.Linear(11008, hidden, device=device,
dtype=torch.float16)
norm = torch.nn.LayerNorm(hidden, device=device, dtype=torch.float16)
x = torch.randn(batch, seq_len, hidden, device=device,
dtype=torch.float16)
# Warmup
for _ in range(10):
_ = qkv_proj(norm(x))
# Profile region
torch.cuda.synchronize()
with nvtx.annotate("transformer_layer", color="blue"):
for _ in range(100):
with nvtx.annotate("norm_1", color="green"):
h = norm(x)
with nvtx.annotate("qkv_proj", color="red"):
qkv = qkv_proj(h)
            with nvtx.annotate("o_proj", color="red"):
                # Simulation only: the attention math itself is skipped and
                # the Q slice of the QKV output feeds the output projection
                attn_out = o_proj(qkv[:, :, :hidden])
with nvtx.annotate("residual_1", color="gray"):
h = x + attn_out
with nvtx.annotate("norm_2", color="green"):
h_norm = norm(h)
with nvtx.annotate("gate_proj", color="purple"):
gate = gate_proj(h_norm)
with nvtx.annotate("up_proj", color="purple"):
up = up_proj(h_norm)
with nvtx.annotate("silu_mul", color="pink"):
mlp_h = torch.nn.functional.silu(gate) * up
with nvtx.annotate("down_proj", color="purple"):
mlp_out = down_proj(mlp_h)
with nvtx.annotate("residual_2", color="gray"):
x = h + mlp_out
torch.cuda.synchronize()
profile_transformer_layer()
# Capture
nsys profile --trace=cuda,nvtx \
--output=transformer_layer \
python inference_profile.py
Step 2: Analyze Timeline Results
def analyze_timeline_results():
"""Typical findings from Nsight Systems timeline."""
findings = {
"QKV Projection GEMM": {
"time_us": 85,
"pct_of_layer": "22%",
"verdict": "Expected -- large GEMM, compute-bound in prefill",
},
"Gate/Up Projection GEMMs": {
"time_us": 135,
"pct_of_layer": "35%",
"verdict": "Largest cost -- MLP projections are 2.7x wider "
"than hidden dim (11008 vs 4096)",
},
"Down Projection GEMM": {
"time_us": 68,
"pct_of_layer": "18%",
"verdict": "Expected -- GEMM from 11008 to 4096",
},
"O Projection GEMM": {
"time_us": 32,
"pct_of_layer": "8%",
"verdict": "Expected -- smallest GEMM (4096 to 4096)",
},
"LayerNorm x2": {
"time_us": 18,
"pct_of_layer": "5%",
"verdict": "Small but non-negligible. Memory-bound.",
},
"Residual adds + SiLU*up": {
"time_us": 15,
"pct_of_layer": "4%",
"verdict": "Elementwise ops -- should be fused",
},
"CPU gaps between kernels": {
"time_us": 30,
"pct_of_layer": "8%",
"verdict": "PROBLEM: 30 us of GPU idle time per layer. "
"At 80 layers, this is 2.4 ms total.",
},
}
total = sum(f["time_us"] for f in findings.values())
print(f"Total layer time: {total} us\n")
for name, info in findings.items():
print(f"{name}: {info['time_us']} us ({info['pct_of_layer']})")
print(f" -> {info['verdict']}\n")
analyze_timeline_results()
Step 3: Nsight Compute on Key Kernels
# Profile the MLP GEMMs specifically
ncu --kernel-name "regex:.*gemm.*" \
--set full \
--launch-skip 20 --launch-count 10 \
--output mlp_gemm_analysis \
python inference_profile.py
Step 4: Interpret and Optimize
def optimization_actions():
"""Actions based on profiling results."""
actions = [
{
"finding": "CPU gaps: 30 us/layer (8% of layer time)",
"root_cause": "Python overhead + kernel launch latency",
"action": "Use CUDA graphs to capture and replay the layer",
"expected_improvement": "Eliminate ~90% of CPU gaps: 27 us saved",
},
{
"finding": "SiLU*up not fused with gate_proj",
"root_cause": "Separate kernel for activation function",
"action": "Use torch.compile or fused SiLU-mul kernel",
"expected_improvement": "~5 us saved (eliminate HBM round-trip)",
},
{
"finding": "Gate and Up projections launched sequentially",
"root_cause": "Same CUDA stream, no parallelism",
"action": "Launch gate_proj and up_proj on separate streams",
"expected_improvement": "~40 us saved if they can overlap",
},
{
"finding": "MLP GEMM at 75% SOL compute throughput",
"root_cause": "Nsight Compute shows suboptimal tile size "
"selection by cuBLAS",
"action": "Use CUTLASS with tuned tile sizes, or "
"cublasLt with heuristic search",
"expected_improvement": "Potentially 10-15% GEMM speedup",
},
]
for a in actions:
print(f"Finding: {a['finding']}")
print(f" Cause: {a['root_cause']}")
print(f" Action: {a['action']}")
print(f" Gain: {a['expected_improvement']}")
print()
optimization_actions()
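Combining these expected gains with the timeline totals from Step 2 gives a rough per-layer projection. This is back-of-envelope arithmetic using the numbers above, not a measurement:

```python
baseline = 383                   # total layer time (us) from Step 2
gemm_time = 85 + 135 + 68 + 32   # the four GEMMs (us)

savings = 27                     # CUDA graphs: ~90% of the 30 us CPU gap
savings += 5                     # fused SiLU*up kernel
savings += 40                    # gate/up overlap on separate streams
savings += 0.10 * gemm_time      # ~10% GEMM speedup from tile tuning

optimized = baseline - savings
print(f"{baseline} us -> {optimized:.0f} us "
      f"({baseline / optimized:.2f}x per-layer speedup)")
```

The stream-overlap gain is the least certain of the four: it only materializes if the two GEMMs individually leave enough SMs idle to actually run concurrently, which is exactly what the Nsight Compute occupancy section tells you.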
Transformer Layer Optimization Progress (us per layer)
Common Profiling Patterns
Pattern: Memory-Bound Kernel with Low Achieved Bandwidth
def diagnose_low_bandwidth():
"""Diagnosis tree for memory-bound kernels below 70% SOL."""
print("Achieved HBM BW < 70% of peak:")
print(" 1. Check coalescing (L1 sector efficiency)")
print(" -> If < 100%: uncoalesced access pattern")
print(" -> Fix: restructure data layout, use vectorized loads")
print()
print(" 2. Check L2 hit rate")
print(" -> If high: data fits in L2, HBM BW is not the bottleneck")
print(" -> The kernel may be L2-bandwidth or compute limited")
print()
print(" 3. Check occupancy")
print(" -> If < 50%: not enough warps to saturate memory pipeline")
print(" -> Fix: reduce registers per thread, reduce shared mem")
print()
print(" 4. Check stall reasons")
print(" -> long_scoreboard high: memory latency not hidden")
print(" -> Fix: increase ILP, prefetch data, increase occupancy")
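Check 1 (coalescing) can be made concrete by counting the 32-byte sectors one warp touches. A stride-2 pattern fetches twice the sectors for the same useful data, halving effective bandwidth — a simplified model, ignoring caching:

```python
def sectors_touched(stride_words, threads=32, word_bytes=4, sector_bytes=32):
    """Count distinct 32B sectors one warp touches for a strided 4B load."""
    addresses = [t * stride_words * word_bytes for t in range(threads)]
    return len({addr // sector_bytes for addr in addresses})

coalesced = sectors_touched(1)  # unit stride: every fetched byte is used
strided = sectors_touched(2)    # every other word: half the bytes wasted

print(coalesced, strided)  # 4 8
```

This is what the L1 sector-efficiency metric measures directly: sectors actually needed divided by sectors fetched.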
Pattern: Compute-Bound Kernel with Low Tensor Core Utilization
def diagnose_low_tensor_core_util():
"""Diagnosis for GEMM kernel with tensor cores under 60%."""
print("Tensor core utilization < 60%:")
print(" 1. Check matrix dimensions")
print(" -> M, N, K not multiples of 16: padding overhead")
print(" -> Small M (decode): not enough work to fill SMs")
print()
print(" 2. Check tile size vs problem size")
print(" -> Large tiles + small matrix = low SM occupancy")
print(" -> Fix: auto-tune tile sizes, use cublasLt heuristic")
print()
print(" 3. Check data loading stalls")
print(" -> Tensor cores idle while waiting for data from SMEM")
print(" -> Fix: increase pipeline depth (multi-stage)")
print()
print(" 4. Check if epilogue dominates")
print(" -> Non-trivial epilogue (GELU, residual) adds cycles")
print(" -> These cycles are not tensor core cycles")
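Point 1 (dimension alignment) has a simple cost model: dimensions round up to the MMA tile granularity, and all work in the padding is wasted. The 16-element granularity below is an assumption for illustration; real tile shapes depend on the kernel:

```python
def padded_fraction_wasted(m, n, k, tile=16):
    """Fraction of MMA work spent on padding when dims round up to `tile`."""
    round_up = lambda x: -(-x // tile) * tile  # ceil to multiple of tile
    useful = m * n * k
    padded = round_up(m) * round_up(n) * round_up(k)
    return 1 - useful / padded

print(padded_fraction_wasted(1, 4096, 4096))     # 0.9375: decode M=1 -> 16
print(padded_fraction_wasted(2048, 4096, 4096))  # 0.0: aligned prefill shape
```

The decode case (M=1 padded to 16) is why low tensor core utilization during decode is expected rather than a bug: 15/16 of each MMA tile is padding.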
Automated Profiling with Python
Extracting Metrics Programmatically
import subprocess
import json
def run_ncu_and_extract(script_path, kernel_name, metrics):
    """Run Nsight Compute and return its CSV output for the given metrics."""
metric_flags = ",".join(metrics)
cmd = [
"ncu",
"--kernel-name", kernel_name,
"--launch-count", "3",
"--metrics", metric_flags,
"--csv",
"python", script_path,
]
result = subprocess.run(cmd, capture_output=True, text=True)
return result.stdout
def performance_regression_test(baseline_file, current_metrics):
    """Compare current kernel metrics against a baseline.

    Assumes higher-is-better metrics (throughput, utilization);
    invert the ratio for latency-style metrics.
    """
    # Load baseline
with open(baseline_file) as f:
baseline = json.load(f)
regressions = []
for metric_name, baseline_val in baseline.items():
current_val = current_metrics.get(metric_name, 0)
if isinstance(baseline_val, (int, float)) and baseline_val > 0:
ratio = current_val / baseline_val
if ratio < 0.95: # 5% regression threshold
regressions.append({
'metric': metric_name,
'baseline': baseline_val,
'current': current_val,
'regression': f"{(1-ratio)*100:.1f}%",
})
if regressions:
print("PERFORMANCE REGRESSIONS DETECTED:")
for r in regressions:
print(f" {r['metric']}: {r['baseline']} -> {r['current']} "
f"({r['regression']} slower)")
else:
print("No regressions detected.")
return regressions
Add Nsight Compute profiling to your CI pipeline for kernel-critical code. Capture baseline metrics for key kernels, and fail the build if any metric regresses beyond a threshold. This catches performance regressions before they reach production. Use --replay-mode application for deterministic results.
Summary
The profiling workflow is: Nsight Systems first (system-wide timeline to identify bottleneck kernels and CPU/GPU gaps), then Nsight Compute on specific kernels (SOL analysis, memory chart, roofline placement, stall reasons). The SOL metric tells you immediately whether a kernel is compute-bound or memory-bound. Memory-bound kernels below 70% SOL need coalescing improvement, occupancy tuning, or traffic reduction. Compute-bound kernels below 70% SOL need better tile sizing, higher tensor core utilization, or pipeline optimization.
The worked example shows the typical optimization progression: CUDA graphs eliminate CPU gaps (8% of layer time), kernel fusion eliminates elementwise HBM round-trips (4%), stream parallelism overlaps independent operations (10%), and GEMM tuning improves compute throughput (5-10%). Combined, these bring a transformer layer from 40% to 85% of theoretical peak, which is the practical ceiling for production code.