Part of Series: GPU Hardware & AI Accelerators (11 of 30)

An H100 SM partition contains a single warp scheduler that issues one instruction per clock cycle. With four partitions per SM, each SM issues at most four instructions per cycle. At a 1.83 GHz boost clock, that is 4 × 1.83 × 10⁹ ≈ 7.32 billion instructions per second per SM, or 7.32 × 132 ≈ 966 billion instructions per second across the full GPU's 132 SMs. The scheduler's job is to select, every single cycle, which warp to run next from a pool of up to 16 resident warps per partition (64 per SM on compute capability 9.0). When a warp stalls on a memory load that takes 500 cycles, the scheduler immediately switches to another ready warp at zero cost. This is the entire GPU execution model: hide latency by having far more work in flight than execution units.

This post dissects the warp scheduler microarchitecture, the scoreboard mechanism that tracks instruction dependencies, the scheduling policy, dual-issue rules on older architectures, the common stall reasons that Nsight Compute reports, and the arithmetic for determining how many warps you actually need to fully hide memory latency.

The Warp Scheduler Pipeline

From Instruction Fetch to Execution

Each warp scheduler operates a multi-stage pipeline:

  1. Instruction fetch: The scheduler reads the next instruction from the I-cache using the warp’s program counter (PC).
  2. Decode: The instruction is decoded to determine operands, execution unit, and latency.
  3. Scoreboard check: The scheduler verifies that all source operands are available (not pending from a prior instruction).
  4. Operand collection: Register values are read from the register file.
  5. Dispatch: The instruction is sent to the appropriate execution unit (INT32, FP32, FP64, tensor core, LD/ST, SFU).

The scheduler can issue one new instruction every cycle, but the instruction it issues may take multiple cycles to produce its result. The key invariant is that the scheduler never stalls itself — if the current warp is not ready, it picks a different warp.

Cycle 1:  Scheduler checks Warp 0 → READY → Issues FMA to FP32 pipe
Cycle 2:  Scheduler checks Warp 1 → READY → Issues LOAD to LD/ST
Cycle 3:  Scheduler checks Warp 2 → READY → Issues FMA to FP32 pipe
Cycle 4:  Scheduler checks Warp 0 → STALLED (waiting for prior result) → skip
          Scheduler checks Warp 3 → READY → Issues SIN to SFU
...
Cycle 502: Warp 1's LOAD completes → Warp 1 becomes READY again

The scheduler examines warps in priority order, not round-robin. If Warp 0 becomes ready again before Warp 3, the scheduler issues from Warp 0.

The Scoreboard

The scoreboard is a hardware structure that tracks which registers have pending writes. When an instruction is issued, the destination register is marked “pending” in the scoreboard. When the instruction completes (the result is written back to the register file), the pending bit is cleared.

A warp is eligible for issue only when:

  • Its next instruction’s source registers have no pending writes (all operands available)
  • The required execution unit (FP32, LD/ST, SFU, tensor core) is not fully occupied
  • The warp is not waiting on a barrier (__syncthreads(), __syncwarp())
📊 Scoreboard Stall Example

Cycle | Action                                    | Scoreboard State for Warp 0
1     | Issue: R4 = LOAD [addr]                   | R4 = PENDING
2     | Issue: R5 = R1 + R2 (no dependency)       | R4 = PENDING, R5 = PENDING
3     | Check: R6 = R4 * R3 → R4 PENDING → STALL  | R4 = PENDING, R5 = done
4-500 | Warp 0 stalled. Other warps issue.        | R4 = PENDING
501   | LOAD completes, R4 written back           | R4 = done
502   | Issue: R6 = R4 * R3 → R4 ready → GO       | R6 = PENDING
Note: The scoreboard introduces zero overhead for switching warps. The stalled warp stays in the eligible pool but is simply skipped each cycle.

Dependency Tracking Granularity

The scoreboard tracks dependencies at the register level, not the instruction level. If instruction A writes R4 and instruction B reads R7, B can issue in the same cycle that A completes — there is no pipeline bubble. The only stall condition is a true data dependency: a read-after-write (RAW) hazard on the same register.

Write-after-read (WAR) and write-after-write (WAW) hazards are handled by the in-order nature of warp execution. Within a single warp, instructions issue in program order. The compiler (ptxas) schedules instructions to maximize the distance between a write and its dependent read — this is the primary optimization the CUDA compiler performs at the SASS level.

Scheduling Policy

Greedy-Then-Oldest (GTO)

NVIDIA does not publish the exact scheduling policy, but extensive microbenchmarking (Jia et al., 2018; Volkov, 2016) has established that modern GPUs use a policy close to Greedy-Then-Oldest (GTO):

  1. The scheduler preferentially re-issues from the same warp that issued last cycle, if that warp is still ready. This improves temporal locality — consecutive instructions from the same warp likely access the same registers and cache lines.
  2. If the current warp stalls, the scheduler selects the oldest ready warp (the one that has waited longest since its last issue).

GTO is a good fit for GPU workloads because:

  • Issuing consecutive instructions from the same warp maximizes register reuse and reduces operand collector pressure.
  • Falling back to oldest-first prevents starvation — no warp waits indefinitely.
// GTO scheduling trace (4 warps, 1 scheduler)
Cycle 1: Warp 0 issues (GTO: prefer current)
Cycle 2: Warp 0 issues (still ready, GTO stays)
Cycle 3: Warp 0 issues
Cycle 4: Warp 0 stalls (scoreboard: waiting on LOAD)
          → Select oldest ready → Warp 1 (waited 3 cycles)
Cycle 5: Warp 1 issues (GTO: prefer current = Warp 1)
Cycle 6: Warp 1 issues
Cycle 7: Warp 1 stalls → Select oldest → Warp 2
...
ℹ️ Scheduling Policy Varies by Architecture

Volta and Turing use an independent thread scheduling model where threads within a warp can diverge and reconverge at fine granularity. The scheduler on these architectures tracks per-thread rather than per-warp state in certain situations. Hopper refines this further with warp group scheduling for WGMMA instructions, where 4 warps (128 threads) are scheduled as a unit for tensor core operations.

Round-Robin vs GTO: Why It Matters

Consider a kernel where each warp performs a sequence: LOAD (500 cycles), then 10 FMAs (1 cycle each), then STORE (500 cycles). With 8 warps:

Round-Robin: Issues from Warp 0, then Warp 1, then Warp 2, … Each warp gets one instruction, then waits 7 cycles before its next turn. The 10 FMAs that could execute back-to-back are spread across 70 cycles.

GTO: Issues all 10 FMAs from Warp 0 in 10 consecutive cycles, then hits the STORE and switches. The FMA sequence completes 7x faster for each warp, and the register file has better locality because the same warp’s registers are accessed repeatedly.

Instruction Latency and Throughput

Latency vs Throughput

Every instruction has two characteristics:

  • Latency: Cycles from issue to result availability (when dependent instructions can use the output).
  • Throughput: How many operations of this type the execution unit can start per cycle.
📊 Instruction Latencies and Throughputs (Hopper, SM 9.0)

Instruction                 | Latency (cycles) | Throughput (per SM per cycle)  | Execution Unit
FP32 FMA                    | 4                | 128 (4 partitions × 32 cores)  | FP32 pipe
FP64 FMA                    | 8                | 64 (4 × 16 cores)              | FP64 pipe
INT32 ADD/MUL               | 4                | 64 (4 × 16 cores)              | INT32 pipe
Shared memory load          | 20-30            | 128 B/cycle per partition      | LD/ST
L1 cache hit                | ~33              | 128 B/cycle per partition      | LD/ST
L2 cache hit                | ~200             | 32 B/cycle to L2               | Memory subsystem
HBM (global memory)         | ~500             | 3,350 GB/s peak                | Memory subsystem
SFU (sin, cos, rsqrt)       | 8                | 16 (4 × 4 SFUs)                | SFU
Tensor core HMMA (Ampere)   | 16-32            | 4 per SM (1 per partition)     | Tensor core
Tensor core WGMMA (Hopper)  | ~64              | 4 per SM (warp-group)          | Tensor core
Note: Latency determines how many warps are needed to hide a stall. Throughput determines peak sustained performance.

The Latency Hiding Equation

To fully hide the latency of an instruction, you need enough independent work (from other warps) to keep the execution unit busy during the stall:

warps_needed = instruction_latency / instruction_throughput_per_warp

For HBM loads with ~500-cycle latency, where each warp can issue one load instruction per cycle to the LD/ST unit:

warps_needed = 500 / 1 = 500 warp-cycles of work

With each warp producing ~10 instructions between loads:

warps_needed = 500 / 10 = 50 warps

But each partition holds at most 16 warps (64 per SM across 4 partitions). This means the instructions between loads must amount to at least 500 / 16 ≈ 31 cycles of useful work per warp. If your kernel only does 5 arithmetic instructions between loads, you need 500 / 5 = 100 warps — far more than the hardware supports. This is the fundamental tension: memory-bound kernels cannot fully hide HBM latency at any occupancy level.

The Occupancy Saturation Point

For compute-bound kernels (high arithmetic intensity), full latency hiding typically requires 25-50% occupancy. For memory-bound kernels, even 100% occupancy often cannot fully hide HBM latency. The practical target is the “occupancy knee” — the point where increasing occupancy yields diminishing performance returns. In Nsight Compute this shows up as performance plateauing while sm__warps_active.avg climbs toward its sustained peak (sm__warps_active.avg.pct_of_peak_sustained_active).

Warp Stall Reasons

What Nsight Compute Reports

When a warp scheduler finds no eligible warp to issue, that cycle is a stall cycle. Nsight Compute categorizes stall reasons:

// Example kernel for demonstrating stall reasons
__global__ void stall_demo(float* __restrict__ out,
                           const float* __restrict__ in,
                           int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (idx < n);  // no early return: every thread must reach the barrier

    float val = active ? in[idx] : 0.0f;  // Potential: stall_long_scoreboard (HBM load)
    val = sinf(val);                      // Potential: stall_math_pipe (SFU busy)
    val = val * val + 1.0f;               // Low latency, likely no stall
    __syncthreads();                      // Potential: stall_barrier
    if (active) out[idx] = val;           // Potential: stall_long_scoreboard (store)
}
📊 Warp Stall Reasons in Nsight Compute

Stall Reason           | Meaning                                                 | Typical Cause
stall_long_scoreboard  | Waiting for a long-latency operation (L2/HBM load)      | Memory-bound kernel, poor caching
stall_short_scoreboard | Waiting for a short-latency operation (shared mem, L1)  | Shared memory bank conflicts
stall_math_pipe        | Math execution pipe is full                             | Compute-bound, back-to-back FMAs
stall_barrier          | Waiting at __syncthreads()                              | Load imbalance between warps in a block
stall_membar           | Waiting for memory fence to complete                    | Excessive __threadfence() usage
stall_not_selected     | Warp was eligible but scheduler chose another           | Normal in high-occupancy scenarios
stall_mio_throttle     | Memory instruction queue is full                        | Too many outstanding loads
stall_tex_throttle     | Texture unit queue is full                              | Heavy texture/surface usage
stall_lg_throttle      | Local/global memory queue is full                       | Register spills to local memory
stall_dispatch         | Instruction dispatch stall                              | Rare, usually hardware pipeline bubbles
Note: stall_long_scoreboard dominating (above 30%) means the kernel is memory-bound. stall_math_pipe dominating means compute-bound.

Profiling Stall Reasons

# Profile warp stall reasons with Nsight Compute
ncu --metrics \
  smsp__warps_issue_stalled_long_scoreboard_per_issue_active.ratio,\
  smsp__warps_issue_stalled_short_scoreboard_per_issue_active.ratio,\
  smsp__warps_issue_stalled_math_pipe_throttle_per_issue_active.ratio,\
  smsp__warps_issue_stalled_barrier_per_issue_active.ratio,\
  smsp__warps_issue_stalled_not_selected_per_issue_active.ratio,\
  smsp__warps_issue_stalled_mio_throttle_per_issue_active.ratio \
  ./my_kernel
# Output (example):
# smsp__warps_issue_stalled_long_scoreboard_per_issue_active.ratio    47.2%
# smsp__warps_issue_stalled_not_selected_per_issue_active.ratio       23.1%
# smsp__warps_issue_stalled_math_pipe_throttle_per_issue_active.ratio 15.3%
# smsp__warps_issue_stalled_barrier_per_issue_active.ratio             8.4%
# smsp__warps_issue_stalled_short_scoreboard_per_issue_active.ratio    3.8%
# smsp__warps_issue_stalled_mio_throttle_per_issue_active.ratio        2.2%

In this example, 47.2% of issue slots are lost to long scoreboard stalls (HBM latency). The optimization path: reduce memory traffic (tiling, caching) or increase arithmetic intensity (fuse more computation per load).

Dual-Issue and Instruction-Level Parallelism

Historical: Dual-Issue on Kepler

Kepler (SM 3.5) could issue two instructions per cycle from the same warp if they used different execution units — for example, an FP32 FMA and a load in the same cycle. This was called dual-issue or instruction-level parallelism (ILP) at the scheduler level.

Modern GPUs: Single-Issue with Wider Execution

Volta and later architectures removed dual-issue in favor of wider execution units. Instead of issuing two instructions per cycle from one warp, a single instruction operates on 32 threads simultaneously. The throughput is maintained by having four independent scheduler partitions per SM.

// Kepler dual-issue (historical):
Cycle 1: Warp 0 → FMA R4, R1, R2  AND  LOAD R5, [addr]  (two instructions)

// Hopper single-issue (current):
Cycle 1: Partition 0, Warp 0 → FMA R4, R1, R2    (one instruction)
Cycle 1: Partition 1, Warp 4 → LOAD R5, [addr]    (one instruction, different partition)
Cycle 1: Partition 2, Warp 8 → FMA R6, R3, R7    (one instruction, different partition)
Cycle 1: Partition 3, Warp 12 → IADD R8, R9, R10  (one instruction, different partition)

The net throughput is the same (4 instructions per SM per cycle), but the single-issue model simplifies the hardware and reduces power consumption.

Instruction-Level Parallelism Within a Warp

Even though the scheduler issues one instruction per cycle, ILP within a warp still matters. The compiler reorders instructions to maximize the gap between a write and its dependent read:

// Before compiler scheduling (pseudocode):
R4 = LOAD [A + idx]     // Cycle 1: issue, ~500 cycle latency
R5 = R4 * R1            // Cycle 2: STALL — R4 not ready
R6 = LOAD [B + idx]     // Must wait for R5 to complete
R7 = R6 + R2

// After compiler scheduling (ptxas reorders):
R4 = LOAD [A + idx]     // Cycle 1: issue
R6 = LOAD [B + idx]     // Cycle 2: issue (independent of R4)
R8 = R1 + R2            // Cycle 3: issue (independent of R4, R6)
R9 = R3 * R3            // Cycle 4: issue (independent)
...                      // More independent instructions
R5 = R4 * R1            // Cycle N: R4 now ready (fewer stall cycles)
R7 = R6 + R2            // Cycle N+1: R6 now ready

The compiler converts data-level parallelism (independent loads) into instruction-level parallelism, reducing the number of cycles each warp stalls.

Warp Scheduling for Tensor Core Operations

HMMA on Ampere

On Ampere (SM 8.0), tensor core operations use the HMMA (Half-precision Matrix Multiply-Accumulate) instruction. Each HMMA operates on a single warp (32 threads) and computes a 16x8x16 tile:

// Ampere HMMA: one warp computes m16n8k16 tile
// Fragments spread across 32 threads
// Scheduler issues HMMA, takes ~16 cycles to complete
// During those 16 cycles, the scheduler issues from other warps

The scheduler treats HMMA like any other multi-cycle instruction: it issues it, marks the destination registers as pending, and moves on to other warps.

WGMMA on Hopper

Hopper introduces Warp Group Matrix Multiply-Accumulate (WGMMA), which operates on a warp group (4 warps, 128 threads). The scheduler coordinates all 4 warps in the group to collectively execute a larger matrix operation (up to m64n256k16).

// Hopper WGMMA: 4 warps (128 threads) compute m64n256k16
// The scheduler treats the warp group as a scheduling unit
// All 4 warps in the group are committed to the WGMMA for ~64 cycles
// Other warp groups on the same SM can be scheduled concurrently
⚠️ WGMMA and Occupancy

WGMMA occupies 4 warps for the duration of the operation (~64 cycles). With 64 warps maximum per SM, you can have at most 16 warp groups. If WGMMA latency is 64 cycles and the scheduler needs to keep the tensor core busy, you need at least 64 / 1 = 64 warp-cycles of other work — meaning at least 2 warp groups ready to issue. In practice, 4-6 warp groups (16-24 warps, 25-38% occupancy) suffice for tensor core-heavy kernels.

TMA and Asynchronous Scheduling

Hopper’s Tensor Memory Accelerator (TMA) further changes scheduling dynamics. TMA operations are issued by the scheduler but execute independently in the memory subsystem:

// Asynchronous copies (cp.async on Ampere; TMA bulk copies on Hopper)
// move data from global memory into shared memory without staging it
// through registers, so no scoreboard entry is created for the data.
// __pipeline_memcpy_async is the cp.async-style primitive from
// cuda_pipeline_primitives.h; TMA bulk tensor copies use a separate path.
__pipeline_memcpy_async(
    shared_ptr,   // destination in shared memory
    global_ptr,   // source in global memory
    sizeof(tile)  // size in bytes
);
__pipeline_commit();
// The warp can immediately proceed to other instructions.
// Completion is observed via __pipeline_wait_prior(0) or a shared-memory
// barrier, not a register write.

This means the scheduler does not see a scoreboard dependency for TMA loads — the warp can continue issuing instructions until it explicitly waits on the TMA completion barrier. This fundamentally changes the latency hiding equation because loads no longer occupy scoreboard slots.

Quantifying Latency Hiding

The Occupancy Model

Let LL be the latency of the dominant memory operation (cycles), NN be the number of independent instructions a warp can execute between stalls, and WW be the number of resident warps per partition:

utilization = min(1, (W × N) / L)

If W × N ≥ L, the execution unit is fully utilized (no idle cycles). If W × N < L, the scheduler cannot find enough work, and cycles are wasted.

For an HBM-bound kernel on H100:

  • L = 500 cycles (HBM latency)
  • N = 8 instructions between loads (typical for a streaming kernel)
  • W = 16 warps per partition (maximum occupancy)

utilization = min(1, 16 × 8 / 500) = min(1, 0.256) = 25.6%

This kernel wastes 74.4% of cycles waiting on memory. The only solutions: increase N (more computation per load — an algorithm change) or reduce L (use the L2/L1 cache or shared memory — tiling).

Execution Unit Utilization vs Resident Warps (HBM-Bound Kernel, N = 8)

Warps per partition        | Utilization
4 warps (25% occ.)         | 4 × 8 / 500 = 6.4%
8 warps (50% occ.)         | 8 × 8 / 500 = 12.8%
12 warps (75% occ.)        | 12 × 8 / 500 = 19.2%
16 warps (100% occ.)       | 16 × 8 / 500 = 25.6%
16 warps, N = 20 (tiled)   | 16 × 20 / 500 = 64%
16 warps, L1 hit (L = 33)  | min(1, 16 × 8 / 33) = 100%

Practical Example: SAXPY

The SAXPY kernel y[i] = a * x[i] + y[i] loads 2 floats (8 bytes), performs 2 FLOPs (1 multiply, 1 add), and stores 1 float (4 bytes). Arithmetic intensity: 2 / 12 ≈ 0.167 FLOPs/byte.

__global__ void saxpy(float a, const float* __restrict__ x,
                      float* __restrict__ y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        y[idx] = a * x[idx] + y[idx];
    }
}

Between loads, each thread executes 2 floating-point operations, which the compiler fuses into a single FMA. That is 1 instruction between loads: N = 1.

utilization = min(1, 16 × 1 / 500) = 3.2%

SAXPY is fundamentally memory-bound. The 96.8% of cycles where the FP32 units are idle is not a flaw — it reflects the algorithm’s low arithmetic intensity. The performance ceiling is HBM bandwidth (3,350 GB/s on H100 SXM), not compute throughput.

# Profile SAXPY to confirm memory-bound behavior
ncu --set full -k saxpy ./saxpy_benchmark
# Expected: sm__sass_thread_inst_executed_op_memory_percentage > 60%
# Expected: stall_long_scoreboard > 50%
# Expected: achieved bandwidth close to 3,350 GB/s

Case Study: Matrix Multiplication Tiling

Contrast with tiled matrix multiplication, where each thread block loads a tile into shared memory, then performs N = TILE_SIZE FMA operations per loaded element:

// Tile size 32: each loaded element is used in 32 FMAs
// N = 32 instructions between loads (shared memory, L ≈ 30 cycles)
// utilization = min(1, 16 * 32 / 30) = min(1, 17.1) = 100%

With tiling, the scheduler has more than enough work to hide shared memory latency. The kernel becomes compute-bound, and the stall profile shifts from stall_long_scoreboard to stall_math_pipe.

Instruction Replay

What Causes Replays

An instruction replay occurs when a warp issues an instruction but the instruction cannot complete on its first attempt. The scheduler must re-issue it. Common causes:

  • Uncoalesced memory access: A 32-thread warp accesses addresses that span multiple 128-byte cache lines. The hardware issues one transaction per cache line, requiring multiple passes.
  • Shared memory bank conflicts: Two threads in the same warp access the same shared memory bank (but different addresses). The hardware serializes conflicting accesses.
  • Predicated instructions on divergent warps: Threads that fail the predicate are masked, but if the execution requires multiple passes (e.g., divergent atomic), replays occur.
// Example: uncoalesced access causing replays
__global__ void strided_access(float* data, int stride) {
    int idx = threadIdx.x * stride;  // stride > 1 causes non-coalesced access
    data[idx] = data[idx] + 1.0f;
}
// stride=1: 1 transaction (128 bytes, 32 x 4-byte floats)
// stride=2: 2 transactions (threads span 256 bytes)
// stride=32: 32 transactions (each thread's data is in a different cache line)
// Each extra transaction is a replay of the load instruction

Measuring Replays

ncu --metrics \
  l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,\
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum \
  ./strided_kernel
# sectors / requests > 1 indicates replays
# Ideal: 4 sectors per request (128B line / 32B sector)
# stride=32: 128 sectors per request (32x replay)

Replays are invisible at the CUDA source level but directly consume scheduler issue slots. A warp that replays 32 times for a single load instruction occupies the scheduler for 32 cycles instead of 1. This reduces the effective occupancy seen by other warps.

🚨 Replays Are Hidden Throughput Killers

A kernel may show 100% occupancy and zero scoreboard stalls, yet achieve only 20% of peak bandwidth. The cause is often replays from uncoalesced accesses or bank conflicts. Nsight Compute’s “Memory Workload Analysis” section shows the replay ratio. An ideal replay ratio is 1.0 (no replays). Ratios above 2.0 indicate significant replay overhead.

Warp Divergence and the Scheduler

Branch Divergence Mechanics

When threads within a warp take different paths at a branch, the warp must execute both paths — first one side, then the other. The scheduler issues the instruction stream for the taken path (with a thread mask disabling threads that did not take the branch), then issues the not-taken path.

__global__ void divergent_kernel(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (in[idx] > 0.0f) {
            out[idx] = sqrtf(in[idx]);  // Path A: SFU instruction
        } else {
            out[idx] = -in[idx] * in[idx];  // Path B: FP32 multiply
        }
    }
}

If 16 threads take Path A and 16 take Path B, the scheduler issues:

  1. Path A instructions with a 16-thread mask (16 threads active, 16 masked)
  2. Path B instructions with a 16-thread mask (the other 16 active)

The warp takes the time of both paths combined, but each path runs with only part of the warp active (16 of 32 lanes in this example). For equal-length paths, any divergence wastes 50% of the warp’s lane-cycles — a 1/31 split is no cheaper than a 16/16 split, because the whole warp still steps through both instruction streams.

Volta’s Independent Thread Scheduling

Starting with Volta, NVIDIA introduced independent thread scheduling. The hardware maintains a per-thread program counter and call stack, allowing threads within a warp to diverge and execute independently at sub-warp granularity:

// Pre-Volta: full-warp reconvergence at branch join point
// Volta+: threads reconverge opportunistically, can interleave paths

// This enables patterns like:
__global__ void cooperative_kernel(int* lock) {
    // Thread 0 acquires lock, other threads spin
    // Pre-Volta: DEADLOCK (all threads in warp must execute same instruction)
    // Volta+: Thread 0 proceeds, others spin independently
    if (threadIdx.x == 0) {
        while (atomicCAS(lock, 0, 1) != 0) {}  // acquire
        // critical section
        atomicExch(lock, 0);                     // release
    }
    __syncwarp();
}
ℹ️ Independent Thread Scheduling Cost

Independent thread scheduling adds overhead: the hardware must track per-thread state and perform more complex convergence analysis. In practice, the overhead is small (a few percent) and is offset by the ability to execute more complex algorithms without deadlock. The compiler inserts __syncwarp() at reconvergence points to explicitly synchronize threads within a warp.

Practical Optimization Strategies

Strategy 1: Increase Arithmetic Intensity

The most effective way to improve scheduler utilization on memory-bound kernels is to increase the number of compute instructions between memory accesses:

// Before: 1 FMA per load (arithmetic intensity = 0.167 FLOP/byte)
y[i] = a * x[i] + y[i];

// After: loop tiling, each element loaded once, used TILE_SIZE times
// Load tile of A and B into shared memory
// Each element participates in TILE_SIZE FMAs
// Arithmetic intensity = TILE_SIZE * 2 / (2 * sizeof(float))
// TILE_SIZE=32: intensity = 64/8 = 8.0 FLOP/byte

Strategy 2: Prefetch to Hide Latency

Software prefetching starts loads early, before the data is needed. The scheduler can issue other instructions while the prefetch completes:

__global__ void prefetch_kernel(const float* __restrict__ in,
                                 float* __restrict__ out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Prefetch first element
    float next = (idx < n) ? in[idx] : 0.0f;

    for (int i = idx; i < n; i += stride) {
        float curr = next;
        // Prefetch next iteration's data while processing current
        if (i + stride < n) {
            next = in[i + stride];  // This load overlaps with computation below
        }
        // Compute on current data (these instructions hide the prefetch latency)
        float result = curr * curr;
        result = result + curr;
        result = sqrtf(result);
        out[i] = result;
    }
}

Strategy 3: Use Asynchronous Copy (Ampere cp.async, Hopper TMA)

#include <cuda/pipeline>

__global__ void async_copy_kernel(const float* gmem, float* out, int n) {
    __shared__ float smem[256];
    auto pipe = cuda::make_pipeline();

    // Stage 1: async copy from global to shared (cp.async on Ampere,
    // TMA path on Hopper; the data never transits registers)
    pipe.producer_acquire();
    cuda::memcpy_async(smem + threadIdx.x, gmem + blockIdx.x * 256 + threadIdx.x,
                       sizeof(float), pipe);
    pipe.producer_commit();

    // Stage 2: compute on previously loaded data (overlaps with Stage 1 copy)
    // ... computation here runs while copy completes ...

    // Stage 3: wait for copy to complete
    pipe.consumer_wait();

    // Stage 4: use the data
    out[blockIdx.x * 256 + threadIdx.x] = smem[threadIdx.x] * 2.0f;

    pipe.consumer_release();
}

Strategy 4: Tune Block Size for Occupancy

// Query occupancy for different block sizes
int min_grid_size, block_size;
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                    my_kernel, 0, 0);
// block_size is chosen to maximize occupancy given the kernel's register usage

// Or manually query:
int max_blocks_per_sm;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                               my_kernel, 256, 0);
printf("Max blocks per SM with 256 threads: %d\n", max_blocks_per_sm);
printf("Occupancy: %.1f%%\n",
       100.0f * max_blocks_per_sm * 256 / 2048);  // 2048 max threads per SM on Hopper (CC 9.0)

GEMM Kernel Performance vs Occupancy (H100)

Configuration              | % of peak TFLOPS | Notes
16 regs/thread, 100% occ.  | 72%              | High occupancy but register spills
32 regs/thread, 67% occ.   | 91%              | Good balance, no spills
64 regs/thread, 50% occ.   | 95%              | Optimal: data in registers
128 regs/thread, 25% occ.  | 88%              | Not enough warps for latency hiding
255 regs/thread, 13% occ.  | 61%              | Severe under-occupancy

Nsight Compute Warp Scheduler Section

Key Metrics to Monitor

# Comprehensive scheduler profiling
ncu --metrics \
  smsp__issue_active.avg.pct_of_peak_sustained_active,\
  smsp__inst_executed.avg.per_cycle_active,\
  smsp__warps_active.avg,\
  smsp__warps_eligible.avg.per_cycle_active,\
  smsp__warps_active.avg.pct_of_peak_sustained_active \
  -k my_kernel ./my_app

# Interpretation:
# issue_active: % of cycles where at least one instruction was issued
# inst_per_cycle: average instructions issued per active cycle
# warps_active: average number of warps resident on each SM
# warps_eligible: average number of warps eligible for issue per cycle
# If warps_eligible << warps_active, most warps are stalled
📊 Scheduler Health Indicators

Metric                         | Healthy Value | Unhealthy Value | Action
issue_active                   | Above 80%     | Below 50%       | Increase occupancy or ILP
warps_eligible / warps_active  | Above 0.5     | Below 0.1       | Reduce memory stalls (tiling, caching)
inst_per_cycle                 | Close to 1.0  | Below 0.5       | Scheduler cannot find work — increase parallelism
achieved_occupancy             | Above 50%     | Below 25%       | Reduce register usage or shared memory
stall_long_scoreboard %        | Below 20%     | Above 40%       | Kernel is memory-bound — tile or prefetch
replay_ratio                   | 1.0           | Above 2.0       | Fix uncoalesced accesses or bank conflicts
Note: These thresholds are guidelines for typical HPC/AI kernels. Bandwidth-limited streaming kernels (like SAXPY) may show 'unhealthy' values that are inherent to the algorithm, not fixable through scheduling optimization.

Summary: The Scheduling Hierarchy

The GPU scheduling hierarchy operates at three levels:

  1. Grid level: The GigaThread engine distributes thread blocks to SMs. Blocks are assigned to SMs that have sufficient resources (registers, shared memory, warp slots).

  2. SM level: Each SM has 4 warp schedulers (one per partition). Warps from multiple blocks are interleaved across partitions.

  3. Warp level: Each scheduler selects one warp per cycle using GTO policy. The scoreboard tracks register dependencies. Stalled warps are skipped at zero cost.

The entire system is designed around one principle: keep execution units busy by maintaining a large pool of independent work. When a warp stalls on memory (500 cycles), up to 15 other warps in that partition can fill the gap. When a block hits a barrier, other blocks on the same SM continue. When a kernel underutilizes one SM, the GigaThread engine launches new blocks on remaining SMs.

The practical consequence for kernel developers: measure stall reasons with Nsight Compute, identify whether the kernel is memory-bound or compute-bound, and apply the appropriate strategy — tiling and caching for memory-bound kernels, reducing occupancy constraints for compute-bound kernels, and always measuring the actual utilization against the theoretical model.