An H100 SM partition contains a single warp scheduler that issues one instruction per clock cycle. With four partitions per SM, each SM issues at most four instructions per cycle. At 1.83 GHz boost clock, that is roughly 7.3 billion instructions per second per SM, or roughly 966 billion instructions per second across the full 132-SM GPU. The scheduler’s job is to select, every single cycle, which warp to run next, from a pool of up to 16 resident warps per partition. When a warp stalls on a memory load that takes 500 cycles, the scheduler immediately switches to another ready warp at zero cost. This is the entire GPU execution model: hide latency by having far more work in flight than execution units.
This post dissects the warp scheduler microarchitecture, the scoreboard mechanism that tracks instruction dependencies, the scheduling policy, dual-issue rules on older architectures, the common stall reasons that Nsight Compute reports, and the arithmetic for determining how many warps you actually need to fully hide memory latency.
The Warp Scheduler Pipeline
From Instruction Fetch to Execution
Each warp scheduler operates a multi-stage pipeline:
- Instruction fetch: The scheduler reads the next instruction from the I-cache using the warp’s program counter (PC).
- Decode: The instruction is decoded to determine operands, execution unit, and latency.
- Scoreboard check: The scheduler verifies that all source operands are available (not pending from a prior instruction).
- Operand collection: Register values are read from the register file.
- Dispatch: The instruction is sent to the appropriate execution unit (INT32, FP32, FP64, tensor core, LD/ST, SFU).
The scheduler can issue one new instruction every cycle, but the instruction it issues may take multiple cycles to produce its result. The key invariant is that the scheduler never stalls itself — if the current warp is not ready, it picks a different warp.
Cycle 1: Scheduler checks Warp 0 → READY → Issues FMA to FP32 pipe
Cycle 2: Scheduler checks Warp 1 → READY → Issues LOAD to LD/ST
Cycle 3: Scheduler checks Warp 2 → READY → Issues FMA to FP32 pipe
Cycle 4: Scheduler checks Warp 0 → STALLED (waiting for prior result) → skip
Scheduler checks Warp 3 → READY → Issues SIN to SFU
...
Cycle 502: Warp 1's LOAD completes → Warp 1 becomes READY again
The scheduler examines warps in priority order, not round-robin. If Warp 0 becomes ready again before Warp 3, the scheduler issues from Warp 0.
The Scoreboard
The scoreboard is a hardware structure that tracks which registers have pending writes. When an instruction is issued, the destination register is marked “pending” in the scoreboard. When the instruction completes (the result is written back to the register file), the pending bit is cleared.
A warp is eligible for issue only when:
- Its next instruction’s source registers have no pending writes (all operands available)
- The required execution unit (FP32, LD/ST, SFU, tensor core) is not fully occupied
- The warp is not waiting on a barrier (__syncthreads(), __syncwarp())
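These eligibility rules can be sketched as a toy scoreboard (an illustrative Python model, not the hardware implementation; the class and register names are invented for the example):

```python
# Toy scoreboard: tracks registers with pending writes and answers
# the RAW-hazard question the hardware asks before issuing.
class Scoreboard:
    def __init__(self):
        self.pending = set()  # registers with an in-flight write

    def eligible(self, srcs):
        # A warp's next instruction may issue only if none of its
        # source registers is still waiting on a prior result.
        return not (set(srcs) & self.pending)

    def issue(self, dst):
        self.pending.add(dst)      # destination marked PENDING at issue

    def complete(self, dst):
        self.pending.discard(dst)  # writeback clears the pending bit

sb = Scoreboard()
sb.issue("R4")                      # R4 = LOAD [addr], ~500 cycles in flight
print(sb.eligible(["R1", "R2"]))    # True: independent ADD can issue
print(sb.eligible(["R4", "R3"]))    # False: RAW hazard on R4, warp stalls
sb.complete("R4")                   # LOAD writes back
print(sb.eligible(["R4", "R3"]))    # True: dependent FMA can now issue
```

The hardware performs this check for every resident warp every cycle, in parallel.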
Scoreboard Stall Example
| Cycle | Action | Scoreboard State for Warp 0 |
|---|---|---|
| 1 | Issue: R4 = LOAD [addr] | R4 = PENDING |
| 2 | Issue: R5 = R1 + R2 (no dependency) | R4 = PENDING, R5 = PENDING |
| 3 | Check: R6 = R4 * R3 → R4 PENDING → STALL | R4 = PENDING, R5 = PENDING |
| 4-500 | Warp 0 stalled. Other warps issue. | R4 = PENDING |
| 501 | LOAD completes, R4 written back | R4 = done |
| 502 | Issue: R6 = R4 * R3 → R4 ready → GO | R6 = PENDING |
Dependency Tracking Granularity
The scoreboard tracks dependencies at the register level, not the instruction level. If instruction A writes R4 and instruction B reads only R7, B can issue immediately behind A with no pipeline bubble. The only stall condition is a true data dependency: a read-after-write (RAW) hazard on the same register.
Write-after-read (WAR) and write-after-write (WAW) hazards are handled by the in-order nature of warp execution. Within a single warp, instructions issue in program order. The compiler (ptxas) schedules instructions to maximize the distance between a write and its dependent read — this is the primary optimization the CUDA compiler performs at the SASS level.
Scheduling Policy
Greedy-Then-Oldest (GTO)
NVIDIA does not publish the exact scheduling policy, but extensive microbenchmarking (Jia et al., 2018; Volkov, 2016) has established that modern GPUs use a policy close to Greedy-Then-Oldest (GTO):
- The scheduler preferentially re-issues from the same warp that issued last cycle, if that warp is still ready. This improves temporal locality — consecutive instructions from the same warp likely access the same registers and cache lines.
- If the current warp stalls, the scheduler selects the oldest ready warp (the one that has waited longest since its last issue).
GTO is a good fit for GPU workloads because:
- Issuing consecutive instructions from the same warp maximizes register reuse and reduces operand collector pressure.
- Falling back to oldest-first prevents starvation — no warp waits indefinitely.
// GTO scheduling trace (4 warps, 1 scheduler)
Cycle 1: Warp 0 issues (GTO: prefer current)
Cycle 2: Warp 0 issues (still ready, GTO stays)
Cycle 3: Warp 0 issues
Cycle 4: Warp 0 stalls (scoreboard: waiting on LOAD)
→ Select oldest ready → Warp 1 (waited 3 cycles)
Cycle 5: Warp 1 issues (GTO: prefer current = Warp 1)
Cycle 6: Warp 1 issues
Cycle 7: Warp 1 stalls → Select oldest → Warp 2
...
Volta and later architectures use an independent thread scheduling model where threads within a warp can diverge and reconverge at fine granularity. The scheduler on these architectures tracks per-thread rather than per-warp state in certain situations. Hopper refines this further with warp group scheduling for WGMMA instructions, where 4 warps (128 threads) are scheduled as a unit for tensor core operations.
Round-Robin vs GTO: Why It Matters
Consider a kernel where each warp performs a sequence: LOAD (500 cycles), then 10 FMAs (1 cycle each), then STORE (500 cycles). With 8 warps:
Round-Robin: Issues from Warp 0, then Warp 1, then Warp 2, … Each warp gets one instruction, then waits 7 cycles before its next turn. The 10 FMAs that could execute back-to-back are spread across 70 cycles.
GTO: Issues all 10 FMAs from Warp 0 in 10 consecutive cycles, then hits the STORE and switches. The FMA sequence completes 7x faster for each warp, and the register file has better locality because the same warp’s registers are accessed repeatedly.
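A back-of-envelope model reproduces the gap, assuming every warp always has a ready instruction and counting only issue slots (the function and the 73-cycle figure come from this simplified model, not a hardware measurement):

```python
def span_of_dependent_chain(n_instr, n_warps, policy):
    # Cycles between the first and last issue of one warp's n_instr
    # back-to-back instructions under each scheduling policy.
    if policy == "gto":
        # GTO keeps re-issuing from the same ready warp.
        return n_instr
    # Round-robin gives each warp one slot per rotation of n_warps cycles.
    return (n_instr - 1) * n_warps + 1

print(span_of_dependent_chain(10, 8, "gto"))          # 10 cycles
print(span_of_dependent_chain(10, 8, "round_robin"))  # 73 cycles, ~7x slower
```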
Instruction Latency and Throughput
Latency vs Throughput
Every instruction has two characteristics:
- Latency: Cycles from issue to result availability (when dependent instructions can use the output).
- Throughput: How many operations of this type the execution unit can start per cycle.
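Little's law connects the two: to keep a unit saturated you need latency times throughput independent operations in flight. A quick sketch, using per-partition figures consistent with this section as assumptions:

```python
def inflight_needed(latency_cycles, ops_started_per_cycle):
    # Little's law: required concurrency = latency x throughput.
    return latency_cycles * ops_started_per_cycle

# FP32 FMA per partition: 4-cycle latency, 32 lanes starting per cycle
print(inflight_needed(4, 32))   # 128 independent FMAs in flight
# HBM load: ~500-cycle latency, one load issued per cycle
print(inflight_needed(500, 1))  # ~500 loads in flight per partition
```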
Instruction Latencies and Throughputs (Hopper SM 9.0)
| Instruction | Latency (cycles) | Throughput (per SM per cycle) | Execution Unit |
|---|---|---|---|
| FP32 FMA | 4 | 128 (4 partitions x 32 cores) | FP32 pipe |
| FP64 FMA | 8 | 64 (4 x 16 cores) | FP64 pipe |
| INT32 ADD/MUL | 4 | 64 (4 x 16 cores) | INT32 pipe |
| Shared memory load | 20-30 | 128 B/cycle per partition | LD/ST |
| L1 cache hit | ~33 | 128 B/cycle per partition | LD/ST |
| L2 cache hit | ~200 | 32 B/cycle to L2 | Memory subsystem |
| HBM (global memory) | ~500 | 3,350 GB/s peak | Memory subsystem |
| SFU (sin, cos, rsqrt) | 8 | 16 (4 x 4 SFUs) | SFU |
| Tensor core HMMA (Ampere) | 16-32 | 4 per SM (1 per partition) | Tensor core |
| Tensor core WGMMA (Hopper) | ~64 | 4 per SM (warp-group) | Tensor core |
The Latency Hiding Equation
To fully hide the latency of an instruction, you need enough independent work (from other warps) to keep the execution unit busy during the stall:
For HBM loads with ~500-cycle latency, where each warp can issue one load instruction per cycle to the LD/ST unit:
warps needed = latency / instructions per warp between loads
With each warp producing ~10 instructions between loads:
warps needed = 500 / 10 = 50 warps per partition
But each partition holds at most 16 warps (64 per SM across 4 partitions). This means you need the instructions between loads to amount to at least 500 / 16 ≈ 32 cycles of useful work per warp. If your kernel only does 5 arithmetic instructions between loads, you would need 500 / 5 = 100 warps, far more than the hardware supports. This is the fundamental tension: memory-bound kernels cannot fully hide HBM latency at any occupancy level.
For compute-bound kernels (high arithmetic intensity), full latency hiding typically requires 25-50% occupancy. For memory-bound kernels, even 100% occupancy often cannot fully hide HBM latency. The practical target is the “occupancy knee”: the point where increasing occupancy yields diminishing performance returns. Profiling with Nsight Compute reveals this as achieved occupancy (sm__warps_active.avg.pct_of_peak_sustained_active) continuing to rise while kernel throughput plateaus.
Warp Stall Reasons
What Nsight Compute Reports
When a warp scheduler finds no eligible warp to issue, that cycle is a stall cycle. Nsight Compute categorizes stall reasons:
// Example kernel for demonstrating stall reasons
__global__ void stall_demo(float* __restrict__ out,
const float* __restrict__ in,
int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
float val = in[idx]; // Potential: stall_long_scoreboard (HBM load)
val = sinf(val); // Potential: stall_math_pipe (SFU busy)
val = val * val + 1.0f; // Low latency, likely no stall
__syncthreads(); // Potential: stall_barrier
out[idx] = val; // Potential: stall_long_scoreboard (store)
}
Warp Stall Reasons in Nsight Compute
| Stall Reason | Meaning | Typical Cause |
|---|---|---|
| stall_long_scoreboard | Waiting for a long-latency operation (L2/HBM load) | Memory-bound kernel, poor caching |
| stall_short_scoreboard | Waiting for a short-latency operation (shared mem, L1) | Shared memory bank conflicts |
| stall_math_pipe | Math execution pipe is full | Compute-bound, back-to-back FMAs |
| stall_barrier | Waiting at __syncthreads() | Load imbalance between warps in a block |
| stall_membar | Waiting for memory fence to complete | Excessive __threadfence() usage |
| stall_not_selected | Warp was eligible but scheduler chose another | Normal in high-occupancy scenarios |
| stall_mio_throttle | Memory input/output (MIO) instruction queue is full | Heavy shared memory or SFU instruction traffic |
| stall_tex_throttle | Texture unit queue is full | Heavy texture/surface usage |
| stall_lg_throttle | Local/global memory queue is full | Register spills to local memory |
| stall_dispatch | Instruction dispatch stall | Rare, usually hardware pipeline bubbles |
Profiling Stall Reasons
# Profile warp stall reasons with Nsight Compute
ncu --metrics \
smsp__warps_issue_stalled_long_scoreboard_per_issue_active.ratio,\
smsp__warps_issue_stalled_short_scoreboard_per_issue_active.ratio,\
smsp__warps_issue_stalled_math_pipe_throttle_per_issue_active.ratio,\
smsp__warps_issue_stalled_barrier_per_issue_active.ratio,\
smsp__warps_issue_stalled_not_selected_per_issue_active.ratio,\
smsp__warps_issue_stalled_mio_throttle_per_issue_active.ratio \
./my_kernel
# Output (example):
# smsp__warps_issue_stalled_long_scoreboard_per_issue_active.ratio 47.2%
# smsp__warps_issue_stalled_not_selected_per_issue_active.ratio 23.1%
# smsp__warps_issue_stalled_math_pipe_throttle_per_issue_active.ratio 15.3%
# smsp__warps_issue_stalled_barrier_per_issue_active.ratio 8.4%
# smsp__warps_issue_stalled_short_scoreboard_per_issue_active.ratio 3.8%
# smsp__warps_issue_stalled_mio_throttle_per_issue_active.ratio 2.2%
In this example, 47.2% of issue slots are lost to long scoreboard stalls (HBM latency). The optimization path: reduce memory traffic (tiling, caching) or increase arithmetic intensity (fuse more computation per load).
Dual-Issue and Instruction-Level Parallelism
Historical: Dual-Issue on Kepler
Kepler (SM 3.5) could issue two instructions per cycle from the same warp if they used different execution units — for example, an FP32 FMA and a load in the same cycle. This was called dual-issue or instruction-level parallelism (ILP) at the scheduler level.
Modern GPUs: Single-Issue with Wider Execution
Volta and later architectures removed dual-issue in favor of a simpler single-issue scheduler per partition. Each scheduler issues one instruction per cycle, and aggregate throughput is maintained by the four independent scheduler partitions per SM.
// Kepler dual-issue (historical):
Cycle 1: Warp 0 → FMA R4, R1, R2 AND LOAD R5, [addr] (two instructions)
// Hopper single-issue (current):
Cycle 1: Partition 0, Warp 0 → FMA R4, R1, R2 (one instruction)
Cycle 1: Partition 1, Warp 4 → LOAD R5, [addr] (one instruction, different partition)
Cycle 1: Partition 2, Warp 8 → FMA R6, R3, R7 (one instruction, different partition)
Cycle 1: Partition 3, Warp 12 → IADD R8, R9, R10 (one instruction, different partition)
The net throughput is the same (4 instructions per SM per cycle), but the single-issue model simplifies the hardware and reduces power consumption.
Instruction-Level Parallelism Within a Warp
Even though the scheduler issues one instruction per cycle, ILP within a warp still matters. The compiler reorders instructions to maximize the gap between a write and its dependent read:
// Before compiler scheduling (pseudocode):
R4 = LOAD [A + idx] // Cycle 1: issue, ~500 cycle latency
R5 = R4 * R1 // Cycle 2: STALL — R4 not ready
R6 = LOAD [B + idx] // Must wait for R5 to complete
R7 = R6 + R2
// After compiler scheduling (ptxas reorders):
R4 = LOAD [A + idx] // Cycle 1: issue
R6 = LOAD [B + idx] // Cycle 2: issue (independent of R4)
R8 = R1 + R2 // Cycle 3: issue (independent of R4, R6)
R9 = R3 * R3 // Cycle 4: issue (independent)
... // More independent instructions
R5 = R4 * R1 // Cycle N: R4 now ready (fewer stall cycles)
R7 = R6 + R2 // Cycle N+1: R6 now ready
The compiler converts data-level parallelism (independent loads) into instruction-level parallelism, reducing the number of cycles each warp stalls.
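The payoff of this reordering can be modeled numerically. A hypothetical reduction with 4-cycle FMA latency, rewritten to use several independent accumulators, shortens the serial dependency chain (a simplified model that ignores issue-slot contention):

```python
def chain_cycles(n_ops, latency, accumulators=1):
    # Each accumulator forms an independent dependency chain; the chains
    # interleave, so the critical path is the longest single chain.
    ops_per_chain = -(-n_ops // accumulators)  # ceiling division
    return ops_per_chain * latency

print(chain_cycles(64, 4, accumulators=1))  # 256 cycles: one serial FMA chain
print(chain_cycles(64, 4, accumulators=4))  # 64 cycles: 4 independent chains
```

This is why ptxas, and hand-unrolled CUDA code with multiple accumulators, can keep a single warp busy even at low occupancy.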
Warp Scheduling for Tensor Core Operations
HMMA on Ampere
On Ampere (SM 8.0), tensor core operations use the HMMA (Half-precision Matrix Multiply-Accumulate) instruction. Each HMMA operates on a single warp (32 threads) and computes a 16x8x16 tile:
// Ampere HMMA: one warp computes m16n8k16 tile
// Fragments spread across 32 threads
// Scheduler issues HMMA, takes ~16 cycles to complete
// During those 16 cycles, the scheduler issues from other warps
The scheduler treats HMMA like any other multi-cycle instruction: it issues it, marks the destination registers as pending, and moves on to other warps.
WGMMA on Hopper
Hopper introduces Warp Group Matrix Multiply-Accumulate (WGMMA), which operates on a warp group (4 warps, 128 threads). The scheduler coordinates all 4 warps in the group to collectively execute a larger matrix operation (up to m64n256k16).
// Hopper WGMMA: 4 warps (128 threads) compute m64n256k16
// The scheduler treats the warp group as a scheduling unit
// All 4 warps in the group are committed to the WGMMA for ~64 cycles
// Other warp groups on the same SM can be scheduled concurrently
WGMMA consumes 4 warps for the duration of the operation (~64 cycles). With 64 warps maximum per SM, you can have at most 16 warp groups. If WGMMA latency is ~64 cycles and the scheduler needs to keep the tensor core busy, another warp group must be ready to issue its WGMMA the moment the current one retires, meaning at least 2 warp groups resident. In practice, 4-6 warp groups (16-24 warps, 25-38% occupancy) suffice for tensor core-heavy kernels.
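The warp-group arithmetic, under the assumption of the CC 9.0 limit of 64 resident warps per SM:

```python
WARPS_PER_GROUP = 4   # a WGMMA warp group is 4 warps (128 threads)
MAX_WARPS = 64        # CC 9.0: 2048 resident threads per SM

print(MAX_WARPS // WARPS_PER_GROUP)  # 16 warp groups maximum
# Practical range for tensor-core-heavy kernels: 4-6 groups resident
for groups in (2, 4, 6):
    warps = groups * WARPS_PER_GROUP
    print(groups, warps, f"{warps / MAX_WARPS:.0%}")
```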
TMA and Asynchronous Scheduling
Hopper’s Tensor Memory Accelerator (TMA) further changes scheduling dynamics. TMA operations are issued by the scheduler but execute independently in the memory subsystem:
// Asynchronous copy from global memory into shared memory
// without consuming LD/ST issue slots or scoreboard entries for intermediate registers
// (this intrinsic maps to cp.async; Hopper's TMA extends the idea to bulk tensor copies)
__pipeline_memcpy_async(
shared_ptr, // destination in shared memory
global_ptr, // source in global memory
sizeof(tile) // size
);
// The warp can immediately proceed to other instructions
// TMA completion is signaled via a barrier, not a register write
This means the scheduler does not see a scoreboard dependency for TMA loads — the warp can continue issuing instructions until it explicitly waits on the TMA completion barrier. This fundamentally changes the latency hiding equation because loads no longer occupy scoreboard slots.
Quantifying Latency Hiding
The Occupancy Model
Let L be the latency of the dominant memory operation (cycles), N be the number of independent instructions a warp can execute between stalls, and W be the number of resident warps per partition:
utilization = min(1, W × N / L)
If W × N ≥ L, the execution unit is fully utilized (no idle cycles). If W × N < L, the scheduler cannot find enough work, and a fraction (L − W × N) / L of cycles is wasted.
For an HBM-bound kernel on H100:
- L = 500 cycles (HBM latency)
- N = 8 instructions between loads (typical for a streaming kernel)
- W = 16 warps per partition (maximum occupancy)
utilization = min(1, 16 × 8 / 500) = 128 / 500 = 25.6%
This kernel wastes 74.4% of cycles waiting on memory. The only solutions: increase N (more computation per load, an algorithm change) or reduce L (use L2/L1 cache or shared memory, i.e., tiling).
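The model is easy to evaluate directly (a sketch of the utilization formula, taking 16 resident warps per partition and 500-cycle HBM latency as the assumptions):

```python
def utilization(warps, instrs_between_stalls, latency):
    # Fraction of issue slots a partition can fill while loads are in flight.
    return min(1.0, warps * instrs_between_stalls / latency)

print(utilization(16, 8, 500))   # 0.256: 25.6% busy, 74.4% of cycles wasted
print(utilization(16, 32, 500))  # 1.0: enough independent work to hide HBM
```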
[Figure: Execution unit utilization (%) vs resident warps, HBM-bound kernel with N=8]
Practical Example: SAXPY
The SAXPY kernel y[i] = a * x[i] + y[i] loads 2 floats (8 bytes), performs 2 FLOPs (1 multiply, 1 add), and stores 1 float (4 bytes). Arithmetic intensity: 2 / 12 ≈ 0.167 FLOPs/byte.
__global__ void saxpy(float a, const float* __restrict__ x,
float* __restrict__ y, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
y[idx] = a * x[idx] + y[idx];
}
}
Between loads, each thread executes a single FMA (the multiply and add fuse into 1 instruction), so N = 1 instruction between loads. utilization = min(1, 16 × 1 / 500) = 3.2%.
SAXPY is fundamentally memory-bound. The 96.8% of cycles where the FP32 units are idle is not a flaw: it reflects the algorithm’s low arithmetic intensity. The performance ceiling is HBM bandwidth (3,350 GB/s on H100), not compute throughput.
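Plugging the arithmetic intensity into a simple roofline bound makes the ceiling explicit (the ~67 TFLOP/s FP32 peak assumed here for H100 SXM is a nominal figure):

```python
def roofline_gflops(flops_per_byte, bandwidth_gb_s, peak_gflops):
    # Attainable throughput = min(compute peak, intensity x bandwidth).
    return min(peak_gflops, flops_per_byte * bandwidth_gb_s)

ai = 2 / 12  # SAXPY: 2 FLOPs per 12 bytes moved
print(round(ai, 3))                             # 0.167 FLOP/byte
print(round(roofline_gflops(ai, 3350, 67000)))  # ~558 GFLOP/s, under 1% of peak
```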
# Profile SAXPY to confirm memory-bound behavior
ncu --set full -k saxpy ./saxpy_benchmark
# Expected: dram__throughput.avg.pct_of_peak_sustained_elapsed > 80%
# Expected: stall_long_scoreboard > 50%
# Expected: achieved bandwidth close to 3,350 GB/s
Case Study: Matrix Multiplication Tiling
Contrast with tiled matrix multiplication, where each thread block loads a tile into shared memory, then performs FMA operations per loaded element:
// Tile size 32: each loaded element is used in 32 FMAs
// N = 32 instructions between loads (shared memory, L=30 cycles)
// utilization = min(1, 16 * 32 / 30) = min(1, 17.1) = 100%
With tiling, the scheduler has more than enough work to hide shared memory latency. The kernel becomes compute-bound, and the stall profile shifts from stall_long_scoreboard to stall_math_pipe.
Instruction Replay
What Causes Replays
An instruction replay occurs when a warp issues an instruction but the instruction cannot complete on its first attempt. The scheduler must re-issue it. Common causes:
- Uncoalesced memory access: A 32-thread warp accesses addresses that span multiple 128-byte cache lines. The hardware issues one transaction per cache line, requiring multiple passes.
- Shared memory bank conflicts: Two threads in the same warp access the same shared memory bank (but different addresses). The hardware serializes conflicting accesses.
- Predicated instructions on divergent warps: Threads that fail the predicate are masked, but if the execution requires multiple passes (e.g., divergent atomic), replays occur.
// Example: uncoalesced access causing replays
__global__ void strided_access(float* data, int stride) {
int idx = threadIdx.x * stride; // stride > 1 causes non-coalesced access
data[idx] = data[idx] + 1.0f;
}
// stride=1: 1 transaction (128 bytes, 32 x 4-byte floats)
// stride=2: 2 transactions (threads span 256 bytes)
// stride=32: 32 transactions (each thread's data is in a different cache line)
// Each extra transaction is a replay of the load instruction
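The transaction counts in these comments can be reproduced by counting the distinct 32-byte sectors one warp-wide load touches (a simplified model assuming a sector-aligned base pointer):

```python
def sectors_per_request(stride_elems, threads=32, elem_bytes=4, sector_bytes=32):
    # Each thread t loads element t * stride; count the distinct sectors hit.
    sectors = {(t * stride_elems * elem_bytes) // sector_bytes
               for t in range(threads)}
    return len(sectors)

print(sectors_per_request(1))   # 4: fully coalesced, 128 contiguous bytes
print(sectors_per_request(2))   # 8: the warp spans 256 bytes
print(sectors_per_request(32))  # 32: every thread hits its own sector
```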
Measuring Replays
ncu --metrics \
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum \
./strided_kernel
# sectors / requests > 1 indicates replays
# Ideal: 4 sectors per request (128B line / 32B sector)
# stride=32: 32 sectors per request (8x the ideal 4)
Replays are invisible at the CUDA source level but directly consume scheduler issue slots. A warp that replays 32 times for a single load instruction occupies the scheduler for 32 cycles instead of 1. This reduces the effective occupancy seen by other warps.
A kernel may show 100% occupancy and zero scoreboard stalls, yet achieve only 20% of peak bandwidth. The cause is often replays from uncoalesced accesses or bank conflicts. Nsight Compute’s “Memory Workload Analysis” section shows the replay ratio. An ideal replay ratio is 1.0 (no replays). Ratios above 2.0 indicate significant replay overhead.
Warp Divergence and the Scheduler
Branch Divergence Mechanics
When threads within a warp take different paths at a branch, the warp must execute both paths — first one side, then the other. The scheduler issues the instruction stream for the taken path (with a thread mask disabling threads that did not take the branch), then issues the not-taken path.
__global__ void divergent_kernel(float* out, const float* in, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
if (in[idx] > 0.0f) {
out[idx] = sqrtf(in[idx]); // Path A: SFU instruction
} else {
out[idx] = -in[idx] * in[idx]; // Path B: FP32 multiply
}
}
}
If 16 threads take Path A and 16 take Path B, the scheduler issues:
- Path A instructions with a 16-thread mask (16 threads active, 16 masked)
- Path B instructions with a 16-thread mask (the other 16 active)
The warp takes the time of both paths combined, but each path executes with some lanes masked. For two equal-cost paths, half of the warp’s lane-cycles are wasted regardless of how the threads split; the waste grows when one path is much longer than the other, or when more than two paths diverge.
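A small lane-utilization model makes this concrete (masked lanes still occupy issue slots while their path executes; the path costs are hypothetical cycle counts):

```python
def active_fraction(split, warp=32, path_cost=(1.0, 1.0)):
    # The warp runs both sides serially; useful work is the lane-cycles
    # where a thread is actually unmasked.
    a, b = split, warp - split
    cost_a, cost_b = path_cost
    total = (cost_a + cost_b) * warp   # lane-cycles the warp occupies
    useful = cost_a * a + cost_b * b   # lane-cycles doing real work
    return useful / total

print(active_fraction(16))  # 0.5: equal-cost paths waste half the lane-cycles
print(active_fraction(1))   # 0.5: same waste, regardless of the split
print(round(active_fraction(1, path_cost=(10.0, 1.0)), 3))  # 0.116: far worse
```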
Volta’s Independent Thread Scheduling
Starting with Volta, NVIDIA introduced independent thread scheduling. The hardware maintains a per-thread program counter and call stack, allowing threads within a warp to diverge and execute independently at sub-warp granularity:
// Pre-Volta: full-warp reconvergence at branch join point
// Volta+: threads reconverge opportunistically, can interleave paths
// This enables patterns like:
__global__ void cooperative_kernel(int* lock) {
// Thread 0 acquires lock, other threads spin
// Pre-Volta: DEADLOCK (all threads in warp must execute same instruction)
// Volta+: Thread 0 proceeds, others spin independently
if (threadIdx.x == 0) {
while (atomicCAS(lock, 0, 1) != 0) {} // acquire
// critical section
atomicExch(lock, 0); // release
}
__syncwarp();
}
Independent thread scheduling adds overhead: the hardware must track per-thread state and perform more complex convergence analysis. In practice, the overhead is small (a few percent) and is offset by the ability to execute more complex algorithms without deadlock. Code that assumes warp-synchronous execution must use explicit __syncwarp() at reconvergence points to synchronize threads within a warp.
Practical Optimization Strategies
Strategy 1: Increase Arithmetic Intensity
The most effective way to improve scheduler utilization on memory-bound kernels is to increase the number of compute instructions between memory accesses:
// Before: 1 FMA per load (arithmetic intensity = 0.167 FLOP/byte)
y[i] = a * x[i] + y[i];
// After: loop tiling, each element loaded once, used TILE_SIZE times
// Load tile of A and B into shared memory
// Each element participates in TILE_SIZE FMAs
// Arithmetic intensity = TILE_SIZE * 2 / (2 * sizeof(float))
// TILE_SIZE=32: intensity = 64/8 = 8.0 FLOP/byte
Strategy 2: Prefetch to Hide Latency
Software prefetching starts loads early, before the data is needed. The scheduler can issue other instructions while the prefetch completes:
__global__ void prefetch_kernel(const float* __restrict__ in,
float* __restrict__ out, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
// Prefetch first element
float next = (idx < n) ? in[idx] : 0.0f;
for (int i = idx; i < n; i += stride) {
float curr = next;
// Prefetch next iteration's data while processing current
if (i + stride < n) {
next = in[i + stride]; // This load overlaps with computation below
}
// Compute on current data (these instructions hide the prefetch latency)
float result = curr * curr;
result = result + curr;
result = sqrtf(result);
out[i] = result;
}
}
Strategy 3: Use Asynchronous Copy (Ampere and Hopper)
#include <cuda/pipeline>
__global__ void async_copy_kernel(const float* gmem, float* out, int n) {
__shared__ float smem[256];
auto pipe = cuda::make_pipeline();
// Stage 1: async copy from global to shared (completion is tracked by the pipeline, not the scoreboard)
pipe.producer_acquire();
cuda::memcpy_async(smem + threadIdx.x, gmem + blockIdx.x * 256 + threadIdx.x,
sizeof(float), pipe);
pipe.producer_commit();
// Stage 2: compute on previously loaded data (overlaps with Stage 1 copy)
// ... computation here runs while copy completes ...
// Stage 3: wait for copy to complete
pipe.consumer_wait();
// Stage 4: use the data
out[blockIdx.x * 256 + threadIdx.x] = smem[threadIdx.x] * 2.0f;
pipe.consumer_release();
}
Strategy 4: Tune Block Size for Occupancy
// Query occupancy for different block sizes
int min_grid_size, block_size;
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
my_kernel, 0, 0);
// block_size is chosen to maximize occupancy given the kernel's register usage
// Or manually query:
int max_blocks_per_sm;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
my_kernel, 256, 0);
printf("Max blocks per SM with 256 threads: %d\n", max_blocks_per_sm);
printf("Occupancy: %.1f%%\n",
100.0f * max_blocks_per_sm * 256 / 2048); // 2048 max resident threads per SM on Hopper (CC 9.0)
[Figure: GEMM kernel performance (% of peak TFLOPS) vs occupancy, H100]
Nsight Compute Warp Scheduler Section
Key Metrics to Monitor
# Comprehensive scheduler profiling
ncu --metrics \
smsp__issue_active.avg.pct_of_peak_sustained_active,\
smsp__inst_executed.avg.per_cycle_active,\
smsp__warps_active.avg,\
smsp__warps_eligible.avg.per_cycle_active,\
smsp__warps_active.avg.pct_of_peak_sustained_active \
-k my_kernel ./my_app
# Interpretation:
# issue_active: % of cycles where at least one instruction was issued
# inst_per_cycle: average instructions issued per active cycle
# warps_active: average number of warps resident on each SM
# warps_eligible: average number of warps eligible for issue per cycle
# If warps_eligible << warps_active, most warps are stalled
Scheduler Health Indicators
| Metric | Healthy Value | Unhealthy Value | Action |
|---|---|---|---|
| issue_active | Above 80% | Below 50% | Increase occupancy or ILP |
| warps_eligible / warps_active | Above 0.5 | Below 0.1 | Reduce memory stalls (tiling, caching) |
| inst_per_cycle | Close to 1.0 | Below 0.5 | Scheduler cannot find work — increase parallelism |
| achieved_occupancy | Above 50% | Below 25% | Reduce register usage or shared memory |
| stall_long_scoreboard % | Below 20% | Above 40% | Kernel is memory-bound — tile or prefetch |
| replay_ratio | 1.0 | Above 2.0 | Fix uncoalesced accesses or bank conflicts |
Summary: The Scheduling Hierarchy
The GPU scheduling hierarchy operates at three levels:
- Grid level: The GigaThread engine distributes thread blocks to SMs. Blocks are assigned to SMs that have sufficient resources (registers, shared memory, warp slots).
- SM level: Each SM has 4 warp schedulers (one per partition). Warps from multiple blocks are interleaved across partitions.
- Warp level: Each scheduler selects one warp per cycle using GTO policy. The scoreboard tracks register dependencies. Stalled warps are skipped at zero cost.
The entire system is designed around one principle: keep execution units busy by maintaining a large pool of independent work. When a warp stalls on memory (500 cycles), 15 other warps in that partition can fill the gap. When a block hits a barrier, other blocks on the same SM continue. When a kernel underutilizes one SM, the GigaThread engine launches new blocks on remaining SMs.
The practical consequence for kernel developers: measure stall reasons with Nsight Compute, identify whether the kernel is memory-bound or compute-bound, and apply the appropriate strategy — tiling and caching for memory-bound kernels, reducing occupancy constraints for compute-bound kernels, and always measuring the actual utilization against the theoretical model.