The GigaThread engine on an H100 can keep up to 4,224 thread blocks resident (32 per SM) across 132 SMs—but only if your kernel configuration does not bottleneck on register pressure, shared memory, or warp-count limits. Launch a kernel with 256 registers per thread and maximum occupancy drops from 64 warps per SM to 8, leaving nearly 88% of the warp slots empty while the SM waits on memory. Launch with 64 KB of shared memory per block and only 3 blocks fit per SM (228 KB limit), capping occupancy well below the 64-warp maximum. The hardware scheduler is relentlessly efficient at dispatching work to SMs—but it cannot fix a poorly configured kernel. Understanding the three-level scheduling hierarchy (GigaThread block distribution, SM warp schedulers, instruction dispatch) is the difference between a kernel that achieves 30% utilization and one that achieves 85%.
This post traces the complete scheduling path from kernel launch to instruction retirement: the GigaThread engine’s block distribution policy, SM-level warp schedulers and scoreboard tracking, instruction dispatch unit rules, occupancy-driven scheduling, and the practical implications for kernel configuration.
The GPU Scheduling Hierarchy
NVIDIA GPUs have a three-level scheduling hierarchy:
- GigaThread Engine (chip-level): Distributes thread blocks from pending grids to SMs.
- SM Warp Scheduler (SM-level): Selects which resident warps issue instructions each cycle.
- Instruction Dispatch Unit (sub-SM): Routes selected instructions to functional units (INT32, FP32, FP64, Tensor Cores, LD/ST, SFU).
┌─────────────────────┐
│ GigaThread Engine │
│ (Block Distributor) │
└──────────┬──────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ GPC 0 │ │ GPC 1 │ │ GPC N │
│ TPC 0-M │ │ TPC 0-M │ │ TPC 0-M │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ SM 0 │ │ SM 2 │ │ SM 2N │
│ Warp │ │ Warp │ │ Warp │
│Scheduler│ │Scheduler│ │Scheduler│
└─────────┘ └─────────┘ └─────────┘
Scheduling Unit Counts by GPU Architecture
| GPU | GPCs | TPCs | SMs | Warp Schedulers/SM | Total Warp Schedulers |
|---|---|---|---|---|---|
| V100 | 6 | 40 | 80 | 4 | 320 |
| A100 | 8 | 54 | 108 | 4 | 432 |
| H100 | 8 | 66 | 132 | 4 | 528 |
| B200 | 12 | 84 | 168 | 4 | 672 |
GigaThread Engine: Block Distribution
The GigaThread engine is the global scheduler that assigns thread blocks to SMs. It maintains a queue of pending thread blocks from one or more kernel launches and distributes them in a round-robin-like fashion across available SMs.
Block Assignment Algorithm
The GigaThread engine follows a greedy assignment strategy:
- For each pending thread block, check if any SM has sufficient resources (registers, shared memory, warp slots) to accept it.
- Assign the block to the SM with the most available resources (or round-robin among equally available SMs).
- Repeat until all blocks are assigned or all SMs are at capacity.
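The greedy loop above can be sketched in Python. This is an illustration of the strategy as described, not NVIDIA's documented algorithm; the resource limits follow the H100 per-SM figures used in this post.

```python
# Simplified model of the GigaThread engine's greedy block distribution.
# Limits follow the H100 per-SM figures; scheduling details are illustrative.

H100_SM_LIMITS = {'blocks': 32, 'warps': 64, 'regs': 65536, 'smem': 228 * 1024}

def fits(sm, block):
    """Check all four per-SM resource constraints."""
    return (sm['blocks'] + 1 <= H100_SM_LIMITS['blocks'] and
            sm['warps'] + block['warps'] <= H100_SM_LIMITS['warps'] and
            sm['regs'] + block['regs'] <= H100_SM_LIMITS['regs'] and
            sm['smem'] + block['smem'] <= H100_SM_LIMITS['smem'])

def distribute(num_blocks, num_sms, block):
    """Greedily assign blocks to the least-loaded SM that can accept them."""
    sms = [{'blocks': 0, 'warps': 0, 'regs': 0, 'smem': 0} for _ in range(num_sms)]
    placed = 0
    for _ in range(num_blocks):
        candidates = [s for s in sms if fits(s, block)]
        if not candidates:
            break  # all SMs at capacity; remaining blocks wait in the queue
        target = min(candidates, key=lambda s: s['warps'])  # most available
        target['blocks'] += 1
        target['warps'] += block['warps']
        target['regs'] += block['regs']
        target['smem'] += block['smem']
        placed += 1
    return placed

# 256-thread blocks, 32 regs/thread, 16 KB smem -> 8 blocks/SM (reg/warp-limited)
block = {'warps': 8, 'regs': 8192, 'smem': 16384}
print(distribute(2000, 132, block))  # 132 SMs x 8 blocks = 1056 resident
```

The remaining 944 blocks stay in the pending queue and are dispatched as resident blocks retire.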
The resources checked per-SM are:
Per-SM resource limits (H100):
Max thread blocks: 32
Max warps: 64
Max threads: 2048
Register file: 65536 registers (256 KB)
Shared memory: Up to 228 KB (configurable)
A thread block is eligible for assignment to an SM only if all four resource constraints are satisfied simultaneously:
def can_assign_block(sm_state, block_config):
"""Check if an SM can accept another thread block."""
threads_per_block = block_config['threads']
warps_per_block = (threads_per_block + 31) // 32
regs_per_block = block_config['regs_per_thread'] * threads_per_block
smem_per_block = block_config['shared_mem_bytes']
# Check all four resource constraints
blocks_ok = sm_state['blocks_used'] + 1 <= 32
warps_ok = sm_state['warps_used'] + warps_per_block <= 64
regs_ok = sm_state['regs_used'] + regs_per_block <= 65536
smem_ok = sm_state['smem_used'] + smem_per_block <= sm_state['smem_limit']
return blocks_ok and warps_ok and regs_ok and smem_ok
# Example: Kernel with 256 threads, 32 regs/thread, 16KB shared memory
block = {'threads': 256, 'regs_per_thread': 32, 'shared_mem_bytes': 16384}
# Per block: 8 warps, 8192 registers, 16384 bytes smem
# Max blocks per SM: min(32, 64//8, 65536//8192, 228K//16K) = min(32, 8, 8, 14) = 8
The GigaThread engine cannot split a thread block across multiple SMs. If each block uses 24 warps (768 threads), only two blocks fit per SM (64-warp limit), leaving 16 warp slots permanently unused. Choosing block sizes that evenly divide the SM’s warp capacity maximizes occupancy.
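A quick check of warp-slot usage for a few block sizes makes this concrete (simple arithmetic against the 64-warp SM limit, assuming warps are the binding resource):

```python
# Warp slots used per SM for different block sizes, against the
# 64-warp limit (assumes warps/block is the binding resource).
MAX_WARPS_PER_SM = 64

def warp_slots_used(warps_per_block):
    blocks = MAX_WARPS_PER_SM // warps_per_block
    return blocks * warps_per_block

for w in [8, 12, 16, 24, 32]:
    print(f"{w:2d} warps/block -> {warp_slots_used(w)}/64 slots used")
```

Block sizes of 8, 16, or 32 warps fill every slot; 12 or 24 warps strand slots no matter how many blocks are pending.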
Multi-Kernel Scheduling (MPS and MIG)
Starting with Volta, the GigaThread engine supports concurrent kernel execution through two mechanisms:
CUDA Streams: Multiple kernels from the same process can execute concurrently if they are in different streams and the GPU has available SMs.
Multi-Process Service (MPS): Multiple CUDA processes share the GPU. The GigaThread engine partitions SM resources across processes based on their active thread block count.
Multi-Instance GPU (MIG): Available on A100/H100, MIG physically partitions the GPU into isolated instances with dedicated SMs, memory controllers, and L2 cache. Each instance has its own GigaThread engine.
// Concurrent kernel execution via streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// These kernels can execute simultaneously on different SMs
kernel_A<<<grid_a, block_a, 0, stream1>>>(args_a);
kernel_B<<<grid_b, block_b, 0, stream2>>>(args_b);
// GigaThread distributes blocks from both kernels
// kernel_A blocks go to SMs 0-65
// kernel_B blocks go to SMs 66-131
// (actual distribution depends on resource availability)
MIG Partition Configurations (H100 SXM)
| Instance Type | SMs | Memory | L2 Cache | Mem BW | Use Case |
|---|---|---|---|---|---|
| 1g.10gb | 16 | 10 GB | 5 MB | 400 GB/s | Small inference |
| 2g.20gb | 33 | 20 GB | 10 MB | 800 GB/s | Medium inference |
| 3g.40gb | 50 | 40 GB | 20 MB | 1200 GB/s | Large inference |
| 4g.40gb | 66 | 40 GB | 20 MB | 1600 GB/s | Training + inference |
| 7g.80gb | 132 | 80 GB | 50 MB | 3350 GB/s | Full GPU (no MIG) |
SM-Level Warp Scheduling
Once thread blocks are assigned to an SM, the SM’s warp schedulers take over. Each SM on Hopper has four warp schedulers, each capable of issuing one instruction per clock cycle.
Warp States
Every warp on an SM is in one of three states:
- Eligible: The warp’s next instruction has all operands ready and the required functional unit is available.
- Stalled: The warp is waiting for data (a memory load), a synchronization barrier (__syncthreads()), or a functional unit (pipeline full).
- Inactive: The warp has completed execution or has not been launched yet.
Warp Lifecycle on SM:
[Assigned] → [Active/Stalled] → [Eligible] → [Issued] → [Active/Stalled] → ...
↑ |
└───────────── stall resolved ──────────────┘
Scheduling Policy
NVIDIA does not publicly document the exact scheduling policy, but empirical analysis reveals a priority-based scheme:
- Scoreboard check: Each warp has a scoreboard tracking which registers have pending writes. An instruction is eligible only when all source registers are ready.
- Priority selection: Among eligible warps, the scheduler uses a combination of:
- Oldest-first (prefer warps that have been stalled longest)
- Round-robin fairness (prevent starvation)
- Bank-conflict avoidance (prefer warps whose instructions access different register banks)
// Pseudocode for warp scheduler logic (simplified)
struct WarpScheduler {
static constexpr int MAX_WARPS_PER_SCHEDULER = 16;
// Each of 4 schedulers handles up to 16 warps (64 total per SM)
WarpState warps[MAX_WARPS_PER_SCHEDULER];
int last_issued_idx = 0;
int select_warp() {
// Phase 1: Find all eligible warps
int eligible[MAX_WARPS_PER_SCHEDULER];
int num_eligible = 0;
for (int i = 0; i < MAX_WARPS_PER_SCHEDULER; i++) {
if (warps[i].active && warps[i].scoreboard_ready()) {
eligible[num_eligible++] = i;
}
}
if (num_eligible == 0) return -1; // No warp to issue
// Phase 2: Priority selection (oldest-first with round-robin tiebreak)
int best = eligible[0];
int best_age = warps[eligible[0]].stall_cycles;
for (int i = 1; i < num_eligible; i++) {
int idx = eligible[i];
if (warps[idx].stall_cycles > best_age) {
best = idx;
best_age = warps[idx].stall_cycles;
}
}
return best;
}
};
Latency Hiding Through Warp Switching
The fundamental insight of GPU scheduling is latency hiding. When a warp stalls on a memory access (300-500 cycles for global memory on H100), the scheduler immediately switches to another eligible warp — at zero cost. There is no context switch overhead because all warps have their registers resident on the SM simultaneously.
The number of warps needed to fully hide memory latency is:

warps_needed = memory_latency_cycles / independent_work_cycles_per_warp

For a memory-bound kernel with one global load every 4 arithmetic instructions, each taking 1 cycle, and a 400-cycle memory latency:

warps_needed = 400 / 4 = 100 warps

Since each SM supports only 64 warps, you cannot fully hide the latency of a purely memory-bound kernel. This is why memory-bound kernels benefit from techniques like shared memory caching and data prefetching.
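The estimate can be wrapped in a one-line helper (a rough model that ignores overlap across the four schedulers):

```python
# Warps needed to hide memory latency: latency divided by the cycles of
# independent work each warp can issue between loads (rough model).
def warps_to_hide(mem_latency_cycles, independent_cycles_per_warp):
    return -(-mem_latency_cycles // independent_cycles_per_warp)  # ceiling

print(warps_to_hide(400, 4))   # 100 warps needed, but the SM holds only 64
print(warps_to_hide(400, 8))   # 50 warps: reachable at ~78% occupancy
```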
[Figure: kernel throughput (% of peak memory bandwidth) vs. active warps per SM, memory-bound kernel]

Higher occupancy (more active warps per SM) generally improves latency hiding, but it also means each warp gets fewer registers. If reducing occupancy from 100% to 50% lets each thread use 2x more registers (avoiding register spills to local memory), performance can actually improve. The optimal occupancy depends on the kernel’s register pressure and memory access pattern.
Instruction Dispatch and Functional Units
Each warp scheduler connects to instruction dispatch units that route operations to the appropriate functional units within the SM.
H100 SM Functional Unit Layout
H100 SM (per SM):
┌──────────────────────────────────────────────┐
│ Warp Scheduler 0 │ Warp Scheduler 1 │
│ Warp Scheduler 2 │ Warp Scheduler 3 │
├──────────────────────────────────────────────┤
│ 128x FP32 cores (4 partitions x 32) │
│ 64x FP64 cores (4 partitions x 16) │
│ 64x INT32 cores (4 partitions x 16) │
│ 4x Tensor Cores (4th gen, FP8/FP16/BF16) │
│ 32x LD/ST units (4 partitions x 8) │
│ 16x SFU (sin, cos, rsqrt, etc) │
├──────────────────────────────────────────────┤
│ 256 KB Register File │
│ 228 KB Shared Memory / L1 Cache │
│ 32 KB Constant Cache │
│ 12 KB Texture Cache │
└──────────────────────────────────────────────┘
A single warp of 32 threads executing an FP32 add requires 32 FP32 cores. With 128 FP32 cores per SM (organized as 4 x 32), the SM can execute one FP32 warp instruction per cycle per scheduler. Since there are 4 schedulers, the SM can issue up to 4 FP32 warp instructions per cycle — but only if all 4 schedulers have eligible warps with FP32 instructions ready.
// Instruction throughput calculation
// H100 SM: 128 FP32 cores at 1830 MHz boost
// Per-SM FP32 throughput: 128 * 2 * 1.83 GHz = 468.5 GFLOPS
// (factor of 2 for fused multiply-add)
// Total GPU: 132 SMs * 468.5 = 61,840 GFLOPS ≈ 62 TFLOPS
// Tensor Core throughput:
// 4th gen TC: 512 FP16 FMAs/cycle per TC, 4 TCs per SM
// Per-SM: 4 * 512 * 2 * 1.83 GHz = 7496 GFLOPS
// Total: 132 * 7496 ≈ 990 TFLOPS (dense), ~1979 TFLOPS (with sparsity)
Dual Issue
On Hopper, certain instruction pairs can execute concurrently because they use separate datapaths, so back-to-back issues from a single warp scheduler overlap in time:
- FP32 + INT32 (simultaneous execution on separate datapaths)
- Tensor Core + LD/ST (overlap compute with memory)
- FP32 + LD/ST
This dual-issue capability is why kernels that interleave arithmetic and memory instructions often perform better than kernels that batch all computation followed by all memory operations.
Instruction Latency and Throughput (H100 SM)
| Instruction | Latency (cycles) | Throughput (ops/cycle/SM) | Pipelining | Notes |
|---|---|---|---|---|
| FP32 ADD/MUL | 4 | 128 | Fully pipelined | 1 warp/cycle/scheduler |
| FP32 FMA | 4 | 128 | Fully pipelined | Counts as 2 FLOPS |
| INT32 ADD | 4 | 64 | Fully pipelined | Shares datapath |
| FP64 FMA | 8 | 64 | Fully pipelined | Half rate of FP32 |
| Tensor Core HMMA | 16-32 | 2048 (FP16 FMAs) | Pipelined | Matrix 16x8x16 |
| Shared memory load | 22-30 | 32 per LD/ST unit | Pipelined | Bank conflicts add latency |
| Global memory load | 300-500 | 32 per LD/ST unit | Pipelined | L2 hit: ~200 cycles |
| __syncthreads() | 5-20 | N/A | Barrier | All warps in block must arrive |
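One consequence of these latencies: when each warp's next instruction depends on its previous result, a scheduler needs roughly latency-many eligible warps to issue every cycle. A tiny issue-slot simulation illustrates this (a simplified in-order model, not a hardware description):

```python
def cycles_to_issue(n_instructions, n_warps, latency=4):
    """Cycles for one scheduler to issue n dependent instructions from
    each of n_warps warps: one issue per cycle, and a warp's next
    instruction becomes eligible `latency` cycles after its previous one."""
    ready = [0] * n_warps                    # cycle at which each warp is eligible
    remaining = [n_instructions] * n_warps
    cycle = 0
    while any(remaining):
        for w in range(n_warps):             # pick the first eligible warp
            if remaining[w] and ready[w] <= cycle:
                remaining[w] -= 1
                ready[w] = cycle + latency   # next instr depends on this one
                break
        cycle += 1
    return cycle

print(cycles_to_issue(100, 1))  # 397 cycles for 100 instructions (~25% issue rate)
print(cycles_to_issue(100, 4))  # 400 cycles for 400 instructions (100% issue rate)
```

With 4 warps (matching the 4-cycle FMA latency), the scheduler issues every cycle; with 1 warp it sits idle three cycles out of four.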
Thread Block Scheduling Strategies
The order in which the GigaThread engine assigns thread blocks to SMs affects data locality, cache behavior, and load balancing.
Linear vs Round-Robin Assignment
The default assignment is roughly linear: blocks 0-7 go to SM 0, blocks 8-15 go to SM 1, and so on (assuming 8 blocks per SM). This has locality benefits: consecutive thread blocks often access consecutive memory regions, and co-locating them on one SM improves L1 cache reuse.
However, linear assignment can cause load imbalance when thread blocks have variable execution time:
import numpy as np

def simulate_scheduling(num_blocks, num_sms, exec_times):
    """Simulate greedy block scheduling and compute SM utilization.

    Models each SM as a serial queue: the scheduler hands each block
    to the SM that will free a slot soonest.
    """
    sm_finish_times = np.zeros(num_sms)
    for block_id in range(num_blocks):
        target_sm = np.argmin(sm_finish_times)  # SM that frees up first
        sm_finish_times[target_sm] += exec_times[block_id]
    total_time = np.max(sm_finish_times)
    utilization = np.mean(sm_finish_times) / total_time
    return total_time, utilization

# Uniform execution times: perfect balance
uniform_times = np.ones(1056) * 100  # 1056 blocks, 100 us each
total, util = simulate_scheduling(1056, 132, uniform_times)
print(f"Uniform: {total:.0f} us, {util:.1%} utilization")

# Variable execution times: imbalanced
variable_times = np.random.lognormal(4.6, 0.5, 1056)  # Mean ~100 us
total, util = simulate_scheduling(1056, 132, variable_times)
print(f"Variable: {total:.0f} us, {util:.1%} utilization")
Persistent Kernels
Persistent kernels launch exactly one thread block per SM and loop internally to process multiple work items. This gives the kernel full control over scheduling:
__global__ void persistent_kernel(float* input, float* output,
int* work_counter, int total_items) {
// One block per SM -- the block persists for the entire kernel
__shared__ int block_work_idx;
while (true) {
// Atomically claim the next work item
if (threadIdx.x == 0) {
block_work_idx = atomicAdd(work_counter, 1);
}
__syncthreads();
int work_idx = block_work_idx;
if (work_idx >= total_items) break; // All work done
// Process work item
process_item(input, output, work_idx, threadIdx.x, blockDim.x);
__syncthreads();
}
}
// Launch exactly SM_COUNT blocks
int sm_count;
cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);
persistent_kernel<<<sm_count, 256>>>(input, output, work_counter, total_items);
FlashAttention and many LLM serving kernels use persistent thread blocks. By keeping blocks resident on SMs, they avoid the overhead of GigaThread block distribution (which adds microseconds per wave of blocks) and maintain warm L1/shared memory caches across work items. The vLLM PagedAttention kernel uses this pattern to iterate over variable-length KV cache pages.
Occupancy Analysis in Practice
Occupancy is the ratio of active warps to the maximum warps an SM can support. Higher occupancy means more warps available for latency hiding.
// Query occupancy for a kernel
int block_size = 256;
int min_grid_size, optimal_block_size;
// API to find optimal block size for max occupancy
cudaOccupancyMaxPotentialBlockSize(
&min_grid_size, &optimal_block_size,
my_kernel, 0, 0);
printf("Optimal block size: %d\n", optimal_block_size);
// Query occupancy for specific block size
int max_active_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks, my_kernel, block_size, 0);
int warps_per_block = (block_size + 31) / 32;
int active_warps = max_active_blocks * warps_per_block;
float occupancy = (float)active_warps / 64; // 64 max warps per SM
printf("Occupancy: %.1f%% (%d active warps)\n", occupancy * 100, active_warps);
Register Pressure Trade-off
Each thread’s register usage directly impacts occupancy. The SM’s register file (65536 registers on H100) is shared among all active threads:
Register Usage vs Occupancy (H100 SM, 256 threads/block)
| Regs/Thread | Regs/Block | Max Blocks/SM | Active Warps | Occupancy |
|---|---|---|---|---|
| 16 | 4096 | 8 (warp-limited) | 64 | 100% |
| 32 | 8192 | 8 | 64 | 100% |
| 48 | 12288 | 5 | 40 | 62.5% |
| 64 | 16384 | 4 | 32 | 50% |
| 96 | 24576 | 2 | 16 | 25% |
| 128 | 32768 | 2 | 16 | 25% |
| 255 | 65280 | 1 | 8 | 12.5% |
Shared Memory Pressure
Similarly, shared memory limits how many blocks can be resident:
def compute_occupancy(threads_per_block, regs_per_thread,
smem_per_block, smem_limit=232448):
"""Compute theoretical occupancy on H100."""
warps_per_block = (threads_per_block + 31) // 32
regs_per_block = regs_per_thread * threads_per_block
# Resource limits
max_by_blocks = 32
max_by_warps = 64 // warps_per_block
max_by_regs = 65536 // regs_per_block if regs_per_block > 0 else 32
max_by_smem = smem_limit // smem_per_block if smem_per_block > 0 else 32
blocks_per_sm = min(max_by_blocks, max_by_warps, max_by_regs, max_by_smem)
active_warps = blocks_per_sm * warps_per_block
occupancy = active_warps / 64.0
return {
'blocks_per_sm': blocks_per_sm,
'active_warps': active_warps,
'occupancy': occupancy,
'limiter': ['blocks', 'warps', 'regs', 'smem'][
[max_by_blocks, max_by_warps, max_by_regs, max_by_smem].index(
min(max_by_blocks, max_by_warps, max_by_regs, max_by_smem)
)
]
}
# FlashAttention-like kernel: 128 threads, 72 regs, 100KB smem
result = compute_occupancy(128, 72, 102400)
print(f"Blocks/SM: {result['blocks_per_sm']}, "
f"Occupancy: {result['occupancy']:.1%}, "
f"Limiter: {result['limiter']}")
# Blocks/SM: 2, Occupancy: 12.5%, Limiter: smem
[Figure: FlashAttention occupancy (%) vs. shared memory usage]

FlashAttention deliberately accepts low occupancy (1-2 blocks per SM) in exchange for keeping entire attention tiles in shared memory, avoiding repeated global memory reads. This is a case where the naive “maximize occupancy” advice would hurt performance.
Hardware Scheduling and Tail Latency
In inference serving, the last wave of thread blocks determines kernel tail latency. If your grid has B blocks and N blocks fit per SM across S SMs, the number of waves is:

waves = ceil(B / (N x S))

The last wave may only partially fill the GPU. If it contains just a handful of blocks, the few SMs running them finish last while the rest sit idle.
def tail_effect(num_blocks, blocks_per_sm, num_sms):
"""Calculate tail effect overhead."""
full_capacity = blocks_per_sm * num_sms
num_waves = -(-num_blocks // full_capacity) # Ceiling division
last_wave_blocks = num_blocks - (num_waves - 1) * full_capacity
last_wave_utilization = last_wave_blocks / full_capacity
overall_efficiency = num_blocks / (num_waves * full_capacity)
return {
'waves': num_waves,
'last_wave_blocks': last_wave_blocks,
'last_wave_util': last_wave_utilization,
'overall_efficiency': overall_efficiency
}
# H100: 132 SMs, assume 4 blocks per SM = 528 block capacity
for num_blocks in [528, 529, 600, 1000, 1056]:
r = tail_effect(num_blocks, 4, 132)
print(f"Blocks={num_blocks}: waves={r['waves']}, "
f"last_wave_util={r['last_wave_util']:.1%}, "
f"efficiency={r['overall_efficiency']:.1%}")
# 528: 1 wave, 100%, 100%
# 529: 2 waves, 0.2%, 50.1% -- terrible! 1 extra block costs 100% overhead
# 600: 2 waves, 13.6%, 56.8%
# 1000: 2 waves, 89.4%, 94.7%
# 1056: 2 waves, 100%, 100%
Launching 529 blocks on a GPU that fits 528 per wave means the second wave runs a single block while 131 SMs sit idle. This almost doubles kernel execution time. Always check whether your grid size is a near-multiple of SM_count x blocks_per_SM, and pad or restructure to avoid partial last waves.
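One way to restructure is to increase the work per block so the grid size lands on a whole number of waves. A sketch, reusing the 132-SM x 4-blocks/SM = 528-block wave from above (the item counts are made-up illustration values):

```python
# Grid size and wave efficiency for a given work granularity.
# Assumes the 132-SM x 4-blocks/SM = 528-block wave used above.
def wave_efficiency(total_items, items_per_block, wave=132 * 4):
    blocks = -(-total_items // items_per_block)  # ceiling division
    waves = -(-blocks // wave)
    return blocks, waves, blocks / (waves * wave)

print(wave_efficiency(529_000, 1000))  # (529, 2, ~0.50): tiny 2nd wave
print(wave_efficiency(529_000, 1002))  # (528, 1, 1.00): one full wave
```

Giving each block 0.2% more items eliminates the second wave entirely and roughly halves kernel runtime.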
Measuring Scheduling Behavior
NVIDIA provides tools to observe scheduling decisions:
# Nsight Compute: warp scheduler statistics
ncu --metrics \
smsp__warps_active.avg.per_cycle_active,\
smsp__warps_eligible.avg.per_cycle_active,\
smsp__issue_active.avg.per_cycle_active,\
smsp__inst_executed.avg.per_cycle_active \
--kernel-name "my_kernel" \
./my_program
# Key metrics:
# warps_active: Average warps resident on SM (occupancy)
# warps_eligible: Average warps ready to issue (scheduling headroom)
# issue_active: Fraction of cycles with at least 1 warp issued
# inst_executed: Average instructions issued per cycle
The ratio warps_eligible / warps_active tells you what fraction of resident warps are actually ready to execute. If this ratio is low (say, less than 20%), most warps are stalled and you need either more warps (higher occupancy) or fewer stalls (better memory access patterns).
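These rules of thumb can be wrapped in a quick triage helper (the cutoffs are heuristics from this post, not NVIDIA guidance):

```python
# Rough triage of ncu warp-scheduler metrics (rule-of-thumb cutoffs).
def diagnose(warps_active, warps_eligible, max_warps=64):
    eligible_frac = warps_eligible / warps_active
    occupancy = warps_active / max_warps
    if eligible_frac < 0.2:
        return "mostly stalled: improve memory access patterns or add warps"
    if occupancy < 0.25:
        return "low occupancy: reduce register/shared-memory pressure"
    return "scheduler has headroom"

print(diagnose(8, 1))    # eligible_frac 0.125 -> mostly stalled
print(diagnose(64, 32))  # eligible_frac 0.5   -> scheduler has headroom
```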
Scheduling Metrics for Common LLM Kernels (H100)
| Kernel | Active Warps | Eligible Warps | Issue Rate | Stall Reason |
|---|---|---|---|---|
| GEMM (Tensor Core) | 32 | 8-12 | 95%+ | Register dependency |
| FlashAttention fwd | 8 | 2-4 | 88% | Shared memory bank conflict |
| LayerNorm | 64 | 24-32 | 98% | Memory latency (brief) |
| Softmax | 48 | 16-24 | 92% | Memory BW saturation |
| Elementwise (ReLU) | 64 | 32-48 | 99% | Minimal stalls |
| AllReduce (NCCL) | 16 | 4-8 | 60% | Network/NVLink latency |
Scheduling Implications for Kernel Design
Understanding hardware scheduling leads to concrete optimization strategies:
1. Match grid size to SM count. Use cudaGetDeviceProperties() to query SM count and size your grid as a multiple of SM_count * target_blocks_per_SM.
2. Balance register and shared memory usage. Use __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) to hint the compiler toward the desired occupancy point.
// Tell the compiler to target 2 blocks per SM with 256 threads
__global__ void __launch_bounds__(256, 2)
my_kernel(float* data, int n) {
// Compiler will use up to 65536 / (256 * 2) = 128 registers per thread
// and will optimize for 2-block residency
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
// ... computation
}
}
3. Minimize synchronization barriers. Each __syncthreads() stalls all warps in a block until the slowest warp arrives. Reducing barrier frequency improves scheduler flexibility.
4. Interleave compute and memory. The scheduler can dual-issue compute + memory instructions. Structure your code to alternate between arithmetic and load/store operations.
5. Use cooperative groups for flexible synchronization. CUDA cooperative groups allow synchronization at granularities finer than the full thread block, giving the scheduler more eligible warps.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void flexible_sync_kernel(float* data) {
cg::thread_block block = cg::this_thread_block();
cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
// Warp-level sync instead of block-level
float val = data[threadIdx.x];
val = cg::reduce(warp, val, cg::plus<float>());
// Only 32 threads sync, other warps remain eligible
}
Summary
GPU hardware scheduling is a three-level system: the GigaThread engine distributes thread blocks to SMs, SM warp schedulers select eligible warps every cycle, and dispatch units route instructions to functional units. Performance depends on giving the scheduler enough eligible warps to hide latency (occupancy), avoiding resource bottlenecks (registers, shared memory), sizing grids to avoid partial last waves, and understanding when low occupancy is acceptable (as in FlashAttention). The scheduler is deterministic hardware — it cannot optimize your kernel for you, but understanding its behavior lets you write kernels that keep it fully utilized.