Part of Series: GPU Hardware & AI Accelerators (27 of 30)

GPU Hardware Scheduling: How the GigaThread Engine Distributes Work Across SMs

The GigaThread engine on an H100 distributes thread blocks across 132 SMs, but only if your kernel configuration does not bottleneck on register pressure, shared memory, or warp count limits. Launch a kernel with 256 registers per thread and occupancy drops from 64 warps per SM to 8, leaving 87.5% of the warp slots empty while the SM waits on memory. Launch with 64 KB of shared memory per block and only 3 blocks fit per SM (228 KB limit), no matter how many warp slots remain. The hardware scheduler is relentlessly efficient at dispatching work to SMs, but it cannot fix a poorly configured kernel. Understanding the three-level scheduling hierarchy (GigaThread block distribution, SM warp schedulers, instruction dispatch) is the difference between a kernel that achieves 30% utilization and one that achieves 85%.

This post traces the complete scheduling path from kernel launch to instruction retirement: the GigaThread engine’s block distribution policy, SM-level warp schedulers and scoreboard tracking, instruction dispatch unit rules, occupancy-driven scheduling, and the practical implications for kernel configuration.

The GPU Scheduling Hierarchy

NVIDIA GPUs have a three-level scheduling hierarchy:

  1. GigaThread Engine (chip-level): Distributes thread blocks from pending grids to SMs.
  2. SM Warp Scheduler (SM-level): Selects which resident warps issue instructions each cycle.
  3. Instruction Dispatch Unit (sub-SM): Routes selected instructions to functional units (INT32, FP32, FP64, Tensor Cores, LD/ST, SFU).
                    ┌─────────────────────┐
                    │   GigaThread Engine  │
                    │  (Block Distributor) │
                    └──────────┬──────────┘

              ┌────────────────┼────────────────┐
              │                │                │
         ┌────┴────┐     ┌────┴────┐      ┌────┴────┐
         │  GPC 0  │     │  GPC 1  │      │  GPC N  │
         │ TPC 0-M │     │ TPC 0-M │      │ TPC 0-M │
         └────┬────┘     └────┬────┘      └────┬────┘
              │                │                │
         ┌────┴────┐     ┌────┴────┐      ┌────┴────┐
         │  SM 0   │     │  SM 2   │      │  SM 2N  │
         │ Warp    │     │ Warp    │      │ Warp    │
         │Scheduler│     │Scheduler│      │Scheduler│
         └─────────┘     └─────────┘      └─────────┘

Scheduling Unit Counts by GPU Architecture

GPU     GPCs   TPCs   SMs   Warp Schedulers/SM   Total Warp Schedulers
V100    6      40     80    4                    320
A100    8      56     108   4                    432
H100    8      66     132   4                    528
B200    12     84     168   4                    672
Note: Each warp scheduler can issue one instruction per cycle. H100 with 132 SMs x 4 schedulers = 528 independent instruction issue slots per cycle.

GigaThread Engine: Block Distribution

The GigaThread engine is the global scheduler that assigns thread blocks to SMs. It maintains a queue of pending thread blocks from one or more kernel launches and distributes them in a round-robin-like fashion across available SMs.

Block Assignment Algorithm

The GigaThread engine follows a greedy assignment strategy:

  1. For each pending thread block, check if any SM has sufficient resources (registers, shared memory, warp slots) to accept it.
  2. Assign the block to the SM with the most available resources (or round-robin among equally available SMs).
  3. Repeat until all blocks are assigned or all SMs are at capacity.

The resources checked per-SM are:

Per-SM resource limits (H100):
  Max thread blocks:         32
  Max warps:                 64
  Max threads:               2048
  Register file:             65536 registers (256 KB)
  Shared memory:             Up to 228 KB (configurable)

A thread block is eligible for assignment to an SM only if all four resource constraints are satisfied simultaneously:

def can_assign_block(sm_state, block_config):
    """Check if an SM can accept another thread block."""
    threads_per_block = block_config['threads']
    warps_per_block = (threads_per_block + 31) // 32
    regs_per_block = block_config['regs_per_thread'] * threads_per_block
    smem_per_block = block_config['shared_mem_bytes']

    # Check all four resource constraints
    blocks_ok = sm_state['blocks_used'] + 1 <= 32
    warps_ok = sm_state['warps_used'] + warps_per_block <= 64
    regs_ok = sm_state['regs_used'] + regs_per_block <= 65536
    smem_ok = sm_state['smem_used'] + smem_per_block <= sm_state['smem_limit']

    return blocks_ok and warps_ok and regs_ok and smem_ok

# Example: Kernel with 256 threads, 32 regs/thread, 16KB shared memory
block = {'threads': 256, 'regs_per_thread': 32, 'shared_mem_bytes': 16384}
# Per block: 8 warps, 8192 registers, 16384 bytes smem
# Max blocks per SM: min(32, 64//8, 65536//8192, 228K//16K) = min(32, 8, 8, 14) = 8
Block Size Affects SM Utilization

The GigaThread engine cannot split a thread block across multiple SMs. Since a block is limited to 1024 threads (32 warps), any single block fits within the SM's 64 warp slots, but awkward sizes waste them: with 24-warp blocks (768 threads), only two blocks fit per SM, leaving 16 warp slots unused. Choosing block sizes that evenly divide the SM's warp capacity maximizes occupancy.
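A quick way to see this effect is to tabulate warp-slot utilization for a few block sizes. This is a sketch assuming H100's per-SM limits of 64 warps and 32 blocks; `warp_slot_utilization` is a hypothetical helper, not a CUDA API:

```python
def warp_slot_utilization(threads_per_block, max_warps=64, max_blocks=32):
    """Fraction of an SM's warp slots fillable by whole blocks of this size."""
    warps_per_block = (threads_per_block + 31) // 32
    blocks = min(max_blocks, max_warps // warps_per_block)
    return blocks * warps_per_block / max_warps

for threads in (96, 128, 256, 768, 1024):
    print(f"{threads:4d} threads/block -> "
          f"{warp_slot_utilization(threads):.0%} of warp slots usable")
```

Block sizes whose warp count divides 64 evenly (128, 256, 512, 1024 threads) reach 100%; 768-thread blocks strand a quarter of the slots.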

Multi-Kernel Scheduling (Streams, MPS, and MIG)

The GigaThread engine can schedule thread blocks from more than one kernel at a time, through three mechanisms (MPS gained hardware-level isolation with Volta; MIG arrived with Ampere):

CUDA Streams: Multiple kernels from the same process can execute concurrently if they are in different streams and the GPU has available SMs.

Multi-Process Service (MPS): Multiple CUDA processes share the GPU. The GigaThread engine partitions SM resources across processes based on their active thread block count.

Multi-Instance GPU (MIG): Available on A100/H100, MIG physically partitions the GPU into isolated instances with dedicated SMs, memory controllers, and L2 cache. Each instance has its own GigaThread engine.

// Concurrent kernel execution via streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// These kernels can execute simultaneously on different SMs
kernel_A<<<grid_a, block_a, 0, stream1>>>(args_a);
kernel_B<<<grid_b, block_b, 0, stream2>>>(args_b);

// GigaThread distributes blocks from both kernels
// kernel_A blocks go to SMs 0-65
// kernel_B blocks go to SMs 66-131
// (actual distribution depends on resource availability)

MIG Partition Configurations (H100 SXM)

Instance Type   SMs   Memory   L2 Cache   Mem BW      Use Case
1g.10gb         16    10 GB    5 MB       400 GB/s    Small inference
2g.20gb         33    20 GB    10 MB      800 GB/s    Medium inference
3g.40gb         50    40 GB    20 MB      1200 GB/s   Large inference
4g.40gb         66    40 GB    20 MB      1600 GB/s   Training + inference
7g.80gb         132   80 GB    50 MB      3350 GB/s   Full GPU (no MIG)
Note: MIG partitions are isolated at the hardware level. A crash in one instance cannot affect others.

SM-Level Warp Scheduling

Once thread blocks are assigned to an SM, the SM’s warp schedulers take over. Each SM on Hopper has four warp schedulers, each capable of issuing one instruction per clock cycle.

Warp States

Every warp on an SM is in one of three states:

  1. Eligible: The warp’s next instruction has all operands ready and the required functional unit is available.
  2. Stalled: The warp is waiting for data (memory load), a synchronization barrier (__syncthreads()), or a functional unit (pipeline full).
  3. Inactive: The warp has completed execution or has not been launched yet.

Warp Lifecycle on SM:
  [Assigned] → [Active/Stalled] → [Eligible] → [Issued] → [Active/Stalled] → ...
                     ↑                                           |
                     └───────────── stall resolved ──────────────┘

Scheduling Policy

NVIDIA does not publicly document the exact scheduling policy, but empirical analysis reveals a priority-based scheme:

  1. Scoreboard check: Each warp has a scoreboard tracking which registers have pending writes. An instruction is eligible only when all source registers are ready.
  2. Priority selection: Among eligible warps, the scheduler uses a combination of:
    • Oldest-first (prefer warps that have been stalled longest)
    • Round-robin fairness (prevent starvation)
    • Bank-conflict avoidance (prefer warps whose instructions access different register banks)
// Pseudocode for warp scheduler logic (simplified)
struct WarpState {
    bool active = false;            // resident and not yet retired
    int  stall_cycles = 0;          // cycles since this warp last issued
    bool scoreboard_ready() const;  // true when all source operands are ready
};

struct WarpScheduler {
    static constexpr int MAX_WARPS_PER_SCHEDULER = 16;
    // Each of 4 schedulers handles up to 16 warps (64 total per SM)

    WarpState warps[MAX_WARPS_PER_SCHEDULER];
    int last_issued_idx = 0;  // would drive the round-robin tiebreak

    int select_warp() {
        // Phase 1: Find all eligible warps
        int eligible[MAX_WARPS_PER_SCHEDULER];
        int num_eligible = 0;

        for (int i = 0; i < MAX_WARPS_PER_SCHEDULER; i++) {
            if (warps[i].active && warps[i].scoreboard_ready()) {
                eligible[num_eligible++] = i;
            }
        }

        if (num_eligible == 0) return -1;  // No warp to issue

        // Phase 2: Priority selection (oldest-first with round-robin tiebreak)
        int best = eligible[0];
        int best_age = warps[eligible[0]].stall_cycles;

        for (int i = 1; i < num_eligible; i++) {
            int idx = eligible[i];
            if (warps[idx].stall_cycles > best_age) {
                best = idx;
                best_age = warps[idx].stall_cycles;
            }
        }

        return best;
    }
};

Latency Hiding Through Warp Switching

The fundamental insight of GPU scheduling is latency hiding. When a warp stalls on a memory access (300-500 cycles for global memory on H100), the scheduler immediately switches to another eligible warp — at zero cost. There is no context switch overhead because all warps have their registers resident on the SM simultaneously.

The number of warps needed to fully hide memory latency is:

\text{Warps needed} = \frac{\text{Memory latency (cycles)}}{\text{Instructions per warp between memory ops}}

For a memory-bound kernel with one global load every 4 arithmetic instructions, each taking 1 cycle, and a 400-cycle memory latency:

\text{Warps needed} = \frac{400}{4} = 100 \text{ warps}

Since each SM supports only 64 warps, you cannot fully hide the latency of a purely memory-bound kernel. This is why memory-bound kernels benefit from techniques like shared memory caching and data prefetching.
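The formula can be evaluated for a few operating points. A minimal sketch, where the 400-cycle latency and the instruction mixes are illustrative values, not measured numbers:

```python
import math

def warps_needed(memory_latency_cycles, instructions_between_loads):
    """Warps required to fully hide memory latency, per the formula above."""
    return math.ceil(memory_latency_cycles / instructions_between_loads)

for instrs in (4, 8, 16, 32):
    w = warps_needed(400, instrs)
    verdict = "yes" if w <= 64 else "no (exceeds 64 warps/SM)"
    print(f"{instrs:2d} instructions/load -> {w:3d} warps needed, hideable: {verdict}")
```

With 8 or more arithmetic instructions per load, 64 resident warps suffice; at 4 instructions per load the requirement (100 warps) exceeds the hardware limit, which is exactly the memory-bound case described above.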

Kernel Throughput vs Active Warps per SM (Memory-Bound Kernel)

Active Warps         % of Peak Memory BW
8                    22%
16                   41%
32                   68%
48                   85%
64 (max occupancy)   92%
ℹ️ Occupancy Is Not Always the Goal

Higher occupancy (more active warps per SM) generally improves latency hiding, but it also means each warp gets fewer registers. If reducing occupancy from 100% to 50% lets each thread use 2x more registers (avoiding register spills to local memory), performance can actually improve. The optimal occupancy depends on the kernel’s register pressure and memory access pattern.

Instruction Dispatch and Functional Units

Each warp scheduler connects to instruction dispatch units that route operations to the appropriate functional units within the SM.

H100 SM Functional Unit Layout

H100 SM (per SM):
  ┌──────────────────────────────────────────────┐
  │ Warp Scheduler 0  │ Warp Scheduler 1         │
  │ Warp Scheduler 2  │ Warp Scheduler 3         │
  ├──────────────────────────────────────────────┤
  │ 128x FP32 cores    (4 partitions x 32)       │
  │  64x FP64 cores    (4 partitions x 16)       │
  │  64x INT32 cores   (4 partitions x 16)       │
  │   4x Tensor Cores  (4th gen, FP8/FP16/BF16)  │
  │  32x LD/ST units   (4 partitions x 8)        │
  │   4x SFU           (sin, cos, rsqrt, etc)    │
  ├──────────────────────────────────────────────┤
  │ 256 KB Register File                          │
  │ 228 KB Shared Memory / L1 Cache               │
  │  32 KB Constant Cache                         │
  │  12 KB Texture Cache                          │
  └──────────────────────────────────────────────┘

A single warp of 32 threads executing an FP32 add requires 32 FP32 cores. With 128 FP32 cores per SM (organized as 4 x 32), the SM can execute one FP32 warp instruction per cycle per scheduler. Since there are 4 schedulers, the SM can issue up to 4 FP32 warp instructions per cycle — but only if all 4 schedulers have eligible warps with FP32 instructions ready.

// Instruction throughput calculation
// H100 SM: 128 FP32 cores at 1830 MHz boost
// Per-SM FP32 throughput: 128 * 2 * 1.83 GHz = 468.5 GFLOPS
// (factor of 2 for fused multiply-add)
// Total GPU: 132 SMs * 468.5 = 61,840 GFLOPS ≈ 62 TFLOPS

// Tensor Core throughput:
// 4th gen TC: 256 FP16 ops/cycle per TC, 4 TCs per SM
// Per-SM: 4 * 256 * 2 * 1.83 GHz = 3750 GFLOPS (with sparsity: 7500)
// Total: 132 * 3750 = 495 TFLOPS (dense), 990 TFLOPS (sparse)

Dual Issue

On Hopper, what is often described as dual issue is really concurrent execution across independent datapaths: each warp scheduler issues at most one instruction per cycle, but instructions issued on consecutive cycles overlap in these separately pipelined unit pairs:

  • FP32 + INT32 (independent execution datapaths)
  • Tensor Core + LD/ST (overlap compute with memory)
  • FP32 + LD/ST

This overlap is why kernels that interleave arithmetic and memory instructions often perform better than kernels that batch all computation followed by all memory operations.


Instruction Latency and Throughput (H100 SM)

Instruction          Latency (cycles)   Throughput (ops/cycle/SM)   Pipelining        Notes
FP32 ADD/MUL         4                  128                         Fully pipelined   1 warp/cycle/scheduler
FP32 FMA             4                  128                         Fully pipelined   Counts as 2 FLOPS
INT32 ADD            4                  64                          Fully pipelined   Shares datapath
FP64 FMA             8                  64                          Fully pipelined   Half rate of FP32
Tensor Core HMMA     16-32              1024 (FP16)                 Pipelined         Matrix 16x8x16
Shared memory load   22-30              32 per LD/ST unit           Pipelined         Bank conflicts add latency
Global memory load   300-500            32 per LD/ST unit           Pipelined         L2 hit: ~200 cycles
__syncthreads()      5-20               N/A                         Barrier           All warps in block must arrive
Note: Throughput is per SM. Latency is from issue to result availability. Memory latencies assume L1 miss.

Thread Block Scheduling Strategies

The order in which the GigaThread engine assigns thread blocks to SMs affects data locality, cache behavior, and load balancing.

Linear vs Round-Robin Assignment

The default assignment is roughly linear: blocks 0-7 go to SM 0, blocks 8-15 go to SM 1, and so on (assuming 8 blocks per SM). The exact policy is undocumented and varies across architectures and driver versions, but locality is the motivation: consecutive thread blocks often access consecutive memory regions, and co-locating them on the same SM improves L1 and L2 cache reuse.

However, linear assignment can cause load imbalance when thread blocks have variable execution time:

import numpy as np

def simulate_scheduling(num_blocks, num_sms, exec_times):
    """Simulate greedy block scheduling and compute SM utilization.
    Simplification: each SM runs one block at a time; intra-SM concurrency
    shifts the absolute numbers but not the imbalance trend."""
    sm_finish_times = np.zeros(num_sms)

    for block_id in range(num_blocks):
        # Each block goes to the SM that frees up first (greedy)
        target_sm = np.argmin(sm_finish_times)
        sm_finish_times[target_sm] += exec_times[block_id]

    total_time = np.max(sm_finish_times)
    avg_time = np.mean(sm_finish_times)
    utilization = avg_time / total_time
    return total_time, utilization

# Uniform execution times: perfect balance
uniform_times = np.ones(1056) * 100  # 1056 blocks, 100 us each
total, util = simulate_scheduling(1056, 132, uniform_times)
print(f"Uniform: {total:.0f} us, {util:.1%} utilization")

# Variable execution times: imbalanced
variable_times = np.random.lognormal(4.6, 0.5, 1056)  # Mean ~100 us
total, util = simulate_scheduling(1056, 132, variable_times)
print(f"Variable: {total:.0f} us, {util:.1%} utilization")

Persistent Kernels

Persistent kernels launch exactly one thread block per SM and loop internally to process multiple work items. This gives the kernel full control over scheduling:

__global__ void persistent_kernel(float* input, float* output,
                                   int* work_counter, int total_items) {
    // One block per SM -- the block persists for the entire kernel
    __shared__ int block_work_idx;

    while (true) {
        // Atomically claim the next work item
        if (threadIdx.x == 0) {
            block_work_idx = atomicAdd(work_counter, 1);
        }
        __syncthreads();

        int work_idx = block_work_idx;
        if (work_idx >= total_items) break;  // All work done

        // Process work item
        process_item(input, output, work_idx, threadIdx.x, blockDim.x);
        __syncthreads();
    }
}

// Launch exactly SM_COUNT blocks
int sm_count;
cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);
persistent_kernel<<<sm_count, 256>>>(input, output, work_counter, total_items);
Persistent Kernels in LLM Inference

FlashAttention and many LLM serving kernels use persistent thread blocks. By keeping blocks resident on SMs, they avoid the overhead of GigaThread block distribution (which adds microseconds per wave of blocks) and maintain warm L1/shared memory caches across work items. The vLLM PagedAttention kernel uses this pattern to iterate over variable-length KV cache pages.
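The atomic work-counter pattern has a direct host-side analogue. A minimal Python sketch, where threads stand in for persistent blocks and a lock-protected counter for atomicAdd (all names here are illustrative, not from any library):

```python
import threading

def persistent_worker(counter, lock, results, total_items):
    """Each 'block' loops, claiming work items until the queue is drained."""
    while True:
        with lock:                  # plays the role of atomicAdd(work_counter, 1)
            idx = counter[0]
            counter[0] += 1
        if idx >= total_items:
            break                   # all work done
        results[idx] = idx * idx    # stand-in for process_item()

counter, lock = [0], threading.Lock()
total = 1000
results = [None] * total
workers = [threading.Thread(target=persistent_worker,
                            args=(counter, lock, results, total))
           for _ in range(8)]       # 8 "SMs"
for t in workers: t.start()
for t in workers: t.join()
print(results[:5])  # [0, 1, 4, 9, 16]
```

The key property, as in the CUDA version, is dynamic load balancing: a fast worker simply claims more items, so no static block-to-SM assignment can leave it idle.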

Occupancy Analysis in Practice

Occupancy is the ratio of active warps to the maximum warps an SM can support. Higher occupancy means more warps available for latency hiding.

// Query occupancy for a kernel
int block_size = 256;
int min_grid_size, optimal_block_size;

// API to find optimal block size for max occupancy
cudaOccupancyMaxPotentialBlockSize(
    &min_grid_size, &optimal_block_size,
    my_kernel, 0, 0);
printf("Optimal block size: %d\n", optimal_block_size);

// Query occupancy for specific block size
int max_active_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &max_active_blocks, my_kernel, block_size, 0);
int warps_per_block = (block_size + 31) / 32;
int active_warps = max_active_blocks * warps_per_block;
float occupancy = (float)active_warps / 64;  // 64 max warps per SM
printf("Occupancy: %.1f%% (%d active warps)\n", occupancy * 100, active_warps);

Register Pressure Trade-off

Each thread’s register usage directly impacts occupancy. The SM’s register file (65536 registers on H100) is shared among all active threads:

\text{Max threads} = \min\left(2048, \frac{65536}{\text{regs per thread}}\right)


Register Usage vs Occupancy (H100 SM, 256 threads/block)

Regs/Thread   Regs/Block   Max Blocks/SM      Active Warps   Occupancy
16            4096         8 (warp-limited)   64             100%
32            8192         8                  64             100%
48            12288        5                  40             62.5%
64            16384        4                  32             50%
96            24576        2                  16             25%
128           32768        2                  16             25%
255           65280        1                  8              12.5%
Note: With 256 threads per block (8 warps), max blocks is min(floor(65536 / regs_per_block), floor(64 / 8), 32). Register count is the most common occupancy limiter.

Shared Memory Pressure

Similarly, shared memory limits how many blocks can be resident:

def compute_occupancy(threads_per_block, regs_per_thread,
                      smem_per_block, smem_limit=232448):
    """Compute theoretical occupancy on H100."""
    warps_per_block = (threads_per_block + 31) // 32
    regs_per_block = regs_per_thread * threads_per_block

    # Resource limits
    max_by_blocks = 32
    max_by_warps = 64 // warps_per_block
    max_by_regs = 65536 // regs_per_block if regs_per_block > 0 else 32
    max_by_smem = smem_limit // smem_per_block if smem_per_block > 0 else 32

    blocks_per_sm = min(max_by_blocks, max_by_warps, max_by_regs, max_by_smem)
    active_warps = blocks_per_sm * warps_per_block
    occupancy = active_warps / 64.0

    return {
        'blocks_per_sm': blocks_per_sm,
        'active_warps': active_warps,
        'occupancy': occupancy,
        'limiter': ['blocks', 'warps', 'regs', 'smem'][
            [max_by_blocks, max_by_warps, max_by_regs, max_by_smem].index(
                min(max_by_blocks, max_by_warps, max_by_regs, max_by_smem)
            )
        ]
    }

# FlashAttention-like kernel: 128 threads, 72 regs, 100KB smem
result = compute_occupancy(128, 72, 102400)
print(f"Blocks/SM: {result['blocks_per_sm']}, "
      f"Occupancy: {result['occupancy']:.1%}, "
      f"Limiter: {result['limiter']}")
# Blocks/SM: 2, Occupancy: 12.5%, Limiter: smem

FlashAttention Occupancy vs Shared Memory Usage

Shared Memory per Block   Blocks/SM   Occupancy
48 KB                     4           25%
96 KB                     2           12.5%
164 KB                    1           6.25%
228 KB (max smem)         1           6.25%

FlashAttention deliberately accepts low occupancy (1-2 blocks per SM) in exchange for keeping entire attention tiles in shared memory, avoiding repeated global memory reads. This is a case where the naive “maximize occupancy” advice would hurt performance.

Hardware Scheduling and Tail Latency

In inference serving, the last wave of thread blocks determines kernel tail latency. If your grid has N blocks, B blocks fit per SM, and there are S SMs, the number of waves is:

\text{Waves} = \left\lceil \frac{N}{B \times S} \right\rceil

The last wave may only partially fill the GPU: its remaining N \bmod (B \times S) blocks occupy only some SMs, while the rest sit idle until the kernel completes.

def tail_effect(num_blocks, blocks_per_sm, num_sms):
    """Calculate tail effect overhead."""
    full_capacity = blocks_per_sm * num_sms
    num_waves = -(-num_blocks // full_capacity)  # Ceiling division
    last_wave_blocks = num_blocks - (num_waves - 1) * full_capacity
    last_wave_utilization = last_wave_blocks / full_capacity
    overall_efficiency = num_blocks / (num_waves * full_capacity)
    return {
        'waves': num_waves,
        'last_wave_blocks': last_wave_blocks,
        'last_wave_util': last_wave_utilization,
        'overall_efficiency': overall_efficiency
    }

# H100: 132 SMs, assume 4 blocks per SM = 528 block capacity
for num_blocks in [528, 529, 600, 1000, 1056]:
    r = tail_effect(num_blocks, 4, 132)
    print(f"Blocks={num_blocks}: waves={r['waves']}, "
          f"last_wave_util={r['last_wave_util']:.1%}, "
          f"efficiency={r['overall_efficiency']:.1%}")
# 528:  1 wave,  100%, 100%
# 529:  2 waves, 0.2%,  50.1%  -- terrible! 1 extra block costs 100% overhead
# 600:  2 waves, 13.6%, 56.8%
# 1000: 2 waves, 89.4%, 94.7%
# 1056: 2 waves, 100%, 100%
⚠️ The 529-Block Trap

Launching 529 blocks on a GPU that fits 528 per wave means the second wave runs a single block while 131 SMs sit idle. This almost doubles kernel execution time. Always check whether your grid size is a near-multiple of SM_count x blocks_per_SM, and pad or restructure to avoid partial last waves.
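A small helper can flag grids that fall into this trap. A sketch using the wave-capacity definition above; `check_wave_fit` and its 50% threshold are illustrative choices, not a CUDA API:

```python
def check_wave_fit(num_blocks, num_sms=132, blocks_per_sm=4):
    """If the last wave is badly underfilled, suggest rounding the grid
    down to a whole number of waves (restructuring work into fewer blocks)."""
    capacity = num_sms * blocks_per_sm   # blocks per wave
    remainder = num_blocks % capacity
    if remainder == 0:
        return num_blocks                # already a whole number of waves
    if remainder / capacity < 0.5:
        # Rounding down saves an almost-empty wave
        return (num_blocks // capacity) * capacity
    return num_blocks                    # last wave mostly full; leave as is

print(check_wave_fit(529))   # 528: drop one block, save an entire wave
print(check_wave_fit(1000))  # 1000: last wave is ~89% full, acceptable
```

In practice "rounding down" means merging the leftover work into existing blocks (e.g. via a grid-stride loop) rather than discarding it.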

Measuring Scheduling Behavior

NVIDIA provides tools to observe scheduling decisions:

# Nsight Compute: warp scheduler statistics
ncu --metrics \
    smsp__warps_active.avg.per_cycle_active,\
    smsp__warps_eligible.avg.per_cycle_active,\
    smsp__issue_active.avg.per_cycle_active,\
    smsp__inst_executed.avg.per_cycle_active \
    --kernel-name "my_kernel" \
    ./my_program

# Key metrics:
# warps_active:   Average warps resident on SM (occupancy)
# warps_eligible:  Average warps ready to issue (scheduling headroom)
# issue_active:    Fraction of cycles with at least 1 warp issued
# inst_executed:   Average instructions issued per cycle

The ratio warps_eligible / warps_active tells you what fraction of resident warps are actually ready to execute. If this ratio is low (say, less than 20%), most warps are stalled and you need either more warps (higher occupancy) or fewer stalls (better memory access patterns).
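That rule of thumb can be captured in a small triage helper for profiler output. A sketch: `diagnose_scheduling` is a hypothetical function, and the 20%/90% thresholds are the heuristics from the text, not NVIDIA-defined cutoffs:

```python
def diagnose_scheduling(warps_active, warps_eligible, issue_active):
    """Classify a kernel from Nsight Compute warp scheduler metrics."""
    eligible_ratio = warps_eligible / warps_active
    if issue_active > 0.9:
        return "issue-bound: schedulers nearly saturated; examine instruction mix"
    if eligible_ratio < 0.2:
        return "latency-bound: most warps stalled; raise occupancy or fix memory access"
    return "scheduler has headroom: bottleneck is likely elsewhere"

# FlashAttention-like profile: 48 resident warps, only 4 eligible, 55% issue rate
print(diagnose_scheduling(warps_active=48, warps_eligible=4, issue_active=0.55))
```

Feeding in the `smsp__warps_active`, `smsp__warps_eligible`, and `smsp__issue_active` averages from the ncu command above gives a first-order diagnosis before digging into per-stall-reason metrics.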


Scheduling Metrics for Common LLM Kernels (H100)

Kernel               Active Warps   Eligible Warps   Issue Rate   Stall Reason
GEMM (Tensor Core)   32             8-12             95%+         Register dependency
FlashAttention fwd   8              2-4              88%          Shared memory bank conflict
LayerNorm            64             24-32            98%          Memory latency (brief)
Softmax              48             16-24            92%          Memory BW saturation
Elementwise (ReLU)   64             32-48            99%          Minimal stalls
AllReduce (NCCL)     16             4-8              60%          Network/NVLink latency
Note: Issue rate is the fraction of cycles where at least one instruction is dispatched. GEMM achieves near-100% issue rate despite moderate occupancy because Tensor Core instructions have high throughput.

Scheduling Implications for Kernel Design

Understanding hardware scheduling leads to concrete optimization strategies:

1. Match grid size to SM count. Use cudaGetDeviceProperties() to query SM count and size your grid as a multiple of SM_count * target_blocks_per_SM.

2. Balance register and shared memory usage. Use __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) to hint the compiler toward the desired occupancy point.

// Tell the compiler to target 2 blocks per SM with 256 threads
__global__ __launch_bounds__(256, 2)
void my_kernel(float* data, int n) {
    // Compiler will use up to 65536 / (256 * 2) = 128 registers per thread
    // and will optimize for 2-block residency
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // ... computation
    }
}

3. Minimize synchronization barriers. Each __syncthreads() stalls all warps in a block until the slowest warp arrives. Reducing barrier frequency improves scheduler flexibility.

4. Interleave compute and memory. The scheduler can dual-issue compute + memory instructions. Structure your code to alternate between arithmetic and load/store operations.

5. Use cooperative groups for flexible synchronization. CUDA cooperative groups allow synchronization at granularities finer than the full thread block, giving the scheduler more eligible warps.

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>  // cg::reduce, cg::plus (CUDA 11+)
namespace cg = cooperative_groups;

__global__ void flexible_sync_kernel(float* data) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    // Warp-level sync instead of block-level
    float val = data[threadIdx.x];
    val = cg::reduce(warp, val, cg::plus<float>());
    // Only 32 threads sync, other warps remain eligible
}

Summary

GPU hardware scheduling is a three-level system: the GigaThread engine distributes thread blocks to SMs, SM warp schedulers select eligible warps every cycle, and dispatch units route instructions to functional units. Performance depends on giving the scheduler enough eligible warps to hide latency (occupancy), avoiding resource bottlenecks (registers, shared memory), sizing grids to avoid partial last waves, and understanding when low occupancy is acceptable (as in FlashAttention). The scheduler is deterministic hardware — it cannot optimize your kernel for you, but understanding its behavior lets you write kernels that keep it fully utilized.