Part of the CUDA Kernel Engineering series (22 of 32)

A Llama-7B decode step launches 180 CUDA kernels. At 8 microseconds per launch, that is 1.44 milliseconds of pure CPU-side overhead before the GPU computes a single token. For batch size 1 decode where the actual compute takes 3 milliseconds, launch overhead consumes 32% of total latency. Persistent kernels eliminate this by launching once at session start and running continuously, pulling work items from a device-side queue. Thread blocks never exit. They loop, consuming operations as they arrive. TensorRT-LLM’s persistent attention kernel reduces decode latency by 25-30% compared to traditional launch-per-operation — purely by eliminating launch tax.

Persistent kernels invert the traditional model. Instead of launching many short-lived kernels, you launch a single kernel whose thread blocks run for the entire inference session. This eliminates kernel launch overhead entirely and enables direct device-side scheduling of work to SMs.

The Kernel Launch Overhead Problem

Anatomy of a Kernel Launch

Every kernel<<<grid, block, smem, stream>>>() call triggers this sequence:

// Host-side overhead (CPU):
// 1. Validate launch parameters (grid, block, shared memory)    ~0.5 us
// 2. Copy kernel arguments to device-accessible memory           ~0.5 us
// 3. Submit command to the CUDA driver command queue             ~1-3 us
// 4. Driver enqueues command in the GPU command buffer           ~1-2 us
// Total host-side: ~3-7 us per launch

// Device-side overhead (GPU):
// 1. Command processor reads command from queue                  ~1-2 us
// 2. GigaThread engine schedules CTAs to SMs                     ~1-3 us
// 3. SM loads kernel binary, allocates registers/shared memory   ~0.5-1 us
// Total device-side: ~2-6 us per launch

// Combined: 5-13 us per kernel launch
// For a model with 32 layers, ~7 kernels per layer = 224 launches
// At 10 us average: 224 * 10 us = 2.24 ms just in launch overhead
// For a decode step that takes 8 ms total, that is 28% overhead!

CUDA Graphs: Partial Solution

CUDA Graphs pre-record a sequence of kernel launches and replay them in a single submission:

// CUDA Graph captures a sequence of operations
cudaStream_t stream;
cudaStreamCreate(&stream);

// Begin capture
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// Record kernels (they don't execute yet)
kernel_A<<<gridA, blockA, 0, stream>>>(args_A);
kernel_B<<<gridB, blockB, 0, stream>>>(args_B);
kernel_C<<<gridC, blockC, 0, stream>>>(args_C);

// End capture
cudaStreamEndCapture(stream, &graph);

// Instantiate executable graph
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Replay: all three kernels launched with minimal overhead
// Host overhead: ~5 us for the entire graph (not per kernel)
cudaGraphLaunch(graphExec, stream);

CUDA Graphs reduce host-side overhead but have limitations:

CUDA Graph limitations for inference serving:
1. Fixed topology: cannot change the graph structure per request
   (e.g., different sequence lengths require different graph instances)
2. Fixed arguments: kernel arguments must be updated via API calls
   before each replay (cudaGraphExecKernelNodeSetParams)
3. No conditional execution: cannot skip kernels based on runtime data
4. Memory: each graph instance allocates its own memory
5. Capture restrictions: CPU-synchronizing calls (cudaStreamSynchronize,
   synchronous cudaMemcpy) are illegal inside stream capture

Persistent Kernel Architecture

The Core Pattern

A persistent kernel is a single kernel launch whose thread blocks loop indefinitely, consuming work items from a device-side queue:

// Device-side work queue structure
struct WorkItem {
    enum class OpType : uint32_t {
        GEMM = 0,
        LAYERNORM = 1,
        ATTENTION = 2,
        SOFTMAX = 3,
        EXIT = 0xFFFFFFFF
    };

    OpType type;
    uint32_t batch_size;
    uint32_t seq_len;
    uint32_t hidden_dim;

    // Pointers to input/output tensors (device pointers)
    void* input;
    void* weight;
    void* output;
    float* params;  // Additional parameters (scale, bias, etc.)
};

struct WorkQueue {
    WorkItem items[MAX_QUEUE_SIZE];
    volatile uint32_t head;     // Written by host (producer)
    volatile uint32_t tail;     // Written by device (consumer)
    volatile uint32_t shutdown;  // Signal to exit
};

// Persistent kernel: runs for the lifetime of the inference session
__global__ void persistent_inference_kernel(WorkQueue* queue) {
    int cta_id = blockIdx.x;
    int num_ctas = gridDim.x;

    while (true) {
        // Step 1: Thread 0 of each CTA claims the next work item
        __shared__ WorkItem current_item;
        __shared__ bool has_work;
        __shared__ bool should_exit;

        if (threadIdx.x == 0) {
            has_work = false;
            should_exit = false;

            // Spin-wait for new work; each CTA claims a slot via atomicCAS
            do {
                uint32_t my_slot = queue->tail;  // Volatile read
                uint32_t head = queue->head;     // Volatile read (host-written,
                                                 // so do not cache via __ldg)
                if (my_slot < head) {
                    // Try to claim this slot
                    if (atomicCAS((uint32_t*)&queue->tail,
                                  my_slot, my_slot + 1) == my_slot) {
                        current_item = queue->items[my_slot % MAX_QUEUE_SIZE];
                        has_work = true;
                        break;
                    }
                    continue;  // Lost the race to another CTA: retry
                }

                // Check for shutdown. Set a flag rather than returning here:
                // returning from thread 0 alone would leave the rest of the
                // block stuck at the __syncthreads() below.
                if (queue->shutdown) {
                    should_exit = true;
                    break;
                }

                // Yield: avoid burning power on spin-wait
                __nanosleep(100);  // Sleep 100 ns (Ampere and later)
            } while (true);
        }

        __syncthreads();

        if (should_exit) return;
        if (!has_work) continue;
        if (current_item.type == WorkItem::OpType::EXIT) return;

        // Step 2: Dispatch based on operation type
        switch (current_item.type) {
            case WorkItem::OpType::GEMM:
                persistent_gemm(current_item, cta_id, num_ctas);
                break;
            case WorkItem::OpType::LAYERNORM:
                persistent_layernorm(current_item, cta_id, num_ctas);
                break;
            case WorkItem::OpType::ATTENTION:
                persistent_attention(current_item, cta_id, num_ctas);
                break;
            default:
                break;
        }

        __syncthreads();  // Ensure all threads finish before next item
    }
}
⚠️ Warning

The spin-wait in the persistent kernel must be carefully designed to avoid consuming excessive power and causing thermal throttling. Using __nanosleep() (Ampere and later) or a busy-wait with back-off prevents the SM from burning power while waiting. On older architectures without __nanosleep(), a __threadfence_system() can serve as a yield point.

Work Distribution Among CTAs

For operations that require multiple CTAs (like a large GEMM), the persistent kernel must distribute tiles across the available CTAs:

// Persistent GEMM: distribute tiles across persistent CTAs
__device__ void persistent_gemm(
    const WorkItem& item,
    int cta_id,
    int num_ctas
) {
    int M = item.batch_size * item.seq_len;
    int N = item.hidden_dim;
    int K = item.hidden_dim;

    // Tile sizes
    constexpr int TILE_M = 128;
    constexpr int TILE_N = 128;

    int tiles_m = (M + TILE_M - 1) / TILE_M;
    int tiles_n = (N + TILE_N - 1) / TILE_N;
    int total_tiles = tiles_m * tiles_n;

    // Each CTA processes tiles in a strided pattern
    // CTA 0: tiles 0, num_ctas, 2*num_ctas, ...
    // CTA 1: tiles 1, 1+num_ctas, 1+2*num_ctas, ...
    for (int tile_idx = cta_id; tile_idx < total_tiles; tile_idx += num_ctas) {
        int tile_m = tile_idx / tiles_n;
        int tile_n = tile_idx % tiles_n;

        // Compute this tile: C[tile_m, tile_n] = A[tile_m, :] * B[:, tile_n]
        compute_gemm_tile(
            (half*)item.input, (half*)item.weight, (half*)item.output,
            M, N, K,
            tile_m * TILE_M, tile_n * TILE_N,
            TILE_M, TILE_N
        );
    }

    // Completion detection: count arriving CTAs with a device-global
    // atomic counter (global_tile_counter is a zero-initialized
    // __device__ int). Each CTA arrives exactly once, so the last
    // arrival is number num_ctas, not total_tiles.
    __shared__ bool last_cta;
    if (threadIdx.x == 0) {
        int prev = atomicAdd(&global_tile_counter, 1);
        last_cta = (prev == num_ctas - 1);
        if (last_cta) {
            // Reset for the next work item. (A production version needs
            // a generation counter so a fast CTA starting the next item
            // cannot race this reset.)
            global_tile_counter = 0;
        }
    }
    __syncthreads();

    // Only the last CTA to finish signals the host
    if (last_cta && threadIdx.x == 0) {
        __threadfence_system();  // Make results visible system-wide,
        // then write a host-visible completion flag (mapped memory)
    }
}

Host-Device Synchronization

Signaling Between Host and Device

The host produces work items and the device consumes them. This requires a lock-free producer-consumer queue with proper memory ordering:

// Host-side: enqueue work items
class PersistentInferenceEngine {
public:
    void enqueue_operation(WorkItem item) {
        // Write the work item to the queue
        uint32_t slot = current_head_ % MAX_QUEUE_SIZE;

        // Copy work item to device-visible memory
        // Using mapped/pinned memory for zero-copy access
        queue_->items[slot] = item;

        // Memory fence: ensure item is visible before updating head
        __sync_synchronize();  // Host-side memory barrier

        // Update head (atomic store visible to device)
        queue_->head = current_head_ + 1;
        current_head_++;
    }

    void wait_for_completion(uint32_t item_index) {
        // Spin on completion flag
        while (completion_flags_[item_index] == 0) {
            // Optionally yield CPU
            std::this_thread::yield();
        }
    }

private:
    WorkQueue* queue_;          // Mapped memory: accessible from both CPU and GPU
    uint32_t current_head_ = 0;
    volatile uint32_t* completion_flags_;  // Per-item completion signals
};

// Memory allocation for the queue
void setup_queue() {
    // Option 1: Mapped pinned memory (zero-copy)
    WorkQueue* host_queue;
    cudaHostAlloc(&host_queue, sizeof(WorkQueue),
                  cudaHostAllocMapped);
    WorkQueue* device_queue;
    cudaHostGetDevicePointer(&device_queue, host_queue, 0);

    // Option 2: Managed memory with system-scope atomics
    WorkQueue* managed_queue;
    cudaMallocManaged(&managed_queue, sizeof(WorkQueue));

    // Option 3: Separate host and device copies with explicit sync
    // (Fastest for high-throughput, most complex to implement)
}
Performance

The choice between mapped memory and managed memory affects latency. Mapped (pinned) memory provides the lowest host-to-device signaling latency (~1 microsecond on PCIe Gen5) because the GPU reads directly from host memory. Managed memory may incur page migration overhead. For the highest throughput, use device memory with explicit copies batched into the work items.

Persistent Kernels in Production Systems

TensorRT-LLM’s Persistent GEMM

TensorRT-LLM uses persistent thread blocks for GEMM operations in LLM inference, particularly for the decode phase where GEMMs are small and launch overhead dominates:

// Conceptual design of TensorRT-LLM's persistent GEMM approach

// Instead of launching a new GEMM kernel per layer:
// layer1_gemm<<<grid, block>>>(W1, X, Y1);
// layer2_gemm<<<grid, block>>>(W2, Y1, Y2);
// ...
// layer32_gemm<<<grid, block>>>(W32, Y31, Y32);
// (32 launches * ~10 us = 320 us overhead)

// Persistent approach: one kernel processes all GEMMs sequentially
__global__ void persistent_multi_layer_gemm(
    const LayerConfig* configs,  // Array of per-layer configurations
    int num_layers,
    void** weight_ptrs,          // Pointers to weight matrices
    void* workspace              // Shared workspace for intermediates
) {
    for (int layer = 0; layer < num_layers; layer++) {
        const LayerConfig& cfg = configs[layer];

        // All CTAs cooperate on this layer's GEMM
        compute_gemm(
            weight_ptrs[layer], workspace, workspace,
            cfg.M, cfg.N, cfg.K
        );

        // Grid-wide barrier: all CTAs must finish before next layer
        cooperative_groups::grid_group grid = cooperative_groups::this_grid();
        grid.sync();  // Requires cooperative launch
    }
}

// Launch with cooperative kernel launch (required for grid.sync())
void* args[] = {&configs, &num_layers, &weight_ptrs, &workspace};
cudaLaunchCooperativeKernel(
    (void*)persistent_multi_layer_gemm,
    grid_dim, block_dim,
    args, shared_mem_size, stream
);

FlashInfer: Persistent Attention for Continuous Batching

FlashInfer uses persistent kernels for paged attention in continuous batching, where different requests in a batch have different sequence lengths and KV cache layouts:

// FlashInfer's persistent attention concept
// Each CTA handles one or more attention heads for one request
// The work list changes dynamically as requests arrive and complete

struct AttentionWorkItem {
    int request_id;
    int head_idx;
    int q_offset;      // Offset into the query buffer
    int kv_page_table;  // Index into page table for this request's KV cache
    int seq_len;        // Current sequence length for this request
};

__global__ void persistent_paged_attention(
    AttentionWorkList* work_list,
    half* Q,           // All queries packed contiguously
    half** kv_pages,   // Page pool: array of page pointers
    int** page_tables, // Per-request page tables
    half* output
) {
    while (true) {
        // Claim a work item
        int item_idx = atomicAdd(&work_list->next_item, 1);
        if (item_idx >= work_list->num_items) break;

        AttentionWorkItem& item = work_list->items[item_idx];

        // Load query for this head
        half q[HEAD_DIM];
        load_query(Q, item.q_offset, item.head_idx, q);

        // Iterate over KV cache pages for this request
        float max_score = -INFINITY;
        float sum_exp = 0.0f;
        float output_accum[HEAD_DIM] = {0};

        int* page_table = page_tables[item.request_id];
        int num_pages = (item.seq_len + PAGE_SIZE - 1) / PAGE_SIZE;

        for (int page = 0; page < num_pages; page++) {
            int page_idx = page_table[page];
            half* kv_page = kv_pages[page_idx];

            // Compute attention scores for this page
            int tokens_in_page = min(PAGE_SIZE,
                item.seq_len - page * PAGE_SIZE);

            for (int t = 0; t < tokens_in_page; t += TILE_T) {
                // Load K, V from page
                // Compute Q*K^T, apply causal mask, update running softmax
                // Accumulate V weighted by attention scores
                flash_attention_tile(
                    q, kv_page, t, tokens_in_page,
                    &max_score, &sum_exp, output_accum
                );
            }
        }

        // Write final output
        store_output(output, item.q_offset, item.head_idx, output_accum, sum_exp);
    }
}

Grid-Wide Synchronization

Persistent kernels that process multi-CTA operations need grid-wide barriers. CUDA provides two mechanisms:

// Method 1: Cooperative Groups grid sync (clean, requires cooperative launch)
#include <cooperative_groups.h>

__global__ void persistent_with_grid_sync(int* work_counter) {
    namespace cg = cooperative_groups;
    cg::grid_group grid = cg::this_grid();

    while (true) {
        // Phase 1: Each CTA does its portion of work
        int my_tile = blockIdx.x;
        process_tile(my_tile);

        // Grid-wide barrier: all CTAs synchronize
        grid.sync();  // Requires ALL CTAs to reach this point

        // Phase 2: Post-processing (e.g., reduction)
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            // Only one thread advances the work counter
        }
        grid.sync();
    }
}

// Cooperative launch requirement:
// Grid size must not exceed the number of CTAs that can be resident
// simultaneously (SM count * max active blocks per SM, queried via
// cudaOccupancyMaxActiveBlocksPerMultiprocessor)
// This limits parallelism but guarantees forward progress

// Method 2: Atomic counter barrier (works with any grid size)
__device__ volatile int barrier_counter = 0;
__device__ volatile int barrier_flag = 0;

__device__ void grid_barrier(int num_ctas) {
    __syncthreads();  // Intra-CTA sync first

    if (threadIdx.x == 0) {
        // Sample the generation BEFORE arriving. Reading it after the
        // atomicAdd races the last arriver: if the flag flips first,
        // this CTA waits for a generation that never comes (deadlock).
        int gen = atomicAdd((int*)&barrier_flag, 0);  // Read current
        int prev = atomicAdd((int*)&barrier_counter, 1);
        if (prev == num_ctas - 1) {
            // Last CTA to arrive: reset counter, then advance the generation
            barrier_counter = 0;
            __threadfence();
            atomicAdd((int*)&barrier_flag, 1);
        } else {
            // Wait for the generation to advance
            while (atomicAdd((int*)&barrier_flag, 0) == gen) {
                __nanosleep(32);
            }
        }
    }
    __syncthreads();
}
🚨 Danger

Grid-wide barriers with atomic counters are susceptible to deadlock if not all CTAs can be resident simultaneously. If the GPU schedules only a subset of CTAs, the barrier will never complete because the remaining CTAs cannot be launched until an SM is freed. Always ensure that the grid size is at most equal to the number of CTAs that can be simultaneously resident (SM count times maximum CTAs per SM).

Performance Analysis

Launch Overhead Elimination

# Measuring the impact of persistent kernels on launch overhead

# Standard approach: 224 kernel launches per decode step
# Average launch overhead: 8 us per launch
# Total overhead: 224 * 8 = 1,792 us = 1.79 ms

# CUDA Graphs: 1 graph launch per decode step
# Graph launch overhead: ~15 us
# But graph update overhead: ~50 us (for dynamic batch size)
# Total overhead: 65 us

# Persistent kernel: 0 launches during steady state
# Initial launch: ~15 us (one-time)
# Per-step overhead: work queue enqueue ~2 us + completion poll ~1 us = 3 us
# Total overhead: 3 us per decode step

# Savings for Llama-2-7B decode at BS=1:
# Standard decode step: 8.5 ms
# - Compute: 6.7 ms
# - Launch overhead: 1.8 ms
# Persistent decode step: 6.7 ms + 0.003 ms = 6.7 ms
# Speedup: 8.5 / 6.7 = 1.27x (27% faster from zero launch overhead)

Decode Latency Breakdown: Standard vs CUDA Graph vs Persistent (Llama-7B, BS=1)

Approach     Compute   Overhead   Overhead Source
Standard     6.7 ms    1.8 ms     224 kernel launches
CUDA Graph   6.7 ms    0.065 ms   1 graph launch
Persistent   6.7 ms    0.003 ms   Work queue only

When Persistent Kernels Are Not Beneficial

Cases where persistent kernels add complexity without benefit:

1. Large batch sizes (BS >= 64):
   - Individual kernels are long-running (>100 us each)
   - Launch overhead is < 5% of total time
   - CUDA Graphs handle this efficiently

2. Variable-length operations requiring different grid sizes:
   - Persistent kernel must use a fixed grid size
   - If some operations need 132 CTAs and others need 4,
     the 4-CTA operation wastes 128 idle CTAs

3. Multi-GPU tensor parallel:
   - NCCL operations between layers require host-side orchestration
   - Persistent kernel cannot call NCCL from device code
   - Must exit the persistent loop, call NCCL, re-enter

4. Debugging and profiling:
   - Nsight Compute cannot profile individual operations within
     a persistent kernel (it sees one long kernel)
   - Must add manual instrumentation (device-side counters)
📊 Persistent Kernel Benefit by Batch Size (Llama-2-7B, H100)

Batch Size   Standard Latency   Persistent Latency   Speedup   Launch % of Standard
BS=1         8.5 ms             6.7 ms               1.27x     21%
BS=4         10.2 ms            8.8 ms               1.16x     14%
BS=16        15.5 ms            14.4 ms              1.08x     7%
BS=64        38.0 ms            37.2 ms              1.02x     2%
BS=256       125 ms             124.5 ms             1.00x     0.4%

Implementation Considerations

Thread Block Count Selection

The number of persistent CTAs should match the maximum simultaneously resident CTAs on the GPU:

// Query the maximum number of simultaneously resident CTAs
int max_resident_ctas(const void* kernel_function, int block_size,
                      size_t smem_per_block) {
    int max_blocks_per_sm;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm,
        kernel_function,
        block_size,
        smem_per_block
    );

    int num_sms;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    return max_blocks_per_sm * num_sms;
}

// H100: 132 SMs
// Typical persistent kernel: 256 threads, 48 KB shared memory
// Max CTAs per SM: 2 (limited by registers and shared memory)
// Total persistent CTAs: 132 * 2 = 264
//
// Launch: persistent_kernel<<<264, 256, 48*1024>>>(queue);
// These 264 CTAs run for the entire inference session

Power Management

Persistent kernels keep the GPU at full power consumption even when no work is available. The __nanosleep() intrinsic reduces power during idle periods:

// Power-aware spin-wait
__device__ void wait_for_work(volatile uint32_t* flag) {
    int spins = 0;
    while (*flag == 0) {
        if (spins < 100) {
            // Short spin: busy wait (lowest latency)
            __threadfence();
            spins++;
        } else if (spins < 1000) {
            // Medium: light sleep
            __nanosleep(100);  // 100 ns
            spins++;
        } else {
            // Long wait: deep sleep to save power
            __nanosleep(10000);  // 10 us
        }
    }
}

// This adaptive back-off reduces idle power by ~60-80%
// while keeping wake-up latency under 10 us for the common case

Summary

Persistent kernels eliminate kernel launch overhead by keeping thread blocks alive for the duration of an inference session, consuming work items from a device-side queue. The benefit is proportional to the launch overhead fraction: at batch size 1, where launch overhead can be 20%+ of decode latency, persistent kernels yield 1.2-1.3x speedup. At large batch sizes where compute dominates, the benefit is negligible. The key implementation challenges are: grid-wide synchronization (requires cooperative launch or atomic barriers), host-device signaling (mapped memory or managed memory queues), power management during idle periods (__nanosleep back-off), and debugging (Nsight Compute sees a single kernel). Persistent kernels are most valuable in latency-sensitive single-request inference (chatbots, code completion) and are used in production by TensorRT-LLM and FlashInfer.