Part of series: CUDA Kernel Engineering, 10 of 32

A histogram kernel with 10,000 threads all calling atomicAdd(&bin[x], 1) achieves 2 GB/s on an A100. The bottleneck is atomic contention: when multiple threads atomic-add to the same bin, the hardware serializes the operations. Each atomic costs roughly 400 cycles of latency, and with heavy contention, effective throughput collapses. Restructure with hierarchical reduction (warp-level reduction using shuffles, block-level reduction using shared memory, then one atomic per block) and throughput jumps to 180 GB/s. The 90x speedup comes from replacing 10,000 contended global atomics with roughly 10,000 register operations, one shared-memory update per warp (~313), and one global atomic per block (~40).

The solution is hierarchical reduction: first reduce within a warp using shuffle instructions, then across warps within a block using shared memory, and finally across blocks using a small number of global atomics. This post covers atomic operations, their performance characteristics, and complete implementations of histogram and parallel prefix sum.

Atomic Operations: Performance Characteristics

atomicAdd Throughput Under Contention

#include <cuda_runtime.h>
#include <cstdio>

// Benchmark atomicAdd to a single address (worst case: max contention)
__global__ void atomic_single_address(int* counter, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicAdd(counter, 1);
    }
}

// Benchmark atomicAdd to many addresses (best case: no contention)
__global__ void atomic_distributed(int* counters, int n, int num_bins) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = idx % num_bins;
        atomicAdd(&counters[bin], 1);
    }
}

// Host benchmark
void benchmark_atomics() {
    int n = 1 << 20;  // 1M threads

    int* d_counter;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));

    int block = 256;
    int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Single address (maximum contention)
    cudaEventRecord(start);
    atomic_single_address<<<grid, block>>>(d_counter, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms_single;
    cudaEventElapsedTime(&ms_single, start, stop);

    // Distributed across 1024 bins
    int* d_bins;
    cudaMalloc(&d_bins, 1024 * sizeof(int));
    cudaMemset(d_bins, 0, 1024 * sizeof(int));

    cudaEventRecord(start);
    atomic_distributed<<<grid, block>>>(d_bins, n, 1024);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms_distributed;
    cudaEventElapsedTime(&ms_distributed, start, stop);

    printf("1M atomicAdd to 1 address:    %.3f ms\n", ms_single);
    printf("1M atomicAdd to 1024 bins:    %.3f ms\n", ms_distributed);
    printf("Speedup from distribution: %.1fx\n",
           ms_single / ms_distributed);

    cudaFree(d_counter);
    cudaFree(d_bins);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
📊

atomicAdd Throughput vs Number of Contended Addresses (A100)

Target Addresses       | Time (1M ops) | Throughput (Gops/s) | Relative
1                      | 3.2 ms        | 0.33                | 1.0x (worst)
32                     | 0.42 ms       | 2.5                 | 7.6x
256                    | 0.11 ms       | 9.5                 | 29x
1024                   | 0.048 ms      | 21.8                | 66x
65536                  | 0.021 ms      | 49.8                | 151x
1048576 (1 per thread) | 0.018 ms      | 58.1                | 176x (best)
Note: Atomic throughput scales roughly linearly with the number of distinct addresses until memory bandwidth saturates. Single-address atomicAdd is 176x slower than distributed.
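The shape of this curve follows from a simple serialization model: spreading n atomics over k addresses shortens the longest serialized chain of updates to about n/k, until an issue/bandwidth floor takes over. A rough Python sketch (the cost constants are illustrative, not measured on any GPU):

```python
def atomic_time_model(n_ops, n_addresses, serialized_cost=1.0, min_cost=0.006):
    """Very rough model: time is dominated by the longest chain of
    serialized updates to one address, floored by the cost of simply
    issuing n_ops operations at full throughput. Units are arbitrary."""
    chain_length = n_ops / n_addresses  # serialized updates per address
    return max(chain_length * serialized_cost, n_ops * min_cost)

n = 1 << 20
t_single = atomic_time_model(n, 1)
t_spread = atomic_time_model(n, 1024)
print(f"1 address:      {t_single:,.0f} (arbitrary units)")
print(f"1024 addresses: {t_spread:,.0f}")
```

The model captures the qualitative behavior in the table: near-linear gains while the chain length dominates, then a plateau once the floor is reached.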

Atomic Types and Their Costs

// Available atomic operations and their hardware support
void atomic_catalog() {
    // Integer atomics (all architectures, fast L2 path)
    // atomicAdd, atomicSub, atomicMin, atomicMax,
    // atomicAnd, atomicOr, atomicXor, atomicCAS, atomicExch

    // FP32 atomicAdd (Kepler+)
    // Uses compare-and-swap internally on older architectures
    // Native hardware unit on Volta+ (fast)

    // FP16 atomicAdd (Volta+, sm_70)
    // Native hardware support for half-precision atomic add

    // FP64 atomicAdd (Pascal+)
    // Much slower than FP32 on most architectures

    // atomicCAS (Compare-And-Swap) -- universal building block
    // Can implement any atomic operation as a CAS retry loop.
    // The canonical form feeds the CAS return value back into the
    // next iteration, avoiding a separate racy re-read of *addr:
    // old = *addr;
    // do {
    //     assumed = old;
    //     new_val = f(assumed, my_value);
    //     old = atomicCAS(addr, assumed, new_val);
    // } while (old != assumed);
}
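The CAS retry loop is easiest to see outside CUDA. This Python sketch (single-threaded, purely illustrative) emulates atomicCAS on a one-element list and builds an atomic max from it:

```python
def atomic_cas(cell, compare, val):
    """Emulate atomicCAS: if *cell == compare, store val;
    always return the old value."""
    old = cell[0]
    if old == compare:
        cell[0] = val
    return old

def atomic_max_via_cas(cell, my_value):
    """Any read-modify-write can be built as a CAS retry loop."""
    old = cell[0]
    while True:
        assumed = old
        new_val = max(assumed, my_value)
        old = atomic_cas(cell, assumed, new_val)
        if old == assumed:  # CAS succeeded: nobody raced us
            return old

cell = [5]
atomic_max_via_cas(cell, 9)
atomic_max_via_cas(cell, 3)
print(cell[0])  # 9
```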

Warp-Level Reduction

Shuffle-Based Reduction

Warp shuffle instructions (__shfl_xor_sync, __shfl_down_sync) exchange register values between threads within a warp at register speed, roughly one instruction per step. No shared memory required.

// Warp-level sum reduction using butterfly pattern
__device__ float warp_reduce_sum(float val) {
    // Butterfly reduction: each step halves the number of
    // active participants
    #pragma unroll
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
    }
    return val;  // All threads in the warp have the final sum
}

// Warp-level max reduction
__device__ float warp_reduce_max(float val) {
    #pragma unroll
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, offset));
    }
    return val;
}

// Warp-level sum with mask (for partial warps)
__device__ float warp_reduce_sum_masked(float val, unsigned mask) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_xor_sync(mask, val, offset);
    }
    return val;
}
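The butterfly pattern can be simulated on the host to see why every lane ends up with the full sum: at each step, lane L exchanges with lane L ^ offset, exactly as __shfl_xor_sync does. A small Python model:

```python
def simulate_warp_reduce_sum(vals, warp_size=32):
    """Simulate the XOR-butterfly reduction: after log2(32) = 5 steps,
    every lane holds the sum of all 32 inputs."""
    assert len(vals) == warp_size
    vals = list(vals)
    offset = warp_size // 2
    while offset > 0:
        # __shfl_xor_sync(mask, val, offset) gives each lane the value
        # held by lane (lane ^ offset); all lanes update simultaneously
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(warp_size)]
        offset //= 2
    return vals

lanes = simulate_warp_reduce_sum(list(range(32)))
print(lanes[0], lanes[31])  # every lane holds 0+1+...+31 = 496
```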

Why Warp Reductions Are Fast

def warp_reduction_analysis():
    """Analyze warp reduction cost."""
    # __shfl_xor_sync: 1 cycle latency (register-to-register)
    # 5 steps for 32 threads (log2(32) = 5)
    # Total: 5 cycles per warp

    # Compare to shared memory reduction:
    # Shared memory write: ~20 cycles
    # __syncthreads: ~20 cycles
    # Shared memory read: ~20 cycles
    # Repeat log2(32) = 5 times
    # Total: ~300 cycles per warp

    print("Warp shuffle reduction: ~5 cycles")
    print("Shared memory reduction: ~300 cycles")
    print("Speedup: ~60x")
    print()
    print("But shared memory handles arbitrary block sizes,")
    print("while shuffles only work within a single warp (32 threads).")

Block-Level Reduction

Combining Warp and Block Reductions

For blocks larger than one warp, reduce within each warp first, then combine warp results using shared memory:

// Block-level sum reduction (up to 1024 threads per block)
template <int BLOCK_SIZE>
__device__ float block_reduce_sum(float val) {
    // Step 1: Warp-level reduction
    val = warp_reduce_sum(val);

    // Step 2: Store warp results to shared memory
    __shared__ float warp_results[32];  // Max 32 warps per block
    int warp_id = threadIdx.x / warpSize;
    int lane = threadIdx.x % warpSize;

    if (lane == 0) {
        warp_results[warp_id] = val;
    }
    __syncthreads();

    // Step 3: First warp reduces across all warp results
    constexpr int NUM_WARPS = BLOCK_SIZE / 32;
    if (warp_id == 0) {
        val = (lane < NUM_WARPS) ? warp_results[lane] : 0.0f;
        val = warp_reduce_sum(val);
    }

    return val;  // Result is in thread 0 of the block
}

// Block-level max reduction
template <int BLOCK_SIZE>
__device__ float block_reduce_max(float val) {
    val = warp_reduce_max(val);

    __shared__ float warp_results[32];
    int warp_id = threadIdx.x / warpSize;
    int lane = threadIdx.x % warpSize;

    if (lane == 0) {
        warp_results[warp_id] = val;
    }
    __syncthreads();

    constexpr int NUM_WARPS = BLOCK_SIZE / 32;
    if (warp_id == 0) {
        val = (lane < NUM_WARPS) ? warp_results[lane] : -INFINITY;
        val = warp_reduce_max(val);
    }

    return val;
}

Using Block Reduction for Normalization

// Fused RMSNorm using block-level reduction
__global__ void rmsnorm_kernel(
    float* __restrict__ output,
    const float* __restrict__ input,
    const float* __restrict__ weight,
    int hidden_dim,
    float eps
) {
    // One block per row (token)
    int row = blockIdx.x;
    const float* x = input + row * hidden_dim;
    float* y = output + row * hidden_dim;

    // Each thread accumulates partial sum-of-squares
    float local_ss = 0.0f;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        float val = x[i];
        local_ss += val * val;
    }

    // Block-level reduction (assumes a 256-thread launch, matching
    // the template parameter)
    float ss = block_reduce_sum<256>(local_ss);

    // Broadcast result
    __shared__ float rms_scale;
    if (threadIdx.x == 0) {
        rms_scale = rsqrtf(ss / hidden_dim + eps);
    }
    __syncthreads();

    // Apply normalization
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        y[i] = x[i] * rms_scale * weight[i];
    }
}
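When validating a kernel like this, a host-side reference is handy. A NumPy sketch of the same formula, y = x * rsqrt(mean(x^2) + eps) * weight (the shapes and eps default here are illustrative):

```python
import numpy as np

def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm over the last axis, matching the kernel:
    y = x * rsqrt(mean(x**2) + eps) * weight."""
    ss = np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True)
    return (x * (1.0 / np.sqrt(ss + eps)) * weight).astype(x.dtype)

x = np.random.default_rng(0).standard_normal((4, 128)).astype(np.float32)
w = np.ones(128, dtype=np.float32)
y = rmsnorm_reference(x, w)
print(y.shape)  # (4, 128)
```

With unit weights, each output row has mean square ~1, which is a quick invariant to assert against the kernel's output.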

Multi-Block Reduction

The Two-Phase Approach

When the data is too large for a single block, you need a multi-block reduction. The standard approach uses two kernel launches:

// Phase 1: Each block reduces a portion of the data
__global__ void reduce_phase1(
    const float* __restrict__ input,
    float* __restrict__ partial_sums,
    int n
) {
    float local_sum = 0.0f;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop: each thread handles multiple elements
    for (int i = tid; i < n; i += stride) {
        local_sum += input[i];
    }

    // Block-level reduction
    local_sum = block_reduce_sum<256>(local_sum);

    // Write block result to global memory
    if (threadIdx.x == 0) {
        partial_sums[blockIdx.x] = local_sum;
    }
}

// Phase 2: Single block reduces partial sums
__global__ void reduce_phase2(
    const float* __restrict__ partial_sums,
    float* __restrict__ result,
    int num_blocks
) {
    float local_sum = 0.0f;
    for (int i = threadIdx.x; i < num_blocks; i += blockDim.x) {
        local_sum += partial_sums[i];
    }

    local_sum = block_reduce_sum<256>(local_sum);

    if (threadIdx.x == 0) {
        result[0] = local_sum;
    }
}
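The two-phase structure is easy to sanity-check on the host: phase 1 produces one partial sum per block from a grid-stride slice, phase 2 reduces the partials, and the result must match a direct sum. A Python model of the assignment logic:

```python
def two_phase_sum(data, block_size=256):
    """Host model of the two-phase reduction: each 'block' reduces its
    grid-stride slice into a partial sum, then the partials are reduced."""
    n = len(data)
    num_blocks = (n + block_size - 1) // block_size
    stride = num_blocks * block_size  # gridDim.x * blockDim.x
    # Phase 1: one partial sum per block
    partials = [0.0] * num_blocks
    for b in range(num_blocks):
        for t in range(block_size):
            i = b * block_size + t  # tid
            while i < n:            # grid-stride loop
                partials[b] += data[i]
                i += stride
    # Phase 2: reduce the partials
    return sum(partials)

print(two_phase_sum(list(range(10_000))))  # 49995000.0
```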

Single-Kernel Multi-Block Reduction (Cooperative Groups)

With cooperative groups (CUDA 9.0+), you can synchronize across all blocks in a single kernel, avoiding the two-phase approach:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cooperative_reduce(
    const float* __restrict__ input,
    float* __restrict__ result,
    float* __restrict__ partial_sums,
    int n
) {
    cg::grid_group grid = cg::this_grid();

    float local_sum = 0.0f;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = tid; i < n; i += stride) {
        local_sum += input[i];
    }

    // Block-level reduction
    local_sum = block_reduce_sum<256>(local_sum);

    if (threadIdx.x == 0) {
        partial_sums[blockIdx.x] = local_sum;
    }

    // Grid-wide synchronization
    grid.sync();

    // Block 0 reduces partial sums
    if (blockIdx.x == 0) {
        local_sum = 0.0f;
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x) {
            local_sum += partial_sums[i];
        }
        local_sum = block_reduce_sum<256>(local_sum);
        if (threadIdx.x == 0) {
            result[0] = local_sum;
        }
    }
}
⚠️ Cooperative Kernel Limitations

Cooperative kernels must be launched with cudaLaunchCooperativeKernel, and every block in the grid must be resident on the device simultaneously for grid.sync() to make progress. This means you cannot launch an arbitrarily large grid: the limit is (max active blocks per SM for this kernel and block size) times (number of SMs). Use cudaOccupancyMaxActiveBlocksPerMultiprocessor to determine the per-SM limit.
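The bound itself is simple arithmetic: max cooperative grid = (max active blocks per SM) × (number of SMs). A sketch with illustrative numbers (108 SMs is the A100 count; 2 blocks/SM is an assumed occupancy-API result, not a measured one):

```python
def max_cooperative_grid(blocks_per_sm, num_sms):
    """Every block of a cooperative launch must be resident at once,
    so the grid size cannot exceed this product."""
    return blocks_per_sm * num_sms

# Hypothetical values: cudaOccupancyMaxActiveBlocksPerMultiprocessor
# reports 2 resident blocks/SM for a 256-thread block of this kernel,
# and the device has 108 SMs (A100).
grid = max_cooperative_grid(blocks_per_sm=2, num_sms=108)
print(grid)  # 216
```

Use a grid-stride loop inside the kernel so this capped grid still covers arbitrarily large inputs.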

Implementation 1: Histogram Kernel

Naive Histogram (Global Atomics)

// Naive: every thread atomicAdd to global histogram
__global__ void histogram_naive(
    const int* __restrict__ data,
    int* __restrict__ hist,
    int n,
    int num_bins
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = data[idx] % num_bins;
        atomicAdd(&hist[bin], 1);
    }
}

Optimized Histogram (Shared Memory + Warp-Level)

// Optimized: per-block histogram in shared memory,
// then atomicAdd to global histogram

template <int NUM_BINS>
__global__ void histogram_optimized(
    const int* __restrict__ data,
    int* __restrict__ hist,
    int n
) {
    // Per-block histogram in shared memory
    __shared__ int local_hist[NUM_BINS];

    // Initialize shared memory histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        local_hist[i] = 0;
    }
    __syncthreads();

    // Each thread processes multiple elements (grid-stride loop)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = tid; i < n; i += stride) {
        int bin = data[i] % NUM_BINS;
        // Atomic add to shared memory (much faster than global)
        atomicAdd(&local_hist[bin], 1);
    }
    __syncthreads();

    // Write per-block histogram to global histogram
    // Only NUM_BINS global atomics per block (not n/gridSize)
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        if (local_hist[i] > 0) {
            atomicAdd(&hist[i], local_hist[i]);
        }
    }
}

// Even more optimized: per-warp histograms to reduce
// shared memory contention
template <int NUM_BINS>
__global__ void histogram_warp_privatized(
    const int* __restrict__ data,
    int* __restrict__ hist,
    int n
) {
    // Each warp gets its own histogram in shared memory
    constexpr int WARPS_PER_BLOCK = 8;  // blockDim.x / 32
    __shared__ int warp_hist[WARPS_PER_BLOCK][NUM_BINS];

    int warp_id = threadIdx.x / 32;
    int lane = threadIdx.x % 32;

    // Initialize
    for (int i = lane; i < NUM_BINS; i += 32) {
        warp_hist[warp_id][i] = 0;
    }
    __syncwarp();

    // Accumulate into per-warp histogram
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = tid; i < n; i += stride) {
        int bin = data[i] % NUM_BINS;
        atomicAdd(&warp_hist[warp_id][bin], 1);
    }
    __syncthreads();

    // Reduce per-warp histograms to global
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        int sum = 0;
        for (int w = 0; w < WARPS_PER_BLOCK; w++) {
            sum += warp_hist[w][i];
        }
        if (sum > 0) {
            atomicAdd(&hist[i], sum);
        }
    }
}
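The merge step of the privatized version is easy to check on the host: per-warp counts summed bin-by-bin must reproduce a plain histogram. A Python model (element-to-warp assignment is simplified to round-robin):

```python
def privatized_histogram(data, num_bins, num_warps=8):
    """Host model of warp-privatized histogramming: accumulate into
    per-warp histograms, then merge with one add per (warp, bin)."""
    warp_hist = [[0] * num_bins for _ in range(num_warps)]
    for i, x in enumerate(data):
        w = i % num_warps  # which warp handles this element
        warp_hist[w][x % num_bins] += 1
    # Merge: in the kernel, this is the only step touching global memory
    return [sum(warp_hist[w][b] for w in range(num_warps))
            for b in range(num_bins)]

data = [7, 7, 3, 1, 7, 300, 3]
hist = privatized_histogram(data, num_bins=256)
print(hist[7], hist[3], hist[44])  # 3 2 1  (300 % 256 == 44)
```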
📊

Histogram Kernel Performance (10M elements, 256 bins, A100)

Implementation           | Time (ms) | Speedup | Global Atomics
Naive (global atomicAdd) | 1.82      | 1.0x    | 10M
Block shared memory      | 0.28      | 6.5x    | ~1K (grid blocks × bins)
Warp-privatized          | 0.19      | 9.6x    | ~1K
CUB DeviceHistogram      | 0.16      | 11.4x   | implementation-specific
Note: Each optimization level reduces contention on shared/global atomics. The warp-privatized version removes shared-memory atomic contention between warps, since each warp updates its own private copy of the histogram.

Implementation 2: Parallel Prefix Sum (Scan)

The Algorithm

Parallel prefix sum (inclusive scan) computes, for input [a_0, a_1, ..., a_{n-1}], the output [a_0, a_0+a_1, a_0+a_1+a_2, ...]. This is a fundamental primitive used in stream compaction, radix sort, and attention mask generation.
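A one-line host reference is worth keeping next to any scan kernel; Python's itertools.accumulate computes exactly this inclusive scan:

```python
from itertools import accumulate

def inclusive_scan_reference(xs):
    """Inclusive prefix sum: out[i] = xs[0] + ... + xs[i]."""
    return list(accumulate(xs))

print(inclusive_scan_reference([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]
```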

Block-Level Scan

// Work-efficient (Brent-Kung-style) inclusive scan within a block.
// (Blelloch's classic formulation is an exclusive scan; this variant
//  keeps the up-sweep but uses an additive down-sweep to stay inclusive.)
template <int BLOCK_SIZE>
__device__ void block_inclusive_scan(
    float* __restrict__ data,
    float* __restrict__ shared_data
) {
    int tid = threadIdx.x;

    // Load data to shared memory
    shared_data[tid] = (tid < BLOCK_SIZE) ? data[tid] : 0.0f;
    __syncthreads();

    // Up-sweep (reduce) phase
    for (int stride = 1; stride < BLOCK_SIZE; stride *= 2) {
        int index = (tid + 1) * stride * 2 - 1;
        if (index < BLOCK_SIZE) {
            shared_data[index] += shared_data[index - stride];
        }
        __syncthreads();
    }

    // Down-sweep phase
    for (int stride = BLOCK_SIZE / 4; stride > 0; stride /= 2) {
        int index = (tid + 1) * stride * 2 - 1;
        if (index + stride < BLOCK_SIZE) {
            shared_data[index + stride] += shared_data[index];
        }
        __syncthreads();
    }

    // Write back
    if (tid < BLOCK_SIZE) {
        data[tid] = shared_data[tid];
    }
}

Multi-Block Scan (Three-Phase)

// Phase 1: Block-level scan + save block sums
__global__ void scan_phase1(
    float* __restrict__ data,
    float* __restrict__ block_sums,
    int n
) {
    __shared__ float shared[1024];
    int block_start = blockIdx.x * blockDim.x;
    int tid = threadIdx.x;
    int gid = block_start + tid;

    // Load
    shared[tid] = (gid < n) ? data[gid] : 0.0f;
    __syncthreads();

    // Inclusive scan within block (Hillis-Steele)
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float temp = 0.0f;
        if (tid >= stride) {
            temp = shared[tid - stride];
        }
        __syncthreads();
        shared[tid] += temp;
        __syncthreads();
    }

    // Write scanned data back
    if (gid < n) {
        data[gid] = shared[tid];
    }

    // Save the last element as this block's sum
    if (tid == blockDim.x - 1) {
        block_sums[blockIdx.x] = shared[tid];
    }
}

// Phase 2: Scan the block sums themselves
// (Use phase 1 on the block_sums array; if more than one block
//  is needed, recurse)

// Phase 3: Add scanned block sums to each block's elements
__global__ void scan_phase3(
    float* __restrict__ data,
    const float* __restrict__ scanned_block_sums,
    int n
) {
    if (blockIdx.x == 0) return;  // First block has no prefix to add

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        data[gid] += scanned_block_sums[blockIdx.x - 1];
    }
}

// Host-side orchestration
void parallel_inclusive_scan(float* d_data, int n) {
    int block_size = 1024;
    int num_blocks = (n + block_size - 1) / block_size;

    float* d_block_sums;
    cudaMalloc(&d_block_sums, num_blocks * sizeof(float));

    // Phase 1: Scan within blocks
    scan_phase1<<<num_blocks, block_size>>>(d_data, d_block_sums, n);

    // Phase 2: Scan block sums
    if (num_blocks > 1) {
        if (num_blocks <= block_size) {
            // Single block can handle all block sums
            float* d_dummy;
            cudaMalloc(&d_dummy, sizeof(float));
            scan_phase1<<<1, num_blocks>>>(d_block_sums, d_dummy,
                                            num_blocks);
            cudaFree(d_dummy);
        } else {
            // Recursive scan for very large arrays
            parallel_inclusive_scan(d_block_sums, num_blocks);
        }
    }

    // Phase 3: Propagate block sums
    scan_phase3<<<num_blocks, block_size>>>(
        d_data, d_block_sums, n
    );

    cudaFree(d_block_sums);
}
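The three phases can be modeled on the host and checked against a direct scan. This Python sketch mirrors the kernels above: scan each block, scan the block sums, then add the previous block's scanned sum to every element of later blocks:

```python
from itertools import accumulate

def three_phase_scan(xs, block_size=4):
    """Host model of the multi-block inclusive scan."""
    # Phase 1: scan each block independently, record block sums
    blocks = [xs[i:i + block_size] for i in range(0, len(xs), block_size)]
    scanned = [list(accumulate(b)) for b in blocks]
    block_sums = [s[-1] for s in scanned]
    # Phase 2: scan the block sums
    scanned_sums = list(accumulate(block_sums))
    # Phase 3: every block after the first adds the previous
    # block's scanned sum to all of its elements
    out = []
    for b, s in enumerate(scanned):
        prefix = scanned_sums[b - 1] if b > 0 else 0
        out.extend(v + prefix for v in s)
    return out

print(three_phase_scan(list(range(1, 11))))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```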

Warp-Level Scan (Faster for Small Arrays)

// Inclusive scan within a warp using __shfl_up_sync
__device__ float warp_inclusive_scan(float val) {
    unsigned mask = 0xFFFFFFFF;

    #pragma unroll
    for (int offset = 1; offset < 32; offset *= 2) {
        float n = __shfl_up_sync(mask, val, offset);
        if (threadIdx.x % 32 >= offset) {
            val += n;
        }
    }
    return val;
}
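Simulating the shfl_up-based scan on the host clarifies the guard: lanes below the current offset keep their value because their shuffle source would be out of range. A Python model:

```python
def simulate_warp_inclusive_scan(vals, warp_size=32):
    """Simulate the shfl_up-based Hillis-Steele scan over one warp."""
    assert len(vals) == warp_size
    vals = list(vals)
    offset = 1
    while offset < warp_size:
        # __shfl_up_sync gives lane L the value from lane (L - offset);
        # the (lane >= offset) guard skips lanes with no valid source
        vals = [vals[lane] + (vals[lane - offset] if lane >= offset else 0)
                for lane in range(warp_size)]
        offset *= 2
    return vals

result = simulate_warp_inclusive_scan([1] * 32)
print(result[:5], "...", result[-1])  # [1, 2, 3, 4, 5] ... 32
```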

// Optimized block scan using warp scan as building block
template <int BLOCK_SIZE>
__global__ void optimized_block_scan(
    float* __restrict__ data,
    float* __restrict__ block_sums,
    int n
) {
    constexpr int NUM_WARPS = BLOCK_SIZE / 32;
    __shared__ float warp_sums[NUM_WARPS];

    int gid = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int warp_id = threadIdx.x / 32;
    int lane = threadIdx.x % 32;

    // Load
    float val = (gid < n) ? data[gid] : 0.0f;

    // Warp-level inclusive scan
    val = warp_inclusive_scan(val);

    // Save warp sums
    if (lane == 31) {
        warp_sums[warp_id] = val;
    }
    __syncthreads();

    // Scan warp sums: all 32 lanes of warp 0 must participate,
    // because warp_inclusive_scan uses full-mask shuffles
    // (letting lanes >= NUM_WARPS exit early would be undefined behavior)
    if (warp_id == 0) {
        float ws = (lane < NUM_WARPS) ? warp_sums[lane] : 0.0f;
        ws = warp_inclusive_scan(ws);
        if (lane < NUM_WARPS) {
            warp_sums[lane] = ws;
        }
    }
    __syncthreads();

    // Add prefix from previous warps
    if (warp_id > 0) {
        val += warp_sums[warp_id - 1];
    }

    // Write back
    if (gid < n) {
        data[gid] = val;
    }

    // Save block sum
    if (threadIdx.x == BLOCK_SIZE - 1) {
        block_sums[blockIdx.x] = val;
    }
}
📊

Parallel Prefix Sum Performance (16M elements, FP32, A100)

Implementation              | Time (ms) | Bandwidth (GB/s) | Relative
Naive Hillis-Steele scan    | 2.45      | 52.4             | 1.0x
Blelloch with shared memory | 0.82      | 156.6            | 3.0x
Warp-scan + block-scan      | 0.51      | 251.7            | 4.8x
CUB DeviceScan              | 0.38      | 337.5            | 6.4x
Note: The warp-scan approach is 4.8x faster than naive because it uses register-speed shuffles for the innermost 32-element scan. CUB DeviceScan adds adaptive tuning.

Reduction Pattern Performance Hierarchy

(relative throughput, higher is better)

Global atomicAdd (baseline)        1x
Shared-memory atomicAdd            8x
Shared-memory tree reduction      30x
Warp shuffle reduction            60x
Warp shuffle + shared (optimal)  100x

Practical Guidelines

Choosing the Right Reduction Strategy

def reduction_strategy_guide():
    """Guide for choosing reduction implementation."""
    strategies = {
        "32 or fewer values": {
            "method": "Single warp shuffle reduction",
            "cost": "5 cycles",
            "example": "Reduce within a warp",
        },
        "33-1024 values": {
            "method": "Block reduction (warp shuffle + shared memory)",
            "cost": "~50 cycles",
            "example": "Reduce hidden dimension for LayerNorm",
        },
        "1K-1M values": {
            "method": "Multi-block: block reduction + global atomic",
            "cost": "~500 cycles + 1 atomic per block",
            "example": "Loss reduction, gradient accumulation",
        },
        "More than 1M values": {
            "method": "Two-phase: block reduction + scan of block sums",
            "cost": "2 kernel launches",
            "example": "Full-array reduction, histogram",
        },
    }

    for size_range, info in strategies.items():
        print(f"\n{size_range}:")
        for key, val in info.items():
            print(f"  {key}: {val}")
💡 Use CUB or Thrust for Production Code

NVIDIA's CUB library (cub::DeviceReduce, cub::DeviceScan, cub::DeviceHistogram) provides highly optimized implementations of all reduction patterns, auto-tuned for each GPU architecture. Write custom reduction kernels only when you need to fuse the reduction with other operations (e.g., LayerNorm = reduction + elementwise) or when the CUB API does not cover your specific pattern.

Summary

Atomic operations are simple but scale poorly under contention. The optimization hierarchy is: warp-level shuffle reductions (register speed, 32 threads), block-level reductions (shared memory, up to 1024 threads), and multi-block reductions (global memory coordination, unlimited threads). Each level reduces the number of global atomics by a factor of 32-256, producing 10-100x speedups over naive global atomics.

The two concrete implementations, histogram and prefix sum, demonstrate these patterns in practice. The histogram kernel moves from 10M global atomics (naive) to under 1K (warp-privatized shared memory histograms merged with a single round of global atomics). The prefix sum uses warp-level scans as building blocks for an efficient multi-block scan that achieves 250+ GB/s effective bandwidth on an A100.