Part 16 of 32 in the series CUDA Kernel Engineering

Summing 100 million floats on a CPU takes 100 million sequential additions. On a GPU with 10,000 threads, the naive approach takes 10,000 additions per thread but serializes at the end — no faster than the CPU. The optimized approach uses three-level hierarchical reduction: warps reduce 32 values to 1 using shuffle instructions (5 operations, zero memory), blocks reduce across warps using shared memory (one sync barrier per tree level), and the grid reduces across blocks using atomics (one atomic per block). Total work: 100 million reads + 100 million additions + 3,125 atomics. Throughput: 1.8 TB/s on an A100, limited only by memory bandwidth.
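A quick back-of-the-envelope script (numbers taken from the paragraph above; this is a sanity check, not CUDA code) confirms the arithmetic:

```python
import math

n = 100_000_000        # floats to sum (from the text)
atomics = 3_125        # one atomicAdd per block (from the text)
warp_size = 32

elements_per_block = n // atomics        # each block reduces 32,000 elements
shuffle_ops = int(math.log2(warp_size))  # 5 shuffle steps per warp reduction
print(elements_per_block, shuffle_ops)   # 32000 5
```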

All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise.

Reduction Version 0: Single Thread (Baseline)

#include <cuda_runtime.h>

// Terrible: single thread reads all N elements
__global__ void reduce_v0_single_thread(const float* data, float* result, int n) {
    if (threadIdx.x != 0 || blockIdx.x != 0) return;

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += data[i];
    }
    *result = sum;
}

This runs at approximately 10 GB/s on an A100 — about 0.5% of peak bandwidth. Every subsequent version improves on this.
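Every bandwidth figure in this article comes from the same formula: bytes moved divided by elapsed time. A small helper (the function name is my own) shows how the V0 number falls out; reading 16M floats in 6.4 ms works out to 10 GB/s:

```python
def effective_bandwidth_gbs(n_floats, elapsed_ms):
    """Bytes read divided by elapsed time, in GB/s (decimal, as vendors quote)."""
    bytes_moved = n_floats * 4   # one 4-byte float read per element
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9

# Reading 16M floats (64 MB) in 6.4 ms gives roughly the V0 figure
print(round(effective_bandwidth_gbs(16_000_000, 6.4), 1))  # 10.0
```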

Reduction Version 1: Interleaved Addressing (Divergent)

// Each thread block reduces its portion using shared memory
__global__ void reduce_v1_interleaved(const float* data, float* partial, int n) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (idx < n) ? data[idx] : 0.0f;
    __syncthreads();

    // Interleaved addressing: thread i adds element at i + stride
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

Problem: the tid % (2 * s) == 0 check causes warp divergence. In the first iteration, only even-numbered threads are active. Half the threads in each warp do nothing, but they still occupy scheduling resources.

Reduction Version 2: Sequential Addressing (No Divergence)

__global__ void reduce_v2_sequential(const float* data, float* partial, int n) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (idx < n) ? data[idx] : 0.0f;
    __syncthreads();

    // Sequential addressing: contiguous threads are active
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

Now threads 0..s-1 are active in each iteration, which keeps warps fully active (no divergence within a warp) until s drops below 32.
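The divergence difference can be made concrete by counting, per tree level, how many warps contain at least one active thread. The sketch below (plain Python, not CUDA) shows that interleaved addressing keeps all 8 warps of a 256-thread block occupied for five levels, while sequential addressing retires whole warps as soon as the work shrinks:

```python
def active_warps_per_level(block_dim, scheme):
    """Warps containing at least one active thread at each tree level."""
    levels = []
    if scheme == "interleaved":            # V1: tid works when tid % (2s) == 0
        s = 1
        while s < block_dim:
            active = [t for t in range(block_dim) if t % (2 * s) == 0]
            levels.append(len({t // 32 for t in active}))
            s *= 2
    else:                                   # V2: threads 0..s-1 work
        s = block_dim // 2
        while s > 0:
            levels.append(len({t // 32 for t in range(s)}))
            s //= 2
    return levels

print(active_warps_per_level(256, "interleaved"))  # [8, 8, 8, 8, 8, 4, 2, 1]
print(active_warps_per_level(256, "sequential"))   # [4, 2, 1, 1, 1, 1, 1, 1]
```

Both schemes perform the same additions; interleaved addressing just spreads them across far more resident warps, wasting scheduler slots.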

Reduction V1 vs V2: Divergence Impact (A100, 16M floats, 256 threads/block)

Version          Bandwidth (GB/s)   % Peak   Issue
V1 Interleaved   210                10.8%    Warp divergence
V2 Sequential    380                19.6%    None

Note: Eliminating divergence nearly doubles throughput. V2 keeps entire warps active through more iterations.

Reduction Version 3: First Add During Load

Half the threads in V2 do nothing in the first iteration. Fix this by loading two elements per thread:

__global__ void reduce_v3_first_add(const float* data, float* partial, int n) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // Each thread loads and adds two elements
    float sum = 0.0f;
    if (idx < n) sum = data[idx];
    if (idx + blockDim.x < n) sum += data[idx + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

This doubles the elements per block (each block handles 2 * blockDim.x elements), cutting the number of blocks in half and improving throughput because more arithmetic is done relative to shared memory operations.
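The effect on grid size is a ceiling division away (a small sketch with illustrative sizes):

```python
def grid_size(n, block_dim, elems_per_thread):
    per_block = block_dim * elems_per_thread
    return (n + per_block - 1) // per_block   # ceiling division

n = 16_000_000
print(grid_size(n, 256, 1))  # V2: 62500 blocks
print(grid_size(n, 256, 2))  # V3: 31250 blocks, half as many partial sums
```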

Reduction Version 4: Warp-Level Unrolling

Once the shared-memory tree reaches its final levels, all remaining work fits in a single warp, so the last iterations (and their __syncthreads barriers) can be replaced with register-only warp shuffle instructions:

__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}
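__shfl_down_sync(mask, val, offset) hands each lane the val held by the lane offset positions above it. The shuffle tree can be modeled on a plain list to see why five halving steps suffice:

```python
def simulate_warp_reduce_sum(vals):
    """After five halving steps, lane 0 holds the sum of all 32 lanes."""
    assert len(vals) == 32
    lanes = list(vals)
    offset = 16
    while offset > 0:
        # On hardware an out-of-range __shfl_down_sync returns the lane's own
        # value; the model adds 0 instead, which only affects lanes that never
        # feed lane 0's result
        lanes = [lanes[i] + (lanes[i + offset] if i + offset < 32 else 0)
                 for i in range(32)]
        offset //= 2
    return lanes[0]

print(simulate_warp_reduce_sum(list(range(32))))  # 496 == sum(range(32))
```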

__global__ void reduce_v4_warp_unroll(const float* data, float* partial, int n) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    float sum = 0.0f;
    if (idx < n) sum = data[idx];
    if (idx + blockDim.x < n) sum += data[idx + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory until 32 threads remain
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // The loop's last iteration runs at s = 64, leaving 64 partials in
    // sdata[0..63]. The first warp folds in its partner at tid + 32
    // (assumes blockDim.x >= 64), then finishes with shuffles
    // (no shared memory, no __syncthreads)
    if (tid < 32) {
        float val = sdata[tid] + sdata[tid + 32];
        val = warp_reduce_sum(val);
        if (tid == 0) partial[blockIdx.x] = val;
    }
}

Reduction Version 5: Complete Warp Shuffle (One Small Shared-Memory Step)

For any block size up to 1024 threads (32 warps, hence the 32-slot warp_sums array below), the entire reduction can be done with warp shuffles plus a single small shared memory step:

__global__ void reduce_v5_full_shuffle(const float* data, float* partial, int n) {
    int tid = threadIdx.x;
    int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    int stride = blockDim.x * 2 * gridDim.x;

    // Grid-stride loop: each thread accumulates multiple elements
    float sum = 0.0f;
    for (int i = idx; i < n; i += stride) {
        sum += data[i];
        if (i + blockDim.x < n) sum += data[i + blockDim.x];
    }

    // Warp-level reduction
    sum = warp_reduce_sum(sum);

    // Warp leaders write to shared memory
    __shared__ float warp_sums[32];  // Max 32 warps per block
    int warp_id = tid / 32;
    int lane = tid % 32;

    if (lane == 0) warp_sums[warp_id] = sum;
    __syncthreads();

    // First warp reduces warp sums
    if (warp_id == 0) {
        int num_warps = blockDim.x / 32;
        sum = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        sum = warp_reduce_sum(sum);

        if (lane == 0) partial[blockIdx.x] = sum;
    }
}
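The double-load grid-stride loop has easy-to-get-wrong indexing, so it is worth checking that every element is read exactly once. A brute-force coverage model (plain Python, small illustrative launch sizes):

```python
def v5_coverage(n, block_dim, grid_dim):
    """How many times does V5's load loop read each element?"""
    counts = [0] * n
    stride = block_dim * 2 * grid_dim
    for block in range(grid_dim):
        for tid in range(block_dim):
            i = block * block_dim * 2 + tid    # idx in the kernel
            while i < n:
                counts[i] += 1                 # sum += data[i]
                if i + block_dim < n:
                    counts[i + block_dim] += 1 # sum += data[i + blockDim.x]
                i += stride
    return counts

# Every element is read exactly once, including awkward tail sizes
for n in (1, 1000, 4097):
    assert all(c == 1 for c in v5_coverage(n, block_dim=128, grid_dim=4))
print("grid-stride double-load covers every element exactly once")
```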
Reduction Versions Comparison (A100, 64M floats)

Version   Description                  Bandwidth (GB/s)   % Peak
V0        Single thread                10                 0.5%
V1        Interleaved (divergent)      210                10.8%
V2        Sequential addressing        380                19.6%
V3        First add during load        620                32.0%
V4        Warp unrolling               1180               61.0%
V5        Full shuffle + grid stride   1720               88.9%
CUB       DeviceReduce (library)       1810               93.5%

Note: Each optimization roughly doubles bandwidth. V5 reaches 89% of peak. CUB is about 5% faster due to auto-tuned tile sizes and vectorized loads.

[Bar chart: reduction bandwidth by optimization level, in GB/s. V0 serial 10, V1 divergent 210, V2 sequential 380, V3 first-add 620, V4 warp unroll 1,180, V5 full shuffle 1,720 (89% of peak), CUB 1,810.]

Multi-Block Reduction: Two-Pass vs Atomic

Two-Pass Approach

// Pass 1: each block reduces its portion
reduce_v5_full_shuffle<<<grid_size, block_size>>>(d_input, d_partial, n);

// Pass 2: reduce the partial sums. One 256-thread block handles any
// grid_size, because the kernel's grid-stride loop absorbs the extra elements
reduce_v5_full_shuffle<<<1, 256>>>(d_partial, d_result, grid_size);
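Choosing grid_size for pass 1 is a ceiling division plus an optional cap; the cap below (1024) is an illustrative choice, not a requirement, since the grid-stride loop tolerates any value:

```python
def launch_config(n, block_dim=256, max_blocks=1024):
    """Pass-1 grid size: cover n at 2 elements/thread, capped so the number
    of partials stays small; the grid-stride loop absorbs any slack."""
    per_block = block_dim * 2
    needed = (n + per_block - 1) // per_block
    return min(needed, max_blocks)

print(launch_config(64_000_000))  # 1024: capped, each block loops over more data
print(launch_config(10_000))      # 20: small input, no cap needed
```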

Atomic Approach (Single Kernel)

__global__ void reduce_atomic(const float* data, float* result, int n) {
    int tid = threadIdx.x;
    int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    int stride = blockDim.x * 2 * gridDim.x;

    float sum = 0.0f;
    for (int i = idx; i < n; i += stride) {
        sum += data[i];
        if (i + blockDim.x < n) sum += data[i + blockDim.x];
    }

    // Warp reduction
    sum = warp_reduce_sum(sum);

    // Block reduction via shared memory
    __shared__ float warp_sums[32];
    int warp_id = tid / 32;
    int lane = tid % 32;

    if (lane == 0) warp_sums[warp_id] = sum;
    __syncthreads();

    if (warp_id == 0) {
        sum = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        sum = warp_reduce_sum(sum);

        // Single atomic per block
        if (lane == 0) {
            atomicAdd(result, sum);
        }
    }
}
💡 Atomic Reduction for Simple Cases

The atomic approach is simpler (single kernel) and competitive when the grid is moderate (hundreds of blocks). With one atomicAdd per block, contention is low. Remember to zero *result (e.g., with cudaMemset) before each launch, since the kernel accumulates into it. For grids with 100,000+ blocks, the two-pass approach avoids atomic contention entirely.

Max Reduction and ArgMax

__device__ float warp_reduce_max(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        float other = __shfl_down_sync(0xffffffff, val, offset);
        val = fmaxf(val, other);
    }
    return val;
}

// ArgMax: find both the maximum value and its index
struct ValIdx {
    float val;
    int idx;
};

__device__ ValIdx warp_reduce_argmax(ValIdx vi) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        float other_val = __shfl_down_sync(0xffffffff, vi.val, offset);
        int other_idx = __shfl_down_sync(0xffffffff, vi.idx, offset);
        if (other_val > vi.val) {
            vi.val = other_val;
            vi.idx = other_idx;
        }
    }
    return vi;
}
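The pairwise shuffle logic can be modeled lane by lane; note that the strict > comparison means the lower lane's value survives ties (a sketch of the logic, not the kernel itself):

```python
def warp_reduce_argmax(pairs):
    """Lane-by-lane model of the shuffle argmax over 32 (value, index) pairs."""
    assert len(pairs) == 32
    lanes = list(pairs)
    offset = 16
    while offset > 0:
        nxt = []
        for i in range(32):
            val, idx = lanes[i]
            if i + offset < 32:   # out-of-range shuffles leave the lane as-is
                other_val, other_idx = lanes[i + offset]
                if other_val > val:   # strict '>': the lower lane wins ties
                    val, idx = other_val, other_idx
            nxt.append((val, idx))
        lanes = nxt
        offset //= 2
    return lanes[0]   # lane 0 holds the warp's winner

vals = [3.0, 9.0, 1.0, 9.0] + [0.0] * 28
print(warp_reduce_argmax([(v, i) for i, v in enumerate(vals)]))  # (9.0, 1)
```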

__global__ void argmax_kernel(const float* data, ValIdx* result, int n) {
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    ValIdx local_best = { -INFINITY, -1 };

    for (int i = idx; i < n; i += stride) {
        if (data[i] > local_best.val) {
            local_best.val = data[i];
            local_best.idx = i;
        }
    }

    // Warp reduction
    local_best = warp_reduce_argmax(local_best);

    __shared__ float sval[32];
    __shared__ int sidx[32];
    int warp_id = tid / 32;
    int lane = tid % 32;

    if (lane == 0) {
        sval[warp_id] = local_best.val;
        sidx[warp_id] = local_best.idx;
    }
    __syncthreads();

    if (warp_id == 0) {
        int num_warps = blockDim.x / 32;
        ValIdx vi;
        vi.val = (lane < num_warps) ? sval[lane] : -INFINITY;
        vi.idx = (lane < num_warps) ? sidx[lane] : -1;
        vi = warp_reduce_argmax(vi);

        if (lane == 0) {
            // Each block writes its local winner to result[blockIdx.x];
            // a second pass over the per-block results (or an atomicCAS loop
            // on a packed value/index pair) selects the global argmax
            result[blockIdx.x].val = vi.val;
            result[blockIdx.x].idx = vi.idx;
        }
    }
}

Histogram: Privatized per Warp, then Merged

Histograms are reductions where the output is not a single value but an array of bins:

#include <cuda_runtime.h>

// Naive histogram: all threads atomicAdd to global bins
__global__ void histogram_naive(const unsigned char* data,
                                 int* histogram, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = idx; i < n; i += stride) {
        atomicAdd(&histogram[data[i]], 1);
    }
}

// Optimized: per-block privatization in shared memory
__global__ void histogram_shared(const unsigned char* data,
                                  int* histogram, int n, int num_bins) {
    extern __shared__ int s_hist[];

    int tid = threadIdx.x;

    // Initialize shared memory histogram
    for (int i = tid; i < num_bins; i += blockDim.x) {
        s_hist[i] = 0;
    }
    __syncthreads();

    // Accumulate in shared memory (much lower contention)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = idx; i < n; i += stride) {
        atomicAdd(&s_hist[data[i]], 1);
    }
    __syncthreads();

    // Merge shared histogram into global
    for (int i = tid; i < num_bins; i += blockDim.x) {
        if (s_hist[i] > 0) {
            atomicAdd(&histogram[i], s_hist[i]);
        }
    }
}

// Most optimized: per-warp privatization (limits shared-memory atomic
// contention to the 32 lanes of a single warp)
__global__ void histogram_warp_private(const unsigned char* data,
                                        int* histogram, int n) {
    // Each warp has its own 256-bin histogram in shared memory
    // 256 bins * 4 bytes * (blockDim.x / 32) warps
    extern __shared__ int s_hist[];

    int tid = threadIdx.x;
    int warp_id = tid / 32;
    int lane = tid % 32;
    int num_warps = blockDim.x / 32;

    // This warp's private histogram
    int* my_hist = s_hist + warp_id * 256;

    // Initialize
    for (int i = lane; i < 256; i += 32) {
        my_hist[i] = 0;
    }
    __syncwarp();

    // Accumulate: each histogram is warp-private, but lanes within the warp
    // can still collide on the same bin, so shared-memory atomics are still required
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = idx; i < n; i += stride) {
        atomicAdd(&my_hist[data[i]], 1);  // Intra-warp contention only
    }
    __syncthreads();

    // Merge all warp histograms into global
    for (int bin = tid; bin < 256; bin += blockDim.x) {
        int sum = 0;
        for (int w = 0; w < num_warps; w++) {
            sum += s_hist[w * 256 + bin];
        }
        if (sum > 0) {
            atomicAdd(&histogram[bin], sum);
        }
    }
}
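The correctness argument for privatization is that per-warp counts merge losslessly: summing bin b across all private histograms yields exactly the global count. A small single-block model (my own sketch; elements are assigned round-robin, matching the kernel's grid-stride loop with one block):

```python
import random
from collections import Counter

def histogram_warp_private(data, block_dim=256):
    """Per-warp private histograms, merged bin-by-bin at the end."""
    num_warps = block_dim // 32
    private = [Counter() for _ in range(num_warps)]
    # One block, grid-stride assignment: element i goes to thread i % block_dim,
    # which lives in warp (i % block_dim) // 32
    for i, byte in enumerate(data):
        private[(i % block_dim) // 32][byte] += 1
    # Merge phase: sum each bin across all private copies
    return [sum(p[b] for p in private) for b in range(256)]

random.seed(0)
data = [random.randrange(256) for _ in range(10_000)]
assert histogram_warp_private(data) == [data.count(b) for b in range(256)]
print("per-warp histograms merge to the exact global histogram")
```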
256-Bin Histogram: Optimization Levels (A100, 256M bytes)

Version                      Bandwidth (GB/s)   Time (ms)   Speedup
Naive (global atomics)       42                 5.8         1.0x
Per-block (shared atomics)   380                0.65        8.9x
Per-warp private             520                0.47        12.3x
CUB DeviceHistogram          580                0.42        13.8x

Note: Per-warp privatization reduces contention from thousands of threads to at most 32. The shared memory cost is 256 * 4 * num_warps bytes per block.

[Bar chart: histogram throughput by optimization level, in GB/s. Naive 42, per-block 380, per-warp 520 (12x over naive), CUB 580.]

Reduction with Vectorized Loads

Combining vectorized loads (float4) with warp reduction for maximum bandwidth:

__global__ void reduce_vectorized(const float* __restrict__ data,
                                   float* result, int n) {
    const float4* data4 = reinterpret_cast<const float4*>(data);
    int n4 = n / 4;
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    float sum = 0.0f;

    // Vectorized grid-stride loop
    for (int i = idx; i < n4; i += stride) {
        float4 val = data4[i];
        sum += val.x + val.y + val.z + val.w;
    }

    // Handle remainder
    int rem_start = n4 * 4;
    for (int i = rem_start + idx; i < n; i += stride) {
        sum += data[i];
    }

    // Warp reduction
    sum = warp_reduce_sum(sum);

    __shared__ float warp_sums[32];
    int warp_id = threadIdx.x / 32;
    int lane = threadIdx.x % 32;

    if (lane == 0) warp_sums[warp_id] = sum;
    __syncthreads();

    if (warp_id == 0) {
        sum = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        sum = warp_reduce_sum(sum);
        if (lane == 0) atomicAdd(result, sum);
    }
}
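As with V5, the split between the float4 loop and the scalar remainder loop is worth verifying: together they must touch every element exactly once, for any n. A brute-force check with small illustrative launch sizes:

```python
def vectorized_indices(n, block_dim, grid_dim):
    """Count how often each element is read by the float4 + remainder loops."""
    counts = [0] * n
    stride = block_dim * grid_dim
    n4 = n // 4
    for idx in range(stride):             # one pass per (block, thread) slot
        for i in range(idx, n4, stride):  # float4 loop: 4 scalars per iteration
            for k in range(4):
                counts[4 * i + k] += 1
        for i in range(n4 * 4 + idx, n, stride):  # scalar remainder loop
            counts[i] += 1
    return counts

for n in (16, 17, 18, 19, 1003):
    assert all(c == 1 for c in vectorized_indices(n, 64, 2))
print("float4 loop + remainder loop cover every element exactly once")
```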

CUB Library: Production-Quality Reductions

For production code, use CUB (CUDA UnBound) or Thrust:

#include <cub/cub.cuh>

void cub_reduction(const float* d_input, float* d_output, int n) {
    // Step 1: Query temp storage size
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_input, d_output, n);

    // Step 2: Allocate temp storage
    cudaMalloc(&d_temp, temp_bytes);

    // Step 3: Run reduction
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_input, d_output, n);

    cudaFree(d_temp);
}

// Block-level CUB reduction (composable within your own kernels)
__global__ void kernel_with_cub_block_reduce(const float* data,
                                              float* block_results, int n) {
    typedef cub::BlockReduce<float, 256> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? data[idx] : 0.0f;

    // Block-level sum
    float block_sum = BlockReduce(temp_storage).Sum(val);

    if (threadIdx.x == 0) {
        block_results[blockIdx.x] = block_sum;
    }
}

// Warp-level CUB reduction
__global__ void kernel_with_cub_warp_reduce(const float* data,
                                             float* warp_results, int n) {
    typedef cub::WarpReduce<float> WarpReduce;
    __shared__ typename WarpReduce::TempStorage temp_storage[8];  // 8 warps (assumes blockDim.x == 256)

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / 32;
    float val = (idx < n) ? data[idx] : 0.0f;

    float warp_sum = WarpReduce(temp_storage[warp_id]).Sum(val);

    if (threadIdx.x % 32 == 0) {
        int global_warp = blockIdx.x * (blockDim.x / 32) + warp_id;
        warp_results[global_warp] = warp_sum;
    }
}
💡 Use CUB for Production, Write Custom for Learning

CUB’s reductions are auto-tuned per architecture, handle all edge cases, and use vectorized loads. For production code, cub::DeviceReduce::Sum, cub::DeviceReduce::Max, and cub::DeviceReduce::ArgMax are the right choice. Write custom reductions when you need to fuse reduction with other operations (e.g., reduce + scale + bias in a single kernel).

Template-Based Generic Reduction

A reusable reduction that works with any associative operator:

template <typename T, typename Op>
__device__ T warp_reduce(T val, Op op) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        T other = __shfl_down_sync(0xffffffff, val, offset);
        val = op(val, other);
    }
    return val;
}

template <typename T, typename Op>
__global__ void generic_reduce(const T* __restrict__ data,
                                T* result, int n, Op op, T identity) {
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    T acc = identity;
    for (int i = idx; i < n; i += stride) {
        acc = op(acc, data[i]);
    }

    acc = warp_reduce(acc, op);

    __shared__ T warp_results[32];
    int warp_id = tid / 32;
    int lane = tid % 32;

    if (lane == 0) warp_results[warp_id] = acc;
    __syncthreads();

    if (warp_id == 0) {
        acc = (lane < blockDim.x / 32) ? warp_results[lane] : identity;
        acc = warp_reduce(acc, op);
        if (lane == 0) atomicAdd(result, acc);  // Only works for sum
    }
}

// Usage:
struct MaxOp {
    __device__ float operator()(float a, float b) { return fmaxf(a, b); }
};

struct SumOp {
    __device__ float operator()(float a, float b) { return a + b; }
};

// generic_reduce<<<grid, block>>>(d_data, d_result, n, SumOp{}, 0.0f);
// For MaxOp, the final atomicAdd is incorrect: replace it with a per-block
// write plus a second pass (or an atomicCAS loop) before calling:
// generic_reduce<<<grid, block>>>(d_data, d_result, n, MaxOp{}, -INFINITY);
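The shuffle tree generalizes to any associative operator; a quick model (names are my own) demonstrates it with both sum and max:

```python
def warp_reduce(vals, op, identity):
    """Model of the templated shuffle tree for any associative operator."""
    lanes = list(vals) + [identity] * (32 - len(vals))  # pad idle lanes
    offset = 16
    while offset > 0:
        lanes = [op(lanes[i], lanes[i + offset]) if i + offset < 32 else lanes[i]
                 for i in range(32)]
        offset //= 2
    return lanes[0]

data = [5, -2, 17, 3, 17, 0]
print(warp_reduce(data, lambda a, b: a + b, 0))   # 40
print(warp_reduce(data, max, float("-inf")))      # 17
```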

Summary

Parallel reduction on CUDA progresses through clear optimization levels: (1) sequential addressing eliminates divergence, (2) first-add-during-load doubles elements per block, (3) warp shuffle replaces shared memory for the final 32 threads, (4) grid-stride loops with vectorized loads maximize memory bandwidth. The fully optimized version reaches 88-93% of peak memory bandwidth because reduction is entirely memory-bound — the computation (one addition per loaded element) is trivial compared to the data movement. For histograms, per-warp privatization reduces atomic contention by orders of magnitude. In production, CUB provides auto-tuned implementations; write custom reductions when fusing with other operations.