Summing 100 million floats on a CPU takes 100 million sequential additions. On a GPU with 10,000 threads, the naive approach takes 10,000 additions per thread but serializes at the end — no faster than the CPU. The optimized approach uses three-level hierarchical reduction: warps reduce 32 values to 1 using shuffle instructions (5 operations, zero memory), blocks reduce across warps using shared memory (one sync barrier per tree level), and the grid reduces across blocks using atomics (one atomic per block). Total work: 100 million reads + 100 million additions + 3,125 atomics. Throughput: 1.8 TB/s on an A100, limited only by memory bandwidth.
All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise.
Reduction Version 0: Single Thread (Baseline)
#include <cuda_runtime.h>
// Terrible: single thread reads all N elements
__global__ void reduce_v0_single_thread(const float* data, float* result, int n) {
if (threadIdx.x != 0 || blockIdx.x != 0) return;
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += data[i];
}
*result = sum;
}
This runs at approximately 10 GB/s on an A100 — about 0.5% of peak bandwidth. Every subsequent version improves on this.
Reduction Version 1: Interleaved Addressing (Divergent)
// Each thread block reduces its portion using shared memory
__global__ void reduce_v1_interleaved(const float* data, float* partial, int n) {
extern __shared__ float sdata[];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = (idx < n) ? data[idx] : 0.0f;
__syncthreads();
// Interleaved addressing: thread i adds element at i + stride
for (int s = 1; s < blockDim.x; s *= 2) {
if (tid % (2 * s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid == 0) partial[blockIdx.x] = sdata[0];
}
Problem: the tid % (2 * s) == 0 check causes warp divergence. In the first iteration, only even-numbered threads are active. Half the threads in each warp do nothing, but they still occupy scheduling resources.
Reduction Version 2: Sequential Addressing (No Divergence)
__global__ void reduce_v2_sequential(const float* data, float* partial, int n) {
extern __shared__ float sdata[];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = (idx < n) ? data[idx] : 0.0f;
__syncthreads();
// Sequential addressing: contiguous threads are active
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid == 0) partial[blockIdx.x] = sdata[0];
}
Now threads 0..s-1 are active in each iteration, which keeps warps fully active (no divergence within a warp) until s drops below 32.
Reduction V1 vs V2: Divergence Impact (A100, 16M floats, 256 threads/block)
| Version | Bandwidth (GB/s) | % Peak | Issue |
|---|---|---|---|
| V1 Interleaved | 210 | 10.8% | Warp divergence |
| V2 Sequential | 380 | 19.6% | None |
Reduction Version 3: First Add During Load
Half the threads in V2 do nothing in the first iteration. Fix this by loading two elements per thread:
__global__ void reduce_v3_first_add(const float* data, float* partial, int n) {
extern __shared__ float sdata[];
int tid = threadIdx.x;
int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
// Each thread loads and adds two elements
float sum = 0.0f;
if (idx < n) sum = data[idx];
if (idx + blockDim.x < n) sum += data[idx + blockDim.x];
sdata[tid] = sum;
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid == 0) partial[blockIdx.x] = sdata[0];
}
This doubles the elements per block (each block handles 2 * blockDim.x elements), halving the number of blocks. Throughput improves because the first level of the reduction now happens in registers during the load, so less of the work goes through shared memory and __syncthreads().
Reduction Version 4: Warp-Level Unrolling
When s drops below 32, all remaining active threads are in a single warp. We can replace the shared memory tree with warp shuffle instructions:
__device__ float warp_reduce_sum(float val) {
for (int offset = 16; offset > 0; offset >>= 1) {
val += __shfl_down_sync(0xffffffff, val, offset);
}
return val;
}
__global__ void reduce_v4_warp_unroll(const float* data, float* partial, int n) {
extern __shared__ float sdata[];
int tid = threadIdx.x;
int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
float sum = 0.0f;
if (idx < n) sum = data[idx];
if (idx + blockDim.x < n) sum += data[idx + blockDim.x];
sdata[tid] = sum;
__syncthreads();
// Tree reduction in shared memory until 64 partial sums remain
for (int s = blockDim.x / 2; s > 32; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// The loop stops at s = 64, leaving 64 partial sums: fold the upper 32
// into the lower 32 during the load, then finish with shuffles
// (no shared memory, no __syncthreads). Assumes blockDim.x >= 64.
if (tid < 32) {
float val = sdata[tid] + sdata[tid + 32];
val = warp_reduce_sum(val);
if (tid == 0) partial[blockIdx.x] = val;
}
}
Reduction Version 5: Complete Warp Shuffle (No Shared Memory for Small Blocks)
With warp shuffles plus one small shared-memory handoff, the shared-memory tree disappears entirely. The 32-entry warp_sums array below covers any block size up to 1024 threads (32 warps); the benchmarks here use 256-thread blocks (8 warps):
__global__ void reduce_v5_full_shuffle(const float* data, float* partial, int n) {
int tid = threadIdx.x;
int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
int stride = blockDim.x * 2 * gridDim.x;
// Grid-stride loop: each thread accumulates multiple elements
float sum = 0.0f;
for (int i = idx; i < n; i += stride) {
sum += data[i];
if (i + blockDim.x < n) sum += data[i + blockDim.x];
}
// Warp-level reduction
sum = warp_reduce_sum(sum);
// Warp leaders write to shared memory
__shared__ float warp_sums[32]; // Max 32 warps per block
int warp_id = tid / 32;
int lane = tid % 32;
if (lane == 0) warp_sums[warp_id] = sum;
__syncthreads();
// First warp reduces warp sums
if (warp_id == 0) {
int num_warps = blockDim.x / 32;
sum = (lane < num_warps) ? warp_sums[lane] : 0.0f;
sum = warp_reduce_sum(sum);
if (lane == 0) partial[blockIdx.x] = sum;
}
}
Reduction Versions Comparison (A100, 64M floats)
| Version | Description | Bandwidth (GB/s) | % Peak |
|---|---|---|---|
| V0 | Single thread | 10 | 0.5% |
| V1 | Interleaved (divergent) | 210 | 10.8% |
| V2 | Sequential addressing | 380 | 19.6% |
| V3 | First add during load | 620 | 32.0% |
| V4 | Warp unrolling | 1180 | 61.0% |
| V5 | Full shuffle + grid stride | 1720 | 88.9% |
| CUB DeviceReduce | Library | 1810 | 93.5% |
Reduction Bandwidth by Optimization Level (GB/s)
Multi-Block Reduction: Two-Pass vs Atomic
Two-Pass Approach
// Pass 1: each block reduces its portion
reduce_v5_full_shuffle<<<grid_size, block_size>>>(d_input, d_partial, n);
// Pass 2: reduce the partial sums with a single block (its grid-stride loop handles any grid_size)
reduce_v5_full_shuffle<<<1, 256>>>(d_partial, d_result, grid_size);
Atomic Approach (Single Kernel)
__global__ void reduce_atomic(const float* data, float* result, int n) {
int tid = threadIdx.x;
int idx = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
int stride = blockDim.x * 2 * gridDim.x;
float sum = 0.0f;
for (int i = idx; i < n; i += stride) {
sum += data[i];
if (i + blockDim.x < n) sum += data[i + blockDim.x];
}
// Warp reduction
sum = warp_reduce_sum(sum);
// Block reduction via shared memory
__shared__ float warp_sums[32];
int warp_id = tid / 32;
int lane = tid % 32;
if (lane == 0) warp_sums[warp_id] = sum;
__syncthreads();
if (warp_id == 0) {
sum = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
sum = warp_reduce_sum(sum);
// Single atomic per block
if (lane == 0) {
atomicAdd(result, sum);
}
}
}
The atomic approach is simpler (single kernel) and competitive when the grid is moderate (hundreds of blocks). With one atomicAdd per block, contention is low. For grids with 100,000+ blocks, the two-pass approach avoids atomic contention entirely.
Max Reduction and ArgMax
__device__ float warp_reduce_max(float val) {
for (int offset = 16; offset > 0; offset >>= 1) {
float other = __shfl_down_sync(0xffffffff, val, offset);
val = fmaxf(val, other);
}
return val;
}
// ArgMax: find both the maximum value and its index
struct ValIdx {
float val;
int idx;
};
__device__ ValIdx warp_reduce_argmax(ValIdx vi) {
for (int offset = 16; offset > 0; offset >>= 1) {
float other_val = __shfl_down_sync(0xffffffff, vi.val, offset);
int other_idx = __shfl_down_sync(0xffffffff, vi.idx, offset);
if (other_val > vi.val) {
vi.val = other_val;
vi.idx = other_idx;
}
}
return vi;
}
__global__ void argmax_kernel(const float* data, ValIdx* result, int n) {
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
ValIdx local_best = { -INFINITY, -1 };
for (int i = idx; i < n; i += stride) {
if (data[i] > local_best.val) {
local_best.val = data[i];
local_best.idx = i;
}
}
// Warp reduction
local_best = warp_reduce_argmax(local_best);
__shared__ float sval[32];
__shared__ int sidx[32];
int warp_id = tid / 32;
int lane = tid % 32;
if (lane == 0) {
sval[warp_id] = local_best.val;
sidx[warp_id] = local_best.idx;
}
__syncthreads();
if (warp_id == 0) {
int num_warps = blockDim.x / 32;
ValIdx vi;
vi.val = (lane < num_warps) ? sval[lane] : -INFINITY;
vi.idx = (lane < num_warps) ? sidx[lane] : -1;
vi = warp_reduce_argmax(vi);
if (lane == 0) {
// Write this block's best to the per-block results array; a second
// pass (or a host-side scan of gridDim.x entries) finds the global
// argmax. A true single-kernel version would need an atomicCAS loop
// on a packed 64-bit (value, index) word.
result[blockIdx.x].val = vi.val;
result[blockIdx.x].idx = vi.idx;
}
}
}
Histogram: Privatized per Warp, then Merged
Histograms are reductions where the output is not a single value but an array of bins:
#include <cuda_runtime.h>
// Naive histogram: all threads atomicAdd to global bins
__global__ void histogram_naive(const unsigned char* data,
int* histogram, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n; i += stride) {
atomicAdd(&histogram[data[i]], 1);
}
}
// Optimized: per-block privatization in shared memory
__global__ void histogram_shared(const unsigned char* data,
int* histogram, int n, int num_bins) {
extern __shared__ int s_hist[];
int tid = threadIdx.x;
// Initialize shared memory histogram
for (int i = tid; i < num_bins; i += blockDim.x) {
s_hist[i] = 0;
}
__syncthreads();
// Accumulate in shared memory (much lower contention)
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n; i += stride) {
atomicAdd(&s_hist[data[i]], 1);
}
__syncthreads();
// Merge shared histogram into global
for (int i = tid; i < num_bins; i += blockDim.x) {
if (s_hist[i] > 0) {
atomicAdd(&histogram[i], s_hist[i]);
}
}
}
// Most optimized: per-warp privatization (avoids shared memory atomics)
__global__ void histogram_warp_private(const unsigned char* data,
int* histogram, int n) {
// Each warp has its own 256-bin histogram in shared memory
// 256 bins * 4 bytes * (blockDim.x / 32) warps
extern __shared__ int s_hist[];
int tid = threadIdx.x;
int warp_id = tid / 32;
int lane = tid % 32;
int num_warps = blockDim.x / 32;
// This warp's private histogram
int* my_hist = s_hist + warp_id * 256;
// Initialize
for (int i = lane; i < 256; i += 32) {
my_hist[i] = 0;
}
__syncwarp();
// Accumulate. Privatization removes inter-warp contention, but multiple
// lanes of the same warp can still hit the same bin in the same step,
// so the adds must stay atomic; contention is now intra-warp only.
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n; i += stride) {
atomicAdd(&my_hist[data[i]], 1); // Intra-warp contention only
}
__syncthreads();
// Merge all warp histograms into global
for (int bin = tid; bin < 256; bin += blockDim.x) {
int sum = 0;
for (int w = 0; w < num_warps; w++) {
sum += s_hist[w * 256 + bin];
}
if (sum > 0) {
atomicAdd(&histogram[bin], sum);
}
}
}
256-Bin Histogram: Optimization Levels (A100, 256M bytes)
| Version | Bandwidth (GB/s) | Time (ms) | Speedup |
|---|---|---|---|
| Naive (global atomics) | 42 | 5.8 | 1.0x |
| Per-block (shared atomics) | 380 | 0.65 | 8.9x |
| Per-warp private | 520 | 0.47 | 12.3x |
| CUB DeviceHistogram | 580 | 0.42 | 13.8x |
Histogram Throughput by Optimization Level (GB/s)
Reduction with Vectorized Loads
Combining vectorized loads (float4) with warp reduction for maximum bandwidth:
__global__ void reduce_vectorized(const float* __restrict__ data,
float* result, int n) {
const float4* data4 = reinterpret_cast<const float4*>(data);
int n4 = n / 4;
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
float sum = 0.0f;
// Vectorized grid-stride loop
for (int i = idx; i < n4; i += stride) {
float4 val = data4[i];
sum += val.x + val.y + val.z + val.w;
}
// Handle remainder
int rem_start = n4 * 4;
for (int i = rem_start + idx; i < n; i += stride) {
sum += data[i];
}
// Warp reduction
sum = warp_reduce_sum(sum);
__shared__ float warp_sums[32];
int warp_id = threadIdx.x / 32;
int lane = threadIdx.x % 32;
if (lane == 0) warp_sums[warp_id] = sum;
__syncthreads();
if (warp_id == 0) {
sum = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
sum = warp_reduce_sum(sum);
if (lane == 0) atomicAdd(result, sum);
}
}
CUB Library: Production-Quality Reductions
For production code, use CUB (CUDA UnBound) or Thrust:
#include <cub/cub.cuh>
void cub_reduction(const float* d_input, float* d_output, int n) {
// Step 1: Query temp storage size
void* d_temp = nullptr;
size_t temp_bytes = 0;
cub::DeviceReduce::Sum(d_temp, temp_bytes, d_input, d_output, n);
// Step 2: Allocate temp storage
cudaMalloc(&d_temp, temp_bytes);
// Step 3: Run reduction
cub::DeviceReduce::Sum(d_temp, temp_bytes, d_input, d_output, n);
cudaFree(d_temp);
}
// Block-level CUB reduction (composable within your own kernels)
__global__ void kernel_with_cub_block_reduce(const float* data,
float* block_results, int n) {
typedef cub::BlockReduce<float, 256> BlockReduce;
__shared__ typename BlockReduce::TempStorage temp_storage;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = (idx < n) ? data[idx] : 0.0f;
// Block-level sum
float block_sum = BlockReduce(temp_storage).Sum(val);
if (threadIdx.x == 0) {
block_results[blockIdx.x] = block_sum;
}
}
// Warp-level CUB reduction
__global__ void kernel_with_cub_warp_reduce(const float* data,
float* warp_results, int n) {
typedef cub::WarpReduce<float> WarpReduce;
__shared__ typename WarpReduce::TempStorage temp_storage[8]; // One per warp; assumes 256-thread blocks (8 warps)
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int warp_id = threadIdx.x / 32;
float val = (idx < n) ? data[idx] : 0.0f;
float warp_sum = WarpReduce(temp_storage[warp_id]).Sum(val);
if (threadIdx.x % 32 == 0) {
int global_warp = blockIdx.x * (blockDim.x / 32) + warp_id;
warp_results[global_warp] = warp_sum;
}
}
CUB’s reductions are auto-tuned per architecture, handle all edge cases, and use vectorized loads. For production code, cub::DeviceReduce::Sum, cub::DeviceReduce::Max, and cub::DeviceReduce::ArgMax are the right choice. Write custom reductions when you need to fuse reduction with other operations (e.g., reduce + scale + bias in a single kernel).
Template-Based Generic Reduction
A reusable reduction that works with any associative operator:
template <typename T, typename Op>
__device__ T warp_reduce(T val, Op op) {
for (int offset = 16; offset > 0; offset >>= 1) {
T other = __shfl_down_sync(0xffffffff, val, offset);
val = op(val, other);
}
return val;
}
template <typename T, typename Op>
__global__ void generic_reduce(const T* __restrict__ data,
T* result, int n, Op op, T identity) {
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
T acc = identity;
for (int i = idx; i < n; i += stride) {
acc = op(acc, data[i]);
}
acc = warp_reduce(acc, op);
__shared__ T warp_results[32];
int warp_id = tid / 32;
int lane = tid % 32;
if (lane == 0) warp_results[warp_id] = acc;
__syncthreads();
if (warp_id == 0) {
acc = (lane < blockDim.x / 32) ? warp_results[lane] : identity;
acc = warp_reduce(acc, op);
if (lane == 0) atomicAdd(result, acc); // Correct only for sum; other ops need per-block partials and a second pass
}
}
// Usage:
struct MaxOp {
__device__ float operator()(float a, float b) { return fmaxf(a, b); }
};
struct SumOp {
__device__ float operator()(float a, float b) { return a + b; }
};
// generic_reduce<<<grid, block>>>(d_data, d_result, n, SumOp{}, 0.0f);
// generic_reduce<<<grid, block>>>(d_data, d_result, n, MaxOp{}, -INFINITY);
Summary
Parallel reduction on CUDA progresses through clear optimization levels: (1) sequential addressing eliminates divergence, (2) first-add-during-load doubles elements per block, (3) warp shuffle replaces shared memory for the final 32 threads, (4) grid-stride loops with vectorized loads maximize memory bandwidth. The fully optimized version reaches 88-93% of peak memory bandwidth because reduction is entirely memory-bound — the computation (one addition per loaded element) is trivial compared to the data movement. For histograms, per-warp privatization reduces atomic contention by orders of magnitude. In production, CUB provides auto-tuned implementations; write custom reductions when fusing with other operations.