A histogram kernel with 10,000 threads all calling atomicAdd(&bin[x], 1) achieves 2 GB/s on an A100. The bottleneck is atomic contention: when multiple threads atomic-add to the same bin, the hardware serializes the operations. Each atomic costs roughly 400 cycles, and with heavy contention, effective throughput collapses. Restructure with hierarchical reduction (warp-level reduction using shuffles, block-level reduction using shared memory, then one atomic per block) and throughput jumps to 180 GB/s. The 90x speedup comes from replacing 10,000 contended global atomics with 10,000 register operations, ~320 shared memory operations, and ~320 global atomics.
The solution is hierarchical reduction: first reduce within a warp using shuffle instructions, then across warps within a block using shared memory, and finally across blocks using a small number of global atomics. This post covers atomic operations, their performance characteristics, and complete implementations of histogram and parallel prefix sum.
Atomic Operations: Performance Characteristics
atomicAdd Throughput Under Contention
#include <cuda_runtime.h>
#include <cstdio>
// Benchmark atomicAdd to a single address (worst case: max contention)
__global__ void atomic_single_address(int* counter, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
atomicAdd(counter, 1);
}
}
// Benchmark atomicAdd to many addresses (best case: no contention)
__global__ void atomic_distributed(int* counters, int n, int num_bins) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
int bin = idx % num_bins;
atomicAdd(&counters[bin], 1);
}
}
// Host benchmark
void benchmark_atomics() {
int n = 1 << 20; // 1M threads
int* d_counter;
cudaMalloc(&d_counter, sizeof(int));
cudaMemset(d_counter, 0, sizeof(int));
int block = 256;
int grid = (n + block - 1) / block;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Single address (maximum contention)
cudaEventRecord(start);
atomic_single_address<<<grid, block>>>(d_counter, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms_single;
cudaEventElapsedTime(&ms_single, start, stop);
// Distributed across 1024 bins
int* d_bins;
cudaMalloc(&d_bins, 1024 * sizeof(int));
cudaMemset(d_bins, 0, 1024 * sizeof(int));
cudaEventRecord(start);
atomic_distributed<<<grid, block>>>(d_bins, n, 1024);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms_distributed;
cudaEventElapsedTime(&ms_distributed, start, stop);
printf("1M atomicAdd to 1 address: %.3f ms\n", ms_single);
printf("1M atomicAdd to 1024 bins: %.3f ms\n", ms_distributed);
printf("Speedup from distribution: %.1fx\n",
ms_single / ms_distributed);
cudaFree(d_counter);
cudaFree(d_bins);
cudaEventDestroy(start);
cudaEventDestroy(stop);
}
atomicAdd Throughput vs Number of Contended Addresses (A100)
| Target Addresses | Time (1M ops) | Throughput (Gops/s) | Relative |
|---|---|---|---|
| 1 | 3.2 ms | 0.33 | 1.0x (worst) |
| 32 | 0.42 ms | 2.5 | 7.6x |
| 256 | 0.11 ms | 9.5 | 29x |
| 1024 | 0.048 ms | 21.8 | 66x |
| 65536 | 0.021 ms | 49.8 | 151x |
| 1048576 (1 per thread) | 0.018 ms | 58.1 | 176x (best) |
Atomic Types and Their Costs
// Available atomic operations and their hardware support
void atomic_catalog() {
// Integer atomics (all architectures, fast L2 path)
// atomicAdd, atomicSub, atomicMin, atomicMax,
// atomicAnd, atomicOr, atomicXor, atomicCAS, atomicExch
// FP32 atomicAdd (Kepler+)
// Uses compare-and-swap internally on older architectures
// Native hardware unit on Volta+ (fast)
// FP16 atomicAdd (Volta+, sm_70)
// Native hardware support for half-precision atomic add
// FP64 atomicAdd (Pascal+)
// Much slower than FP32 on most architectures
// atomicCAS (Compare-And-Swap) -- universal building block
// Can implement any atomic operation as a CAS loop:
// do {
// old = *addr;
// assumed = old;
// new_val = f(old, my_value);
// } while (atomicCAS(addr, assumed, new_val) != assumed);
}
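The CAS loop in the comment above is the universal building block: any read-modify-write can be expressed as "read, compute, attempt to install, retry on conflict." A minimal CPU sketch of the same pattern, with a one-element list standing in for the memory cell and a hypothetical `cas` helper playing the role of atomicCAS:

```python
def cas(cell, expected, new_val):
    """Emulate atomicCAS: if the cell holds `expected`, install `new_val`.
    Returns the value the cell held before the call."""
    old = cell[0]
    if old == expected:
        cell[0] = new_val
    return old

def atomic_apply(cell, my_value, f):
    """Build an arbitrary atomic read-modify-write from CAS,
    mirroring the device-side loop in the comment above."""
    while True:
        assumed = cell[0]
        new_val = f(assumed, my_value)
        if cas(cell, assumed, new_val) == assumed:
            return assumed  # success: new_val was installed

# Example: an "atomic max" built from CAS
cell = [10]
atomic_apply(cell, 42, max)
print(cell[0])  # 42
```

On the GPU the retry loop matters because another thread may update the cell between the read and the CAS; under heavy contention each thread may loop several times, which is exactly why CAS-based atomics degrade faster than native ones.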
Warp-Level Reduction
Shuffle-Based Reduction
Warp shuffle instructions (__shfl_xor_sync, __shfl_down_sync) exchange register values between threads within a warp at register-file speed (single cycle latency). No shared memory required.
// Warp-level sum reduction using butterfly pattern
__device__ float warp_reduce_sum(float val) {
// Butterfly reduction: each step halves the number of
// active participants
#pragma unroll
for (int offset = warpSize / 2; offset > 0; offset /= 2) {
val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
}
return val; // All threads in the warp have the final sum
}
// Warp-level max reduction
__device__ float warp_reduce_max(float val) {
#pragma unroll
for (int offset = warpSize / 2; offset > 0; offset /= 2) {
val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, offset));
}
return val;
}
// Warp-level sum with mask (for partial warps)
__device__ float warp_reduce_sum_masked(float val, unsigned mask) {
for (int offset = warpSize / 2; offset > 0; offset /= 2) {
val += __shfl_xor_sync(mask, val, offset);
}
return val;
}
Why Warp Reductions Are Fast
def warp_reduction_analysis():
"""Analyze warp reduction cost."""
# __shfl_xor_sync: 1 cycle latency (register-to-register)
# 5 steps for 32 threads (log2(32) = 5)
# Total: 5 cycles per warp
# Compare to shared memory reduction:
# Shared memory write: ~20 cycles
# __syncthreads: ~20 cycles
# Shared memory read: ~20 cycles
# Repeat log2(32) = 5 times
# Total: ~300 cycles per warp
print("Warp shuffle reduction: ~5 cycles")
print("Shared memory reduction: ~300 cycles")
print("Speedup: ~60x")
print()
print("But shared memory handles arbitrary block sizes,")
print("while shuffles only work within a single warp (32 threads).")
Block-Level Reduction
Combining Warp and Block Reductions
For blocks larger than one warp, reduce within each warp first, then combine warp results using shared memory:
// Block-level sum reduction (up to 1024 threads per block)
template <int BLOCK_SIZE>
__device__ float block_reduce_sum(float val) {
// Step 1: Warp-level reduction
val = warp_reduce_sum(val);
// Step 2: Store warp results to shared memory
__shared__ float warp_results[32]; // Max 32 warps per block
int warp_id = threadIdx.x / warpSize;
int lane = threadIdx.x % warpSize;
if (lane == 0) {
warp_results[warp_id] = val;
}
__syncthreads();
// Step 3: First warp reduces across all warp results
constexpr int NUM_WARPS = BLOCK_SIZE / 32;
if (warp_id == 0) {
val = (lane < NUM_WARPS) ? warp_results[lane] : 0.0f;
val = warp_reduce_sum(val);
}
return val; // Valid in all lanes of warp 0; other warps hold stale values
}
// Block-level max reduction
template <int BLOCK_SIZE>
__device__ float block_reduce_max(float val) {
val = warp_reduce_max(val);
__shared__ float warp_results[32];
int warp_id = threadIdx.x / warpSize;
int lane = threadIdx.x % warpSize;
if (lane == 0) {
warp_results[warp_id] = val;
}
__syncthreads();
constexpr int NUM_WARPS = BLOCK_SIZE / 32;
if (warp_id == 0) {
val = (lane < NUM_WARPS) ? warp_results[lane] : -INFINITY;
val = warp_reduce_max(val);
}
return val;
}
Using Block Reduction for Normalization
// Fused RMSNorm using block-level reduction
__global__ void rmsnorm_kernel(
float* __restrict__ output,
const float* __restrict__ input,
const float* __restrict__ weight,
int hidden_dim,
float eps
) {
// One block per row (token)
int row = blockIdx.x;
const float* x = input + row * hidden_dim;
float* y = output + row * hidden_dim;
// Each thread accumulates partial sum-of-squares
float local_ss = 0.0f;
for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
float val = x[i];
local_ss += val * val;
}
// Block-level reduction
float ss = block_reduce_sum<256>(local_ss);
// Broadcast result
__shared__ float rms_scale;
if (threadIdx.x == 0) {
rms_scale = rsqrtf(ss / hidden_dim + eps);
}
__syncthreads();
// Apply normalization
for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
y[i] = x[i] * rms_scale * weight[i];
}
}
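A host-side reference is useful for validating the kernel. The NumPy sketch below mirrors the kernel's math (per-row sum of squares, reciprocal square root, scale by weight); the function name `rmsnorm_ref` is illustrative, not part of any library:

```python
import numpy as np

def rmsnorm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm, matching the kernel: one row per token."""
    ss = np.sum(x * x, axis=-1, keepdims=True)      # sum of squares per row
    scale = 1.0 / np.sqrt(ss / x.shape[-1] + eps)   # rsqrtf(ss / hidden_dim + eps)
    return x * scale * weight

x = np.random.randn(4, 256).astype(np.float32)
w = np.ones(256, dtype=np.float32)
y = rmsnorm_ref(x, w)
# Each normalized row should have RMS close to 1
print(np.sqrt(np.mean(y * y, axis=-1)))
```

Comparing the device output against this reference with a small tolerance (e.g. 1e-5 relative) catches both reduction bugs and broadcast bugs in the kernel.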
Multi-Block Reduction
The Two-Phase Approach
When the data is too large for a single block, you need a multi-block reduction. The standard approach uses two kernel launches:
// Phase 1: Each block reduces a portion of the data
__global__ void reduce_phase1(
const float* __restrict__ input,
float* __restrict__ partial_sums,
int n
) {
float local_sum = 0.0f;
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
// Grid-stride loop: each thread handles multiple elements
for (int i = tid; i < n; i += stride) {
local_sum += input[i];
}
// Block-level reduction
local_sum = block_reduce_sum<256>(local_sum);
// Write block result to global memory
if (threadIdx.x == 0) {
partial_sums[blockIdx.x] = local_sum;
}
}
// Phase 2: Single block reduces partial sums
__global__ void reduce_phase2(
const float* __restrict__ partial_sums,
float* __restrict__ result,
int num_blocks
) {
float local_sum = 0.0f;
for (int i = threadIdx.x; i < num_blocks; i += blockDim.x) {
local_sum += partial_sums[i];
}
local_sum = block_reduce_sum<256>(local_sum);
if (threadIdx.x == 0) {
result[0] = local_sum;
}
}
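The two-phase structure is easy to check against a sequential sum. This CPU sketch mirrors it directly: each "block" reduces a fixed-size chunk in phase 1, and phase 2 reduces the partial sums:

```python
def two_phase_reduce(data, block_size=256):
    """CPU mirror of the two-phase reduction: phase 1 produces one
    partial sum per 'block', phase 2 reduces the partials."""
    partial_sums = [
        sum(data[i:i + block_size])          # phase 1: block-level reduction
        for i in range(0, len(data), block_size)
    ]
    return sum(partial_sums)                 # phase 2: reduce the partials

data = list(range(10_000))
print(two_phase_reduce(data) == sum(data))  # True
```

On the GPU the split exists because blocks cannot synchronize with each other inside a normal kernel launch; the kernel boundary between phase 1 and phase 2 is the grid-wide barrier.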
Single-Kernel Multi-Block Reduction (Cooperative Groups)
With cooperative groups (CUDA 9.0+), you can synchronize across all blocks in a single kernel, avoiding the two-phase approach:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void cooperative_reduce(
const float* __restrict__ input,
float* __restrict__ result,
float* __restrict__ partial_sums,
int n
) {
cg::grid_group grid = cg::this_grid();
float local_sum = 0.0f;
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
for (int i = tid; i < n; i += stride) {
local_sum += input[i];
}
// Block-level reduction
local_sum = block_reduce_sum<256>(local_sum);
if (threadIdx.x == 0) {
partial_sums[blockIdx.x] = local_sum;
}
// Grid-wide synchronization
grid.sync();
// Block 0 reduces partial sums
if (blockIdx.x == 0) {
local_sum = 0.0f;
for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x) {
local_sum += partial_sums[i];
}
local_sum = block_reduce_sum<256>(local_sum);
if (threadIdx.x == 0) {
result[0] = local_sum;
}
}
}
Cooperative kernels must be launched with cudaLaunchCooperativeKernel, and the grid must fit entirely on the device: grid.sync() only works if all blocks are resident simultaneously, so you cannot launch an arbitrarily large grid. Use cudaOccupancyMaxActiveBlocksPerMultiprocessor (multiplied by the SM count) to determine the maximum grid size.
Implementation 1: Histogram Kernel
Naive Histogram (Global Atomics)
// Naive: every thread atomicAdd to global histogram
__global__ void histogram_naive(
const int* __restrict__ data,
int* __restrict__ hist,
int n,
int num_bins
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
int bin = data[idx] % num_bins;
atomicAdd(&hist[bin], 1);
}
}
Optimized Histogram (Shared Memory + Warp-Level)
// Optimized: per-block histogram in shared memory,
// then atomicAdd to global histogram
template <int NUM_BINS>
__global__ void histogram_optimized(
const int* __restrict__ data,
int* __restrict__ hist,
int n
) {
// Per-block histogram in shared memory
__shared__ int local_hist[NUM_BINS];
// Initialize shared memory histogram
for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
local_hist[i] = 0;
}
__syncthreads();
// Each thread processes multiple elements (grid-stride loop)
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
for (int i = tid; i < n; i += stride) {
int bin = data[i] % NUM_BINS;
// Atomic add to shared memory (much faster than global)
atomicAdd(&local_hist[bin], 1);
}
__syncthreads();
// Write per-block histogram to global histogram
// Only NUM_BINS global atomics per block (not n/gridSize)
for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
if (local_hist[i] > 0) {
atomicAdd(&hist[i], local_hist[i]);
}
}
}
// Even more optimized: per-warp histograms to reduce
// shared memory contention
template <int NUM_BINS>
__global__ void histogram_warp_privatized(
const int* __restrict__ data,
int* __restrict__ hist,
int n
) {
// Each warp gets its own histogram in shared memory
constexpr int WARPS_PER_BLOCK = 8; // assumes blockDim.x == 256
__shared__ int warp_hist[WARPS_PER_BLOCK][NUM_BINS];
int warp_id = threadIdx.x / 32;
int lane = threadIdx.x % 32;
// Initialize
for (int i = lane; i < NUM_BINS; i += 32) {
warp_hist[warp_id][i] = 0;
}
__syncwarp();
// Accumulate into per-warp histogram
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
for (int i = tid; i < n; i += stride) {
int bin = data[i] % NUM_BINS;
atomicAdd(&warp_hist[warp_id][bin], 1);
}
__syncthreads();
// Reduce per-warp histograms to global
for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
int sum = 0;
for (int w = 0; w < WARPS_PER_BLOCK; w++) {
sum += warp_hist[w][i];
}
if (sum > 0) {
atomicAdd(&hist[i], sum);
}
}
}
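The privatization logic is straightforward to verify against a sequential histogram. This CPU sketch mirrors the structure of the kernels above: each "block" accumulates a private histogram over its chunk, then the privates merge into the global histogram with one addition per bin:

```python
def privatized_histogram(data, num_bins, num_blocks=4):
    """CPU mirror of the privatized histogram: per-'block' private
    histograms, merged into the global histogram once at the end."""
    global_hist = [0] * num_bins
    chunk = (len(data) + num_blocks - 1) // num_blocks
    for b in range(num_blocks):
        local = [0] * num_bins                 # private 'shared memory' histogram
        for x in data[b * chunk:(b + 1) * chunk]:
            local[x % num_bins] += 1           # cheap local update
        for i, count in enumerate(local):      # one merge add per non-empty bin
            if count:
                global_hist[i] += count
    return global_hist

data = [7, 7, 3, 255, 300, 3, 7]
h = privatized_histogram(data, 256)
print(h[7], h[3], h[255], h[44])  # 3 2 1 1  (300 % 256 == 44)
```

The merge cost is bins-per-block rather than elements-per-block, which is the whole point: global atomic traffic no longer scales with input size.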
Histogram Kernel Performance (10M elements, 256 bins, A100)
| Implementation | Time (ms) | Speedup | Global Atomics |
|---|---|---|---|
| Naive (global atomicAdd) | 1.82 | 1.0x | 10M |
| Block shared memory | 0.28 | 6.5x | ~1K (grid blocks * bins/block) |
| Warp-privatized | 0.19 | 9.6x | ~1K |
| CUB DeviceHistogram | 0.16 | 11.4x | Implementation-specific |
Implementation 2: Parallel Prefix Sum (Scan)
The Algorithm
Parallel prefix sum (inclusive scan) computes: given input [x_0, x_1, ..., x_{n-1}], produce output [x_0, x_0 + x_1, x_0 + x_1 + x_2, ..., x_0 + ... + x_{n-1}]. This is a fundamental primitive used in stream compaction, radix sort, and attention mask generation.
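For reference, the sequential semantics fit in one line of Python: itertools.accumulate computes exactly the inclusive scan.

```python
from itertools import accumulate

x = [3, 1, 4, 1, 5]
print(list(accumulate(x)))  # [3, 4, 8, 9, 14]
```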
Block-Level Scan
// Brent-Kung-style inclusive scan within a block (up-sweep + down-sweep)
template <int BLOCK_SIZE>
__device__ void block_inclusive_scan(
float* __restrict__ data,
float* __restrict__ shared_data
) {
int tid = threadIdx.x;
// Load data to shared memory
shared_data[tid] = (tid < BLOCK_SIZE) ? data[tid] : 0.0f;
__syncthreads();
// Up-sweep (reduce) phase
for (int stride = 1; stride < BLOCK_SIZE; stride *= 2) {
int index = (tid + 1) * stride * 2 - 1;
if (index < BLOCK_SIZE) {
shared_data[index] += shared_data[index - stride];
}
__syncthreads();
}
// Down-sweep phase
for (int stride = BLOCK_SIZE / 4; stride > 0; stride /= 2) {
int index = (tid + 1) * stride * 2 - 1;
if (index + stride < BLOCK_SIZE) {
shared_data[index + stride] += shared_data[index];
}
__syncthreads();
}
// Write back
if (tid < BLOCK_SIZE) {
data[tid] = shared_data[tid];
}
}
Multi-Block Scan (Three-Phase)
// Phase 1: Block-level scan + save block sums
__global__ void scan_phase1(
float* __restrict__ data,
float* __restrict__ block_sums,
int n
) {
__shared__ float shared[1024];
int block_start = blockIdx.x * blockDim.x;
int tid = threadIdx.x;
int gid = block_start + tid;
// Load
shared[tid] = (gid < n) ? data[gid] : 0.0f;
__syncthreads();
// Inclusive scan within block (Hillis-Steele)
for (int stride = 1; stride < blockDim.x; stride *= 2) {
float temp = 0.0f;
if (tid >= stride) {
temp = shared[tid - stride];
}
__syncthreads();
shared[tid] += temp;
__syncthreads();
}
// Write scanned data back
if (gid < n) {
data[gid] = shared[tid];
}
// Save the last element as this block's sum
if (tid == blockDim.x - 1) {
block_sums[blockIdx.x] = shared[tid];
}
}
// Phase 2: Scan the block sums themselves
// (Use phase 1 on the block_sums array; if more than one block
// is needed, recurse)
// Phase 3: Add scanned block sums to each block's elements
__global__ void scan_phase3(
float* __restrict__ data,
const float* __restrict__ scanned_block_sums,
int n
) {
if (blockIdx.x == 0) return; // First block has no prefix to add
int gid = blockIdx.x * blockDim.x + threadIdx.x;
if (gid < n) {
data[gid] += scanned_block_sums[blockIdx.x - 1];
}
}
// Host-side orchestration
void parallel_inclusive_scan(float* d_data, int n) {
int block_size = 1024;
int num_blocks = (n + block_size - 1) / block_size;
float* d_block_sums;
cudaMalloc(&d_block_sums, num_blocks * sizeof(float));
// Phase 1: Scan within blocks
scan_phase1<<<num_blocks, block_size>>>(d_data, d_block_sums, n);
// Phase 2: Scan block sums
if (num_blocks > 1) {
if (num_blocks <= block_size) {
// Single block can handle all block sums
float* d_dummy;
cudaMalloc(&d_dummy, sizeof(float));
scan_phase1<<<1, num_blocks>>>(d_block_sums, d_dummy,
num_blocks);
cudaFree(d_dummy);
} else {
// Recursive scan for very large arrays
parallel_inclusive_scan(d_block_sums, num_blocks);
}
}
// Phase 3: Propagate block sums
scan_phase3<<<num_blocks, block_size>>>(
d_data, d_block_sums, n
);
cudaFree(d_block_sums);
}
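The three-phase structure can be mirrored on the CPU for testing. The sketch below scans fixed-size chunks, scans the chunk sums, and adds each chunk's prefix back in, exactly as the host orchestration above does:

```python
from itertools import accumulate

def multi_block_scan(data, block_size=4):
    """CPU mirror of the three-phase scan: per-block inclusive scan,
    scan of block sums, then prefix propagation."""
    out, block_sums = [], []
    for i in range(0, len(data), block_size):
        scanned = list(accumulate(data[i:i + block_size]))  # phase 1
        out.extend(scanned)
        block_sums.append(scanned[-1])
    prefixes = [0] + list(accumulate(block_sums))           # phase 2
    for i, v in enumerate(out):                             # phase 3
        out[i] = v + prefixes[i // block_size]
    return out

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(multi_block_scan(data) == list(accumulate(data)))  # True
```

Note that block 0's prefix is zero, which is why scan_phase3 above skips blockIdx.x == 0 and indexes scanned_block_sums[blockIdx.x - 1] for the rest.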
Warp-Level Scan (Faster for Small Arrays)
// Inclusive scan within a warp using __shfl_up_sync
__device__ float warp_inclusive_scan(float val) {
unsigned mask = 0xFFFFFFFF;
#pragma unroll
for (int offset = 1; offset < 32; offset *= 2) {
float n = __shfl_up_sync(mask, val, offset);
if (threadIdx.x % 32 >= offset) {
val += n;
}
}
return val;
}
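The same doubling pattern can be simulated lane-by-lane: `vals[i]` plays lane i's register, and reading from index `i - offset` (guarded for lanes below `offset`) mirrors `__shfl_up_sync`:

```python
def warp_scan_sim(vals):
    """Simulate the __shfl_up_sync inclusive scan: at each step, lanes
    with index >= offset add the value `offset` lanes below them."""
    n = len(vals)
    offset = 1
    while offset < n:
        vals = [vals[i] + (vals[i - offset] if i >= offset else 0)
                for i in range(n)]
        offset *= 2
    return vals

print(warp_scan_sim([1] * 32))  # [1, 2, 3, ..., 32]
```

This is the Hillis-Steele pattern: log2(32) = 5 steps, with every lane active at every step, which maps perfectly onto a warp where all 32 lanes execute in lockstep anyway.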
// Optimized block scan using warp scan as building block
template <int BLOCK_SIZE>
__global__ void optimized_block_scan(
float* __restrict__ data,
float* __restrict__ block_sums,
int n
) {
constexpr int NUM_WARPS = BLOCK_SIZE / 32;
__shared__ float warp_sums[NUM_WARPS];
int gid = blockIdx.x * BLOCK_SIZE + threadIdx.x;
int warp_id = threadIdx.x / 32;
int lane = threadIdx.x % 32;
// Load
float val = (gid < n) ? data[gid] : 0.0f;
// Warp-level inclusive scan
val = warp_inclusive_scan(val);
// Save warp sums
if (lane == 31) {
warp_sums[warp_id] = val;
}
__syncthreads();
// Scan warp sums (first warp only)
if (warp_id == 0 && lane < NUM_WARPS) {
float ws = warp_sums[lane];
ws = warp_inclusive_scan(ws);
warp_sums[lane] = ws;
}
__syncthreads();
// Add prefix from previous warps
if (warp_id > 0) {
val += warp_sums[warp_id - 1];
}
// Write back
if (gid < n) {
data[gid] = val;
}
// Save block sum
if (threadIdx.x == BLOCK_SIZE - 1) {
block_sums[blockIdx.x] = val;
}
}
Parallel Prefix Sum Performance (16M elements, FP32, A100)
| Implementation | Time (ms) | Bandwidth (GB/s) | Relative |
|---|---|---|---|
| Naive Hillis-Steele scan | 2.45 | 52.4 | 1.0x |
| Blelloch with shared memory | 0.82 | 156.6 | 3.0x |
| Warp-scan + block-scan | 0.51 | 251.7 | 4.8x |
| CUB DeviceScan | 0.38 | 337.5 | 6.4x |
Reduction Pattern Performance Hierarchy
[Figure omitted: relative throughput of the reduction patterns above]
Practical Guidelines
Choosing the Right Reduction Strategy
def reduction_strategy_guide():
"""Guide for choosing reduction implementation."""
strategies = {
"32 or fewer values": {
"method": "Single warp shuffle reduction",
"cost": "5 cycles",
"example": "Reduce within a warp",
},
"33-1024 values": {
"method": "Block reduction (warp shuffle + shared memory)",
"cost": "~50 cycles",
"example": "Reduce hidden dimension for LayerNorm",
},
"1K-1M values": {
"method": "Multi-block: block reduction + global atomic",
"cost": "~500 cycles + 1 atomic per block",
"example": "Loss reduction, gradient accumulation",
},
"More than 1M values": {
"method": "Two-phase: block reduction + scan of block sums",
"cost": "2 kernel launches",
"example": "Full-array reduction, histogram",
},
}
for size_range, info in strategies.items():
print(f"\n{size_range}:")
for key, val in info.items():
print(f" {key}: {val}")
NVIDIA's CUB library (cub::DeviceReduce, cub::DeviceScan, cub::DeviceHistogram) provides highly optimized implementations of all reduction patterns, auto-tuned for each GPU architecture. Write custom reduction kernels only when you need to fuse the reduction with other operations (e.g., LayerNorm = reduction + elementwise) or when the CUB API does not cover your specific pattern.
Summary
Atomic operations are simple but scale poorly under contention. The optimization hierarchy is: warp-level shuffle reductions (register speed, 32 threads), block-level reductions (shared memory, up to 1024 threads), and multi-block reductions (global memory coordination, unlimited threads). Each level reduces the number of global atomics by a factor of 32-256, producing 10-100x speedups over naive global atomics.
The two concrete implementations, histogram and prefix sum, demonstrate these patterns in practice. The histogram kernel moves from 10M global atomics (naive) to under 1K (warp-privatized shared memory histograms merged with a single round of global atomics). The prefix sum uses warp-level scans as building blocks for an efficient multi-block scan that achieves 250+ GB/s effective bandwidth on an A100.