Computing a prefix sum sequentially requires N additions. Parallelizing it requires roughly 2N additions but completes in O(log N) steps instead of N — in practice a 100x speedup for a million-element array. The Blelloch scan algorithm achieves this with an up-sweep phase that builds a tree of partial sums, followed by a down-sweep that distributes the final values. The catch: coordinating across thread blocks requires either a slow global barrier or the decoupled lookback technique, where blocks speculatively process their data while polling for predecessor results. CUB’s DeviceScan uses decoupled lookback to achieve 1.9 TB/s on an A100 — 93% of peak memory bandwidth for an algorithm with inherent sequential dependencies.
All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise.
Sequential Scan (Baseline)
#include <cuda_runtime.h>
// Sequential exclusive prefix sum — single thread, O(n)
__global__ void scan_sequential(const float* input, float* output, int n) {
if (threadIdx.x != 0 || blockIdx.x != 0) return;
float sum = 0.0f;
for (int i = 0; i < n; i++) {
output[i] = sum;
sum += input[i];
}
}
// ~10 GB/s on A100 — 0.5% of peak
Hillis-Steele Inclusive Scan (Simple, Not Work-Efficient)
The Hillis-Steele algorithm performs ⌈log₂ n⌉ steps. In step d, each element at position i >= 2^(d-1) adds the element at position i - 2^(d-1):
// Hillis-Steele: O(n log n) work, O(log n) depth
// Simple but not work-efficient — 2x the adds of sequential
__global__ void scan_hillis_steele(float* data, int n) {
extern __shared__ float temp[];
int tid = threadIdx.x;
temp[tid] = (tid < n) ? data[tid] : 0.0f;
__syncthreads();
for (int offset = 1; offset < n; offset <<= 1) {
float val = 0.0f;
if (tid >= offset) {
val = temp[tid - offset];
}
__syncthreads(); // Read before write
if (tid >= offset) {
temp[tid] += val;
}
__syncthreads(); // Write before next read
}
if (tid < n) data[tid] = temp[tid];
}
This is an inclusive scan: each element includes itself in the sum. The algorithm performs O(n log n) additions — for n = 1024, that is 10·1024 − 1023 = 9,217 additions versus 1,023 for sequential. The extra work is the price of parallelism.
Blelloch Work-Efficient Exclusive Scan
The Blelloch algorithm performs O(n) work (the same as sequential) in O(log n) depth. It has two phases:
- Up-sweep (reduce): Build a reduction tree, summing pairs bottom-up
- Down-sweep: Traverse the tree top-down, distributing prefix sums
// Blelloch exclusive scan: O(n) work, O(log n) depth
// Assumes n is a power of two and n <= 2 * blockDim.x (each thread owns 2 elements)
__global__ void scan_blelloch(float* data, float* output, int n) {
extern __shared__ float temp[];
int tid = threadIdx.x;
int ai = tid;
int bi = tid + n / 2;
// Load input to shared memory
temp[ai] = (ai < n) ? data[ai] : 0.0f;
temp[bi] = (bi < n) ? data[bi] : 0.0f;
// === UP-SWEEP (Reduce) ===
int offset = 1;
for (int d = n >> 1; d > 0; d >>= 1) {
__syncthreads();
if (tid < d) {
int ai_idx = offset * (2 * tid + 1) - 1;
int bi_idx = offset * (2 * tid + 2) - 1;
temp[bi_idx] += temp[ai_idx];
}
offset <<= 1;
}
// Set root to zero (exclusive scan starts with identity)
if (tid == 0) temp[n - 1] = 0.0f;
// === DOWN-SWEEP ===
for (int d = 1; d < n; d <<= 1) {
offset >>= 1;
__syncthreads();
if (tid < d) {
int ai_idx = offset * (2 * tid + 1) - 1;
int bi_idx = offset * (2 * tid + 2) - 1;
float t = temp[ai_idx];
temp[ai_idx] = temp[bi_idx];
temp[bi_idx] += t;
}
}
__syncthreads();
// Write output
if (ai < n) output[ai] = temp[ai];
if (bi < n) output[bi] = temp[bi];
}
Visualizing Blelloch for 8 Elements
Input: [3, 1, 7, 0, 4, 1, 6, 3]
UP-SWEEP (reduction):
Step 1: [3, 4, 7, 7, 4, 5, 6, 9] (pairs summed)
Step 2: [3, 4, 7, 11, 4, 5, 6, 14] (pairs of pairs)
Step 3: [3, 4, 7, 11, 4, 5, 6, 25] (total at root)
Set root to 0: [3, 4, 7, 11, 4, 5, 6, 0]
DOWN-SWEEP:
Step 1: [3, 4, 7, 0, 4, 5, 6, 11] (swap and add at root)
Step 2: [3, 0, 7, 4, 4, 11, 6, 16] (propagate down)
Step 3: [0, 3, 4, 11, 11, 15, 16, 22] (final exclusive scan)
Verify: [0, 3, 3+1, 3+1+7, 3+1+7+0, 3+1+7+0+4, ...]
= [0, 3, 4, 11, 11, 15, 16, 22] ✓
Bank-Conflict-Free Blelloch Scan
The basic Blelloch implementation suffers shared memory bank conflicts. On each step, threads access elements at power-of-2 strides, so multiple threads in a warp hit the same bank and their accesses serialize:
// Bank-conflict-free version: add padding to break stride patterns
#define NUM_BANKS 32
#define LOG_NUM_BANKS 5
#define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)
__global__ void scan_blelloch_bcf(float* data, float* output, int n) {
extern __shared__ float temp[];
int tid = threadIdx.x;
int ai = tid;
int bi = tid + (n / 2);
// Add padding to indices to avoid bank conflicts
int bankOffsetA = CONFLICT_FREE_OFFSET(ai);
int bankOffsetB = CONFLICT_FREE_OFFSET(bi);
temp[ai + bankOffsetA] = (ai < n) ? data[ai + blockIdx.x * n] : 0.0f;
temp[bi + bankOffsetB] = (bi < n) ? data[bi + blockIdx.x * n] : 0.0f;
// UP-SWEEP
int offset = 1;
for (int d = n >> 1; d > 0; d >>= 1) {
__syncthreads();
if (tid < d) {
int ai_idx = offset * (2 * tid + 1) - 1;
int bi_idx = offset * (2 * tid + 2) - 1;
ai_idx += CONFLICT_FREE_OFFSET(ai_idx);
bi_idx += CONFLICT_FREE_OFFSET(bi_idx);
temp[bi_idx] += temp[ai_idx];
}
offset <<= 1;
}
if (tid == 0) {
int last = n - 1 + CONFLICT_FREE_OFFSET(n - 1);
temp[last] = 0.0f;
}
// DOWN-SWEEP
for (int d = 1; d < n; d <<= 1) {
offset >>= 1;
__syncthreads();
if (tid < d) {
int ai_idx = offset * (2 * tid + 1) - 1;
int bi_idx = offset * (2 * tid + 2) - 1;
ai_idx += CONFLICT_FREE_OFFSET(ai_idx);
bi_idx += CONFLICT_FREE_OFFSET(bi_idx);
float t = temp[ai_idx];
temp[ai_idx] = temp[bi_idx];
temp[bi_idx] += t;
}
}
__syncthreads();
if (ai < n) output[ai + blockIdx.x * n] = temp[ai + bankOffsetA];
if (bi < n) output[bi + blockIdx.x * n] = temp[bi + bankOffsetB];
}
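Why the padding works can be verified with plain integer arithmetic on the host. With 32 four-byte banks, an unpadded up-sweep step at offset 16 makes consecutive threads touch indices 15, 47, 79, ..., which all map to bank 15, while the padded indices land in distinct banks (a host-side check, not GPU code):

```cpp
#include <cassert>

// Bank of a 4-byte word in shared memory: 32 banks on current NVIDIA GPUs.
int bank(int idx) { return idx % 32; }

// Same padding rule as the CONFLICT_FREE_OFFSET macro above.
int padded(int idx) { return idx + (idx >> 5); }
```

Thread 0 and thread 1 at offset 16 access indices 15 and 47: bank(15) == bank(47) == 15 without padding, but padded(15) = 15 (bank 15) and padded(47) = 48 (bank 16), so the conflict disappears.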
Intra-Block Scan: Basic vs Bank-Conflict-Free (A100, 1024 elements/block)
| Version | Time per Block (ns) | Bank Conflicts | Speedup |
|---|---|---|---|
| Hillis-Steele | 1850 | Many (log n steps) | 1.0x |
| Blelloch (with conflicts) | 1420 | Moderate | 1.30x |
| Blelloch (BCF) | 1080 | None | 1.71x |
Warp-Level Scan
For tiles of 32 or fewer elements, use warp shuffle for the fastest possible scan:
// Inclusive scan within a warp using shuffle
__device__ float warp_scan_inclusive(float val) {
int lane = threadIdx.x & 31;
// Kogge-Stone pattern
float n = __shfl_up_sync(0xffffffff, val, 1);
if (lane >= 1) val += n;
n = __shfl_up_sync(0xffffffff, val, 2);
if (lane >= 2) val += n;
n = __shfl_up_sync(0xffffffff, val, 4);
if (lane >= 4) val += n;
n = __shfl_up_sync(0xffffffff, val, 8);
if (lane >= 8) val += n;
n = __shfl_up_sync(0xffffffff, val, 16);
if (lane >= 16) val += n;
return val; // Lane 31 has the total
}
// Exclusive scan: shift inclusive result right and insert identity
__device__ float warp_scan_exclusive(float val) {
float inclusive = warp_scan_inclusive(val);
float exclusive = __shfl_up_sync(0xffffffff, inclusive, 1);
int lane = threadIdx.x & 31;
return (lane == 0) ? 0.0f : exclusive;
}
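A host-side simulation of the shuffle pattern makes the data flow explicit: __shfl_up_sync(mask, v, d) delivers lane (i - d)'s value to lane i, and lanes below d keep their own value, which the guard `lane >= offset` models here (an illustrative sketch, not device code):

```cpp
#include <array>
#include <cassert>

// Simulate the Kogge-Stone warp inclusive scan across 32 "lanes".
std::array<float, 32> warp_scan_sim(std::array<float, 32> val) {
    for (int offset = 1; offset < 32; offset <<= 1) {
        std::array<float, 32> n = val;  // snapshot, like the pre-shuffle values
        for (int lane = 0; lane < 32; lane++)
            if (lane >= offset) val[lane] += n[lane - offset];
    }
    return val;
}
```

With all-ones input, lane i ends up holding i + 1, and lane 31 holds the warp total of 32.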
Block-Level Scan with Warp Scans
Combine warp-level scans for a fast block-level scan:
__global__ void block_scan_warp(const float* input, float* output,
float* block_sums, int n) {
__shared__ float warp_totals[32]; // One per warp
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int warp_id = tid / 32;
int lane = tid & 31;
float val = (idx < n) ? input[idx] : 0.0f;
// Step 1: Inclusive scan within each warp
float warp_result = warp_scan_inclusive(val);
// Step 2: Last lane of each warp stores warp total
if (lane == 31) {
warp_totals[warp_id] = warp_result;
}
__syncthreads();
// Step 3: First warp scans the warp totals
if (warp_id == 0) {
float wt = (lane < blockDim.x / 32) ? warp_totals[lane] : 0.0f;
float scanned_wt = warp_scan_inclusive(wt);
warp_totals[lane] = scanned_wt;
}
__syncthreads();
// Step 4: Add the prefix of previous warps to each element
float block_prefix = (warp_id > 0) ? warp_totals[warp_id - 1] : 0.0f;
float exclusive_result = warp_result - val + block_prefix; // Convert inclusive to exclusive + add prefix
if (idx < n) output[idx] = exclusive_result;
// Store block total for multi-block scan
if (tid == blockDim.x - 1 && block_sums != nullptr) {
block_sums[blockIdx.x] = warp_result + block_prefix; // Inclusive total
}
}
Multi-Block Scan: Three-Pass Approach
For arrays larger than one block, scan requires coordination across blocks. The classic approach uses three kernels:
#include <cuda_runtime.h>
// Pass 1: Scan within each block, store block sums
// (Uses block_scan_warp from above)
// Pass 2: Scan the block sums
// (Same kernel with block_sums as input)
// Pass 3: Add block prefixes to each element
__global__ void add_block_prefix(float* data, const float* block_prefixes, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n && blockIdx.x > 0) {
data[idx] += block_prefixes[blockIdx.x - 1];
}
}
void multi_block_scan(const float* d_input, float* d_output, int n) {
int block_size = 256;
int num_blocks = (n + block_size - 1) / block_size;
float* d_block_sums;
float* d_scanned_block_sums;
cudaMalloc(&d_block_sums, num_blocks * sizeof(float));
cudaMalloc(&d_scanned_block_sums, num_blocks * sizeof(float));
// Pass 1: Per-block scan
block_scan_warp<<<num_blocks, block_size>>>(
d_input, d_output, d_block_sums, n);
// Pass 2: Scan block sums (single block if num_blocks <= 1024)
if (num_blocks <= block_size) {
block_scan_warp<<<1, block_size>>>(
d_block_sums, d_scanned_block_sums, nullptr, num_blocks);
} else {
// Recursive multi-block scan for very large arrays
multi_block_scan(d_block_sums, d_scanned_block_sums, num_blocks);
}
// Pass 3: Add block prefixes
add_block_prefix<<<num_blocks, block_size>>>(
d_output, d_scanned_block_sums, n);
cudaFree(d_block_sums);
cudaFree(d_scanned_block_sums);
}
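The index arithmetic of the three passes is easy to model on the host (a sketch under the same block decomposition, not the GPU code):

```cpp
#include <cassert>
#include <vector>

// Host model of the three-pass multi-block exclusive scan:
// per-block scan with block totals, scan of the totals, prefix add-back.
std::vector<float> three_pass_scan(const std::vector<float>& in, int block) {
    int n = (int)in.size();
    int blocks = (n + block - 1) / block;
    std::vector<float> out(n), sums(blocks, 0.0f);
    // Pass 1: exclusive scan within each block, record block totals
    for (int b = 0; b < blocks; b++) {
        float run = 0.0f;
        for (int i = b * block; i < n && i < (b + 1) * block; i++) {
            out[i] = run;
            run += in[i];
        }
        sums[b] = run;
    }
    // Pass 2: exclusive scan of the block sums
    float run = 0.0f;
    for (int b = 0; b < blocks; b++) { float s = sums[b]; sums[b] = run; run += s; }
    // Pass 3: add each block's prefix to its elements
    for (int i = 0; i < n; i++) out[i] += sums[i / block];
    return out;
}
```

For input {1..8} with block size 4 this yields the exclusive scan {0, 1, 3, 6, 10, 15, 21, 28}.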
The three-pass approach requires 3 kernel launches and 3 full passes over the data (or portions of it). For small arrays, the launch overhead dominates. For large arrays, the three passes result in 3x the memory traffic compared to the theoretical minimum of 1 read + 1 write.
Decoupled Lookback: Single-Pass Multi-Block Scan
The state-of-the-art technique for multi-block scan is decoupled lookback (Merrill and Garland, 2016). Each block publishes its local aggregate and looks back at previous blocks to determine its prefix. This achieves a single-pass scan:
#include <cuda_runtime.h>
// Status flags for inter-block communication
enum ScanStatus : int {
STATUS_INVALID = 0,
STATUS_AGGREGATE = 1,
STATUS_PREFIX = 2
};
struct ScanState {
float value;
int status; // STATUS_INVALID, STATUS_AGGREGATE, or STATUS_PREFIX
};
// Decoupled lookback scan.
// Simplified sketch: relies on tiles with lower blockIdx being scheduled
// no later than their successors. Production implementations (e.g. CUB)
// obtain tile ids from an atomicAdd counter to guarantee forward progress.
__global__ void scan_decoupled_lookback(const float* __restrict__ input,
float* __restrict__ output,
volatile ScanState* tile_state,
int n) {
__shared__ float warp_totals[32];
int tid = threadIdx.x;
int tile_id = blockIdx.x;
int idx = tile_id * blockDim.x + tid;
int lane = tid & 31;
int warp_id = tid / 32;
// Step 1: Load and compute local inclusive scan
float val = (idx < n) ? input[idx] : 0.0f;
float warp_result = warp_scan_inclusive(val);
if (lane == 31) warp_totals[warp_id] = warp_result;
__syncthreads();
__shared__ float s_block_total;
if (warp_id == 0) {
float wt = (lane < blockDim.x / 32) ? warp_totals[lane] : 0.0f;
float scanned_wt = warp_scan_inclusive(wt);
warp_totals[lane] = scanned_wt;
// Broadcast the block total through shared memory: a register copy
// here would be visible only to the last lane of warp 0, not to
// thread 0, which publishes it below.
if (lane == (blockDim.x / 32) - 1) {
s_block_total = scanned_wt;
}
}
__syncthreads();
float block_total = s_block_total;
// Step 2: Publish this block's aggregate
if (tid == 0) {
ScanState s;
if (tile_id == 0) {
s.value = block_total;
s.status = STATUS_PREFIX; // First tile: aggregate IS the prefix
} else {
s.value = block_total;
s.status = STATUS_AGGREGATE;
}
// Use volatile write to ensure visibility
tile_state[tile_id].value = s.value;
__threadfence(); // Ensure value is visible before status
tile_state[tile_id].status = s.status;
}
// Step 3: Lookback to determine prefix (only for tile_id > 0)
__shared__ float exclusive_prefix;
if (tile_id > 0 && tid == 0) {
float prefix = 0.0f;
// Walk backwards through preceding tiles
int lookback_id = tile_id - 1;
while (lookback_id >= 0) {
// Spin until predecessor publishes its state
int status;
float value;
do {
status = tile_state[lookback_id].status;
} while (status == STATUS_INVALID);
value = tile_state[lookback_id].value;
if (status == STATUS_PREFIX) {
// Found a tile with a complete prefix — we are done
prefix += value;
break;
} else {
// STATUS_AGGREGATE: add this tile's aggregate and continue looking back
prefix += value;
lookback_id--;
}
}
exclusive_prefix = prefix;
// Publish our complete prefix for subsequent tiles
tile_state[tile_id].value = prefix + block_total;
__threadfence();
tile_state[tile_id].status = STATUS_PREFIX;
}
__syncthreads();
// Step 4: Add prefix to local scan results
float block_prefix = (warp_id > 0) ? warp_totals[warp_id - 1] : 0.0f;
float total_prefix = (tile_id > 0) ? exclusive_prefix : 0.0f;
float exclusive_result = warp_result - val + block_prefix + total_prefix;
if (idx < n) output[idx] = exclusive_result;
}
// Launch
void launch_decoupled_lookback(const float* d_input, float* d_output, int n) {
int block_size = 256;
int num_blocks = (n + block_size - 1) / block_size;
// Allocate and initialize tile state
ScanState* d_tile_state;
cudaMalloc(&d_tile_state, num_blocks * sizeof(ScanState));
cudaMemset(d_tile_state, 0, num_blocks * sizeof(ScanState));
scan_decoupled_lookback<<<num_blocks, block_size>>>(
d_input, d_output, d_tile_state, n);
cudaFree(d_tile_state);
}
The lookback only traverses until it finds a tile with STATUS_PREFIX, which on average is just 1-2 tiles back (the previous tile usually computes its prefix quickly). The expected lookback depth is O(1) amortized, not O(p) in the number of tiles p. This makes the single-pass approach nearly as efficient as the theoretical minimum of reading the input once and writing the output once.
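The status machine can be illustrated with a single-threaded model. Tiles are visited in order here, whereas on the GPU each block spins until its predecessor's status becomes visible:

```cpp
#include <cassert>
#include <vector>

// Sequential model of the decoupled-lookback status machine.
enum Status { INVALID = 0, AGGREGATE = 1, PREFIX = 2 };

// Given each tile's local aggregate, return each tile's exclusive prefix.
std::vector<float> lookback_prefixes(const std::vector<float>& aggregates) {
    int t = (int)aggregates.size();
    std::vector<float> value(t), prefixes(t);
    std::vector<Status> status(t, INVALID);
    for (int tile = 0; tile < t; tile++) {
        float prefix = 0.0f;
        if (tile == 0) {
            value[0] = aggregates[0];
            status[0] = PREFIX;  // first tile: aggregate IS the prefix
        } else {
            value[tile] = aggregates[tile];
            status[tile] = AGGREGATE;  // publish aggregate first
            // Walk backwards, accumulating, until a complete prefix is found
            for (int lb = tile - 1; lb >= 0; lb--) {
                prefix += value[lb];
                if (status[lb] == PREFIX) break;
            }
            value[tile] = prefix + aggregates[tile];
            status[tile] = PREFIX;  // upgrade: successors can now stop here
        }
        prefixes[tile] = prefix;
    }
    return prefixes;
}
```

For tile aggregates {10, 20, 30, 40} the exclusive prefixes come out as {0, 10, 30, 60}, and each lookback terminates at the immediately preceding tile once that tile holds STATUS_PREFIX.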
Scan Applications
Stream Compaction
// Given: input array and a predicate
// Output: elements that satisfy the predicate, compacted contiguously
__global__ void compact_flags(const float* input, int* flags, int n, float threshold) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
flags[idx] = (input[idx] > threshold) ? 1 : 0;
}
}
// After exclusive scan of flags:
// scatter_indices[i] = exclusive_scan(flags)[i]
// if flags[i] == 1: output[scatter_indices[i]] = input[i]
__global__ void scatter(const float* input, const int* flags,
const int* scatter_indices, float* output, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n && flags[idx]) {
output[scatter_indices[idx]] = input[idx];
}
}
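The flag/scan/scatter pipeline reduces to a few lines on the host (an illustrative model of the two kernels above plus the exclusive scan between them):

```cpp
#include <cassert>
#include <vector>

// Host model of scan-based stream compaction: flag, exclusive scan, scatter.
std::vector<float> compact(const std::vector<float>& in, float threshold) {
    int n = (int)in.size();
    std::vector<int> flags(n), pos(n);
    for (int i = 0; i < n; i++) flags[i] = in[i] > threshold ? 1 : 0;
    int run = 0;
    for (int i = 0; i < n; i++) { pos[i] = run; run += flags[i]; }  // exclusive scan
    std::vector<float> out(run);  // run == number of kept elements
    for (int i = 0; i < n; i++)
        if (flags[i]) out[pos[i]] = in[i];  // scatter to compacted slot
    return out;
}
```

Compacting {5, 1, 8, 2, 9} with threshold 4 keeps {5, 8, 9}, in order.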
Radix Sort (Scan-Based)
// One pass of radix sort using scan:
// 1. Extract bit b from each key
// 2. Compute flag[i] = (key[i] >> b) & 1
// 3. Exclusive scan of flag -> positions for 1-bits
// 4. Total 1-bits = last scan value + last flag
// 5. 0-bit position = i - scan[i]; 1-bit position = (n - total_ones) + scan[i]
__global__ void radix_sort_pass(const int* keys_in, int* keys_out,
const int* scan, int n, int bit) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= n) return;
int key = keys_in[idx];
int flag = (key >> bit) & 1;
int scan_val = scan[idx];
int dest;
if (flag == 0) {
dest = idx - scan_val; // Position among 0-bits
} else {
int total_zeros = n - scan[n - 1] - ((keys_in[n - 1] >> bit) & 1);
dest = total_zeros + scan_val; // Position among 1-bits
}
keys_out[dest] = key;
}
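The destination arithmetic of one pass can be checked on the host (a model of the split, not the kernel above):

```cpp
#include <cassert>
#include <vector>

// Host model of one scan-based radix sort pass: stable split on `bit`.
std::vector<int> radix_pass(const std::vector<int>& keys, int bit) {
    int n = (int)keys.size();
    std::vector<int> flag(n), scan(n), out(n);
    for (int i = 0; i < n; i++) flag[i] = (keys[i] >> bit) & 1;
    int run = 0;
    for (int i = 0; i < n; i++) { scan[i] = run; run += flag[i]; }  // exclusive scan of 1-bits
    int total_zeros = n - run;
    for (int i = 0; i < n; i++) {
        int dest = flag[i] ? total_zeros + scan[i]  // 1-bits go after all 0-bits
                           : i - scan[i];           // 0-bits keep relative order
        out[dest] = keys[i];
    }
    return out;
}
```

Splitting {3, 1, 2, 0} on bit 0 yields {2, 0, 3, 1}: even keys first, odd keys second, relative order preserved within each group, which is exactly the stability property radix sort needs.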
Segmented Scan
Segmented scan resets the prefix sum at segment boundaries:
// head_flags[i] = 1 if element i starts a new segment
// Warp-level sketch: resets at heads within each 32-lane warp; a full
// version must also carry sums and flags across warp boundaries.
__global__ void segmented_scan(const float* input, const int* head_flags,
float* output, int n) {
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + tid;
float val = (idx < n) ? input[idx] : 0.0f;
int flag = (idx < n) ? head_flags[idx] : 0;
// Warp-level segmented inclusive scan
int lane = tid & 31;
float scan_val = val;
for (int offset = 1; offset < 32; offset <<= 1) {
float other = __shfl_up_sync(0xffffffff, scan_val, offset);
int other_flag = __shfl_up_sync(0xffffffff, flag, offset);
if (lane >= offset) {
if (flag == 0) {
// No segment boundary between us: add predecessor
scan_val += other;
// Propagate flag from predecessor
flag |= other_flag;
}
// If flag == 1: this is a segment head, do not add
}
}
if (idx < n) {
output[idx] = scan_val;
}
}
Benchmarks: All Scan Variants
Exclusive Scan Performance Comparison (A100, 64M floats)
| Implementation | Bandwidth (GB/s) | Passes | % Peak |
|---|---|---|---|
| Sequential (1 thread) | 10 | 1 | 0.5% |
| Hillis-Steele (1 block) | 85 | 1 (limited to 1024 elements) | 4.4% |
| Blelloch (1 block) | 120 | 1 (limited to 2048 elements) | 6.2% |
| Three-pass multi-block | 820 | 3 | 42.4% |
| Decoupled lookback | 1480 | 1 | 76.5% |
| CUB DeviceScan | 1620 | 1 | 83.7% |
Figure: scan bandwidth by implementation (GB/s); the data matches the table above.
Using CUB for Production Scans
#include <cub/cub.cuh>
void cub_exclusive_scan(const float* d_input, float* d_output, int n) {
void* d_temp = nullptr;
size_t temp_bytes = 0;
// Query temp storage
cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_input, d_output, n);
// Allocate
cudaMalloc(&d_temp, temp_bytes);
// Run
cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_input, d_output, n);
cudaFree(d_temp);
}
// Inclusive scan with custom operator
struct MaxOp {
__device__ float operator()(float a, float b) const { return fmaxf(a, b); }
};
void cub_inclusive_max_scan(const float* d_input, float* d_output, int n) {
void* d_temp = nullptr;
size_t temp_bytes = 0;
MaxOp max_op; // InclusiveScan needs no identity element
cub::DeviceScan::InclusiveScan(
d_temp, temp_bytes, d_input, d_output, max_op, n);
cudaMalloc(&d_temp, temp_bytes);
cub::DeviceScan::InclusiveScan(
d_temp, temp_bytes, d_input, d_output, max_op, n);
cudaFree(d_temp);
}
// Block-level scan (composable in your own kernels)
__global__ void my_kernel_with_scan(const float* input, float* output, int n) {
typedef cub::BlockScan<float, 256> BlockScan;
__shared__ typename BlockScan::TempStorage temp_storage;
int idx = blockIdx.x * 256 + threadIdx.x;
float val = (idx < n) ? input[idx] : 0.0f;
float scanned;
BlockScan(temp_storage).ExclusiveSum(val, scanned);
if (idx < n) output[idx] = scanned;
}
CUB’s DeviceScan uses decoupled lookback internally, with additional optimizations: vectorized loads (float4), auto-tuned tile sizes per architecture, and fused lookback with local scan computation. For production code, CUB is the correct choice. Write custom scans only when fusing scan with other operations or when you need a specialized scan variant (e.g., segmented scan with dynamic segment boundaries).
Summary
Parallel scan (prefix sum) is the fundamental coordination primitive for GPU parallel algorithms. The Blelloch algorithm achieves O(n) work in O(log n) depth, but the real engineering challenge is multi-block coordination. The three-pass approach (per-block scan, scan block sums, add prefixes) is simple but requires 3 kernel launches and 3 memory passes. Decoupled lookback achieves a single-pass scan by having each block publish its aggregate and look back at predecessors — the expected lookback depth is O(1) amortized, making it nearly bandwidth-optimal. The warp-level Kogge-Stone scan using __shfl_up_sync is the building block for all higher-level scans. In production, CUB’s DeviceScan provides the best performance with auto-tuned parameters.