
Every CUDA program begins with the same three-step dance: copy data to the GPU, run a kernel, copy results back. For small workloads this is fine. For production workloads — training runs that last days, inference services that must hit latency targets under load, HPC simulations that iterate millions of times — that serial dance becomes a wall. The GPU sits idle while data crawls across PCIe, and the CPU sits idle while the GPU computes. CUDA streams are the mechanism that tears down that wall, letting you overlap data movement with computation, and even overlap different kernels with each other. But streams are not magic. They are a scheduling primitive, and using them effectively requires understanding what the hardware can actually do in parallel, where the bottlenecks live, and how to measure whether overlap is truly happening.

This post covers the full landscape: why the default stream serializes everything, how to create and manage multiple streams, how to achieve compute-transfer overlap with pinned memory, how PCIe and NVLink bandwidth numbers shape your optimization strategy, the multi-stream patterns used in production (pipelining, double buffering, multi-GPU fan-out), how to time operations correctly with CUDA events, how NCCL communication interacts with streams in distributed training, how to use Nsight Systems to verify overlap, and the cases where streams simply cannot help.

Why Streams Matter: The Default Stream Problem

When you launch a CUDA kernel or call cudaMemcpy without specifying a stream, everything goes into the default stream (also called stream 0 or the NULL stream). The default stream has a critical property: it serializes with all other streams. Every operation must complete before the next one begins.

Consider this baseline code:

cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);   // Step 1
kernelA<<<grid, block>>>(d_input, d_output);                   // Step 2
cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);  // Step 3

The GPU timeline looks like three sequential boxes. The H2D copy must finish before kernelA begins. kernelA must finish before the D2H copy begins. The total wall-clock time is the sum of all three:

T_{\text{serial}} = T_{\text{H2D}} + T_{\text{kernel}} + T_{\text{D2H}}

For a 100 MB transfer over PCIe Gen4 x16 (roughly 25 GB/s effective), each direction takes about 4 ms. If the kernel takes 3 ms, the total is 4 + 3 + 4 = 11 ms. But during the 4 ms H2D copy, every SM on the GPU is doing nothing. During the 3 ms kernel, the PCIe link is doing nothing. During the 4 ms D2H copy, the SMs are idle again. You are using at most one resource at a time.

Multiple streams unlock three forms of concurrency:

  1. Concurrent kernel execution — two independent kernels on different streams can share the SMs simultaneously, provided there are enough SMs and registers to go around.
  2. Overlap of compute and data transfer — a kernel on stream A runs while a memcpy on stream B moves data, because they use different hardware units (SMs vs. copy engines).
  3. Overlap of different-direction transfers — on GPUs with two copy engines, an H2D copy and a D2H copy can run simultaneously.

These three forms compose. In the ideal case, you fill every cycle with useful work on every hardware unit.

The three hardware engines

Most modern NVIDIA GPUs (Kepler and later) expose at least three independent execution engines: one or more copy engines for H2D transfers, one or more copy engines for D2H transfers, and the SM array for kernel execution. The asyncEngineCount field from cudaDeviceProp tells you how many concurrent copy engines the device supports. A value of 2 means you can overlap H2D and D2H simultaneously.

Stream Fundamentals

Creation and Destruction

A CUDA stream is a sequence of operations that execute in issue order on the device. You create one with cudaStreamCreate:

cudaStream_t stream;
cudaStreamCreate(&stream);

// ... use the stream ...

cudaStreamDestroy(stream);

There is also cudaStreamCreateWithFlags, which accepts cudaStreamDefault or cudaStreamNonBlocking. The cudaStreamNonBlocking flag is important: it means this stream does not synchronize implicitly with the default stream. Without it, any operation on the default stream acts as a barrier across all streams.

cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

Issuing Work on a Stream

Kernels accept a stream as the fourth parameter of the launch configuration:

kernel<<<grid, block, sharedMem, stream>>>(args...);

Async memory copies take the stream as the last argument:

cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream);

Ordering Guarantees

The fundamental rule is simple:

  • Within a single stream, operations execute in the order they were issued. Operation B will not begin until operation A has completed.
  • Across different streams, there is no ordering guarantee unless you insert explicit synchronization.

This means you can issue work on stream 0 and stream 1 in any interleaved order from the CPU side, and the GPU will execute them concurrently whenever the hardware permits.

Synchronization Primitives

CUDA provides several synchronization mechanisms:

// Block host until everything on every stream completes
cudaDeviceSynchronize();

// Block host until everything on this specific stream completes
cudaStreamSynchronize(stream);

// Record an event into a stream
cudaEvent_t event;
cudaEventCreate(&event);
cudaEventRecord(event, streamA);

// Make streamB wait until the event in streamA has been reached
cudaStreamWaitEvent(streamB, event, 0);

The cudaStreamWaitEvent call is the key building block for complex dependency graphs. It does not block the host — it inserts a dependency on the device side only. Stream B will not begin executing any subsequently enqueued work until the event recorded in stream A has completed.

ℹ️ Stream semantics and the legacy default stream

By default, the NULL stream (stream 0) has implicit synchronization semantics: launching work on stream 0 waits for all prior work on all streams, and all other streams wait for work on stream 0 to complete. This can inadvertently serialize your pipeline if any call accidentally targets stream 0. Compile with --default-stream per-thread or use cudaStreamNonBlocking to avoid this trap.

Compute-Transfer Overlap: The Core Technique

The idea is straightforward: while the GPU computes on batch N, copy the data for batch N+1 to the device, and copy the results from batch N-1 back to the host. This forms a three-stage pipeline.

Pinned Memory: The Non-Negotiable Prerequisite

Standard host memory allocated with malloc or new is pageable — the OS can swap it out at any time. When you call cudaMemcpyAsync with pageable memory, the CUDA runtime cannot guarantee that the data will be in physical RAM when the DMA engine needs it. So the runtime falls back to a synchronous, staged copy: it first copies your data into an internal pinned buffer, then DMAs from there. This staging serializes the transfer with the host, defeating the purpose of async copies.

Pinned (page-locked) memory is allocated with cudaHostAlloc (or cudaMallocHost, which is equivalent):

float* h_pinned;
cudaHostAlloc(&h_pinned, size, cudaHostAllocDefault);

// Alternative: pin existing memory
float* h_existing = (float*)malloc(size);
cudaHostRegister(h_existing, size, cudaHostRegisterDefault);

Pinned memory has a fixed physical address that the DMA engine can access directly, enabling true asynchronous transfers. The trade-off: pinned memory reduces the amount of pageable memory available to the OS, and allocating too much can cause system-wide performance degradation.

⚠️ Do not pin all your memory

Pinned memory is a scarce system resource. Allocating tens of gigabytes of pinned memory can starve the OS page cache and slow down unrelated processes. A good rule of thumb is to pin only the buffers that sit on the hot path of your pipeline — the ones you copy every iteration.

The Timeline: Without Overlap vs. With Overlap

Consider processing 4 tiles of data. Without overlap, the timeline is purely sequential:

Stream 0: [H2D-0][Kernel-0][D2H-0][H2D-1][Kernel-1][D2H-1][H2D-2][Kernel-2][D2H-2][H2D-3][Kernel-3][D2H-3]

With two streams and double buffering:

Stream 0: [H2D-0][  Kernel-0  ][D2H-0]       [H2D-2][  Kernel-2  ][D2H-2]
Stream 1:        [H2D-1][  Kernel-1  ][D2H-1]        [H2D-3][  Kernel-3  ][D2H-3]
Copy Eng:  H2D-0  H2D-1              D2H-0 H2D-2     H2D-3               D2H-2
SMs:              Kernel-0   Kernel-1        Kernel-2   Kernel-3

In the overlapped version, while Kernel-0 runs on the SMs, H2D-1 runs on the copy engine. The kernel and the copy use different hardware, so they truly execute simultaneously.

Theoretical Speedup

If each tile has transfer time T_c (for both H2D and D2H combined) and compute time T_k, the serial time for N tiles is:

T_{\text{serial}} = N \cdot (T_c + T_k)

With perfect overlap (assuming T_k ≥ T_c), the overlapped time is approximately:

T_{\text{overlap}} \approx T_c + N \cdot T_k

The speedup approaches:

\text{Speedup} = \frac{T_c + T_k}{T_k} = 1 + \frac{T_c}{T_k}

When T_c ≈ T_k, you get close to a 2x speedup. When T_k ≫ T_c, the copies are already hidden and streams add negligible benefit. When T_c ≫ T_k, you are transfer-bound and no amount of overlap hides the copy time.

Serialized vs. overlapped pipeline (4 tiles, stage times 1.2 / 2.0 / 1.0 ms):

Serialized:  4 x (1.2 + 2.0 + 1.0) = 16.8 ms
Overlapped:  startup + 4 x max(1.2, 2.0, 1.0) ≈ 9.2 ms

The Double-Buffered Implementation

Here is the classic two-stream, double-buffered pipeline:

const int NUM_STREAMS = 2;
cudaStream_t streams[NUM_STREAMS];
float *h_in[NUM_STREAMS], *h_out[NUM_STREAMS];
float *d_in[NUM_STREAMS], *d_out[NUM_STREAMS];

for (int i = 0; i < NUM_STREAMS; i++) {
    cudaStreamCreate(&streams[i]);
    cudaHostAlloc(&h_in[i],  tileBytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_out[i], tileBytes, cudaHostAllocDefault);
    cudaMalloc(&d_in[i],  tileBytes);
    cudaMalloc(&d_out[i], tileBytes);
}

for (int tile = 0; tile < numTiles; tile++) {
    int s = tile % NUM_STREAMS;

    // Wait for previous use of this buffer to complete
    cudaStreamSynchronize(streams[s]);

    // Fill host input buffer for this tile
    prepareTile(h_in[s], tile);

    // Async H2D
    cudaMemcpyAsync(d_in[s], h_in[s], tileBytes,
                    cudaMemcpyHostToDevice, streams[s]);

    // Launch kernel
    processKernel<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s]);

    // Async D2H
    cudaMemcpyAsync(h_out[s], d_out[s], tileBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}

cudaDeviceSynchronize();

The cudaStreamSynchronize at the top of each iteration ensures we do not overwrite a host buffer that the previous iteration’s D2H copy is still reading from. This is a common pattern that trades a small amount of serialization for correctness.

PCIe vs. NVLink: Interconnect Bandwidth

The bandwidth of the interconnect between host and device determines how much time your copies take, and therefore how much benefit overlap can deliver. The gap between interconnect technologies is enormous:

📊

Interconnect bandwidth comparison

Interconnect      | Theoretical BW | Typical Effective BW | Latency
PCIe Gen3 x16     | 16 GB/s        | ~12 GB/s             | ~1-2 us
PCIe Gen4 x16     | 32 GB/s        | ~25 GB/s             | ~1-2 us
PCIe Gen5 x16     | 64 GB/s        | ~50 GB/s             | ~1-2 us
NVLink 3.0 (A100) | 600 GB/s total | ~500 GB/s            | ~0.7 us
NVLink 4.0 (H100) | 900 GB/s total | ~800 GB/s            | ~0.5 us

Note: NVLink bandwidth is aggregate bidirectional. PCIe bandwidth is per direction. Effective BW depends on transfer size, system configuration, and NUMA topology.

The Bandwidth Gap and Its Implications

At PCIe Gen4 speeds, transferring 1 GB of data takes roughly 40 ms. A typical training iteration kernel on an A100 might take 10-50 ms depending on the model. These times are in the same ballpark, which means overlap delivers significant benefit.

At NVLink 4.0 speeds between GPUs (GPU-to-GPU, not host-to-GPU), transferring 1 GB takes roughly 1.25 ms. If the same kernel takes 30 ms, the transfer is already less than 5% of the compute time. Overlap still helps, but the marginal gain is far smaller.

This leads to an important heuristic:

\text{Overlap benefit} \propto \frac{T_{\text{transfer}}}{T_{\text{compute}}}

When transfers are a large fraction of compute, overlap is critical. When transfers are negligible compared to compute, the engineering effort of managing multiple streams may not be justified.

Transfer time for 1 GB across interconnects:

PCIe Gen3:  ~83 ms
PCIe Gen4:  ~40 ms
PCIe Gen5:  ~20 ms
NVLink 4.0: ~1.25 ms

NUMA Considerations

On multi-socket systems, PCIe bandwidth further depends on which CPU socket the GPU is attached to. A GPU on socket 0 accessed from a thread running on socket 1 must traverse the inter-socket link (UPI or Infinity Fabric), which can halve effective bandwidth. Always pin your CPU threads to the NUMA node closest to the target GPU, for example with numactl --cpunodebind=N on Linux, after selecting the GPU with cudaSetDevice.

Multi-Stream Patterns

Pattern 1: Copy-Compute-Copy Pipeline

The simplest and most common pattern. Each tile goes through three stages, and different tiles occupy different stages simultaneously:

Time --->
Stream 0: [H2D-0] [Compute-0] [D2H-0]         [H2D-2] [Compute-2] [D2H-2]
Stream 1:         [H2D-1] [Compute-1] [D2H-1]          [H2D-3] [Compute-3] [D2H-3]

This is the pattern shown in the double-buffering code above. Two streams suffice to fill the pipeline when each stage takes approximately the same time.

Pattern 2: Triple Buffering

When stage durations are unequal — say, the kernel takes much longer than the copies — three streams can keep the pipeline fuller:

const int NUM_STREAMS = 3;
cudaStream_t streams[NUM_STREAMS];
// ... allocate 3 sets of buffers ...

for (int tile = 0; tile < numTiles; tile++) {
    int s = tile % NUM_STREAMS;
    cudaStreamSynchronize(streams[s]);
    prepareTile(h_in[s], tile);
    cudaMemcpyAsync(d_in[s], h_in[s], tileBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    processKernel<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s]);
    cudaMemcpyAsync(h_out[s], d_out[s], tileBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}

With three streams, you can have one tile in the H2D phase, one in compute, and one in D2H simultaneously, maximizing hardware utilization when the three stages have different durations.

Pattern 3: Multi-GPU Fan-Out

In multi-GPU setups, each GPU gets its own stream (or set of streams), and the host fans out work:

for (int gpu = 0; gpu < numGPUs; gpu++) {
    cudaSetDevice(gpu);
    cudaMemcpyAsync(d_in[gpu], h_in + gpu * chunkSize, chunkBytes,
                    cudaMemcpyHostToDevice, streams[gpu]);
    processKernel<<<grid, block, 0, streams[gpu]>>>(d_in[gpu], d_out[gpu]);
    cudaMemcpyAsync(h_out + gpu * chunkSize, d_out[gpu], chunkBytes,
                    cudaMemcpyDeviceToHost, streams[gpu]);
}

for (int gpu = 0; gpu < numGPUs; gpu++) {
    cudaSetDevice(gpu);
    cudaStreamSynchronize(streams[gpu]);
}

Each GPU has its own PCIe link, so transfers to different GPUs are inherently parallel. The fan-out pattern scales nearly linearly with the number of GPUs when the workload is data-parallel.

Stream Dependency Graphs with Events

Real pipelines are rarely simple linear chains. You may need to express complex dependencies: “stream 2 should not start its kernel until stream 0 and stream 1 have both finished their H2D copies.” CUDA events let you build arbitrary DAGs:

cudaEvent_t h2d_done[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++)
    cudaEventCreate(&h2d_done[i]);

// Stream 0: H2D for partition A
cudaMemcpyAsync(d_A, h_A, sizeA, cudaMemcpyHostToDevice, stream0);
cudaEventRecord(h2d_done[0], stream0);

// Stream 1: H2D for partition B
cudaMemcpyAsync(d_B, h_B, sizeB, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(h2d_done[1], stream1);

// Stream 2: kernel that needs both A and B
cudaStreamWaitEvent(stream2, h2d_done[0], 0);
cudaStreamWaitEvent(stream2, h2d_done[1], 0);
fusedKernel<<<grid, block, 0, stream2>>>(d_A, d_B, d_C);

This pattern is common in deep learning frameworks where a single layer may consume tensors that arrived via different transfer streams.

💡 Keep your dependency graph minimal

Every cudaStreamWaitEvent is a potential serialization point. Only insert dependencies where correctness requires them. Unnecessary dependencies turn your “parallel” pipeline back into a serial one. Drawing the dependency graph on paper before writing code helps catch redundant edges.

CUDA Events for Timing

Basic Event Timing

CUDA events record timestamps on the GPU timeline, giving you accurate device-side measurements unaffected by host-side scheduling jitter:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
myKernel<<<grid, block, 0, stream>>>(args);
cudaEventRecord(stop, stream);

// Block the host until 'stop' is reached
cudaEventSynchronize(stop);

float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel took %.3f ms\n", milliseconds);

The resolution of CUDA events is approximately 0.5 microseconds on modern GPUs.

Measuring Overlap

To verify that overlap is happening, time the entire pipeline both with and without streams:

// Measure serial version
cudaEventRecord(startSerial, 0);
for (int tile = 0; tile < numTiles; tile++) {
    cudaMemcpy(d_in, h_in[tile], tileBytes, cudaMemcpyHostToDevice);
    kernel<<<grid, block>>>(d_in, d_out);
    cudaMemcpy(h_out[tile], d_out, tileBytes, cudaMemcpyDeviceToHost);
}
cudaEventRecord(stopSerial, 0);
cudaEventSynchronize(stopSerial);
cudaEventElapsedTime(&serialTime, startSerial, stopSerial);

// Measure overlapped version
cudaEventRecord(startOverlap, 0);
for (int tile = 0; tile < numTiles; tile++) {
    int s = tile % NUM_STREAMS;
    cudaMemcpyAsync(d_in[s], h_in[tile], tileBytes,
                    cudaMemcpyHostToDevice, streams[s]);
    kernel<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s]);
    cudaMemcpyAsync(h_out[tile], d_out[s], tileBytes,
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaEventRecord(stopOverlap, 0);
cudaEventSynchronize(stopOverlap);
cudaEventElapsedTime(&overlapTime, startOverlap, stopOverlap);

printf("Serial: %.2f ms, Overlapped: %.2f ms, Speedup: %.2fx\n",
       serialTime, overlapTime, serialTime / overlapTime);
⚠️ Common measurement trap

If you time from the CPU using clock() or std::chrono without calling cudaDeviceSynchronize() first, you measure the time to enqueue work, not the time for it to execute. CPU-side timing of async operations will appear near-zero. Always synchronize before reading the clock, or use CUDA events for device-side timing.

Per-Stream Timing

You can record events in individual streams to measure the duration of specific stages:

cudaEvent_t h2d_start, h2d_stop, kern_start, kern_stop, d2h_start, d2h_stop;
// ... create all events ...

cudaEventRecord(h2d_start, stream);
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
cudaEventRecord(h2d_stop, stream);

cudaEventRecord(kern_start, stream);
kernel<<<grid, block, 0, stream>>>(d_in, d_out);
cudaEventRecord(kern_stop, stream);

cudaEventRecord(d2h_start, stream);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
cudaEventRecord(d2h_stop, stream);

cudaStreamSynchronize(stream);

float h2d_ms, kern_ms, d2h_ms;
cudaEventElapsedTime(&h2d_ms, h2d_start, h2d_stop);
cudaEventElapsedTime(&kern_ms, kern_start, kern_stop);
cudaEventElapsedTime(&d2h_ms, d2h_start, d2h_stop);

This breakdown tells you which stage dominates and therefore where overlap has the most potential.

📊

Example per-stage timing (A100, 256 MB tile)

Stage            | Duration (ms) | Hardware Unit  | Overlap Candidate?
H2D Copy         | 10.2          | Copy Engine 0  | Yes -- runs parallel to kernels
Kernel           | 18.5          | SMs            | Yes -- runs parallel to copies
D2H Copy         | 10.2          | Copy Engine 1  | Yes -- runs parallel to kernels
Serial Total     | 38.9          | One at a time  | Baseline
Overlapped Total | 19.8          | All concurrent | 1.96x speedup

Note: Measured on A100 80GB SXM, PCIe Gen4, pinned host memory. Kernel is a compute-bound matrix operation.

NCCL and Streams: Overlapping Communication with Computation

In distributed training with multiple GPUs, the NVIDIA Collective Communications Library (NCCL) handles allreduce, allgather, reduce-scatter, and other collective operations. Understanding how NCCL interacts with CUDA streams is essential for achieving good overlap between communication and computation.

NCCL Runs on Its Own Stream

Every NCCL communicator can be associated with a CUDA stream. When you call ncclAllReduce, the operation is enqueued on that stream and executes asynchronously:

ncclComm_t comm;
cudaStream_t nccl_stream, compute_stream;
cudaStreamCreate(&nccl_stream);
cudaStreamCreate(&compute_stream);

// Compute gradients for layer N on compute_stream
backwardLayerN<<<grid, block, 0, compute_stream>>>(grads_N);

// Once gradients are ready, allreduce them on nccl_stream
cudaEvent_t grads_ready;
cudaEventCreate(&grads_ready);
cudaEventRecord(grads_ready, compute_stream);
cudaStreamWaitEvent(nccl_stream, grads_ready, 0);

ncclAllReduce(grads_N, grads_N, count, ncclFloat,
              ncclSum, comm, nccl_stream);

// Meanwhile, compute gradients for layer N-1 on compute_stream
backwardLayerN_minus_1<<<grid, block, 0, compute_stream>>>(grads_N_minus_1);

The key insight: while ncclAllReduce for layer N runs on nccl_stream, the backward pass for layer N-1 runs on compute_stream. Since these are different streams targeting different hardware (NVLink/network for NCCL, SMs for the backward kernel), they can overlap.

The SM Contention Problem

Here is the catch: NCCL collectives are not pure network operations. They run GPU kernels internally — kernels that read from and write to GPU memory, and that use SMs to orchestrate the communication. If your compute kernel occupies all available SMs, NCCL’s kernels cannot launch until the compute kernel finishes, even though they are on a different stream.

This is the SM contention problem, and it is one of the most common reasons that communication-computation overlap fails in practice.

🚨 SM contention kills overlap

If your backward pass kernel is configured to use all available SMs (for example, by launching with a grid size equal to the total SM count), NCCL communication kernels will be queued behind it, even on a separate stream. The CUDA work scheduler cannot preempt a running kernel to make room for another. To enable overlap, you must leave some SMs free — either by reducing the grid size of your compute kernel or by using CUDA MPS (Multi-Process Service) to partition the GPU.

Strategies for Enabling NCCL Overlap

1. Reduce compute kernel grid size. Launch your backward kernels with fewer blocks than the total SM count. For example, on an A100 with 108 SMs, launching with 80 blocks leaves 28 SMs available for NCCL kernels.

2. Use CUDA stream priorities. Create the NCCL stream with high priority so that its kernels are scheduled preferentially:

int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

cudaStream_t nccl_stream;
cudaStreamCreateWithPriority(&nccl_stream,
                             cudaStreamNonBlocking,
                             greatestPriority);

3. Use per-layer overlap. Modern frameworks like PyTorch's DistributedDataParallel (DDP) and FSDP bucket gradients by layer and overlap the allreduce of layer N's gradients with the backward pass of layer N-1. This works because the gradients for layer N are fully computed before the backward pass moves to layer N-1.

4. Use CUDA MPS for multi-tenant scenarios. MPS allows multiple processes to share a GPU with partitioned SM resources, ensuring each process (including NCCL) gets a guaranteed slice of SMs.

📊

Communication-compute overlap effectiveness

Configuration             | Backward Time | AllReduce Time | Wall Time | Overlap %
No overlap (serial)       | 30 ms         | 15 ms          | 45 ms     | 0%
Separate streams, full SM | 30 ms         | 15 ms          | 42 ms     | ~7%
Separate streams, 75% SM  | 33 ms         | 15 ms          | 35 ms     | ~67%
Per-layer overlap, 75% SM | 33 ms         | 15 ms          | 33 ms     | ~100%

Note: Illustrative numbers for a ResNet-50 backward pass on 4xA100 with NVLink. Leaving 25% of SMs free adds ~10% to backward time but saves ~27% wall time through overlap.

Nsight Systems Profiling: Verifying Overlap

Claiming overlap based on code structure alone is insufficient. You must verify it with a profiling tool. NVIDIA Nsight Systems is the standard tool for visualizing the GPU timeline and identifying serialization.

Collecting a Profile

nsys profile --trace=cuda,nvtx,osrt \
             --output=my_profile \
             --force-overwrite=true \
             ./my_cuda_app

This produces a .nsys-rep file that you open in the Nsight Systems GUI. The timeline shows:

  • CUDA API row: host-side API calls (cudaMemcpyAsync, cudaLaunchKernel, etc.)
  • CUDA HW row: actual device-side execution, broken into sub-rows for each stream
  • Memory copy row: H2D and D2H transfers, with bandwidth information
  • Kernel row: kernel executions on SMs, with occupancy and duration

What to Look For

Good overlap looks like stacked bars: a kernel bar on the SM row runs simultaneously with a memcpy bar on the copy engine row. The bars overlap in time.

Bad serialization looks like sequential bars: every operation finishes before the next begins, even though they are on different streams. Common causes:

  1. Pageable memory — the runtime serialized your async copies.
  2. Default stream contamination — a stray API call on stream 0 inserted an implicit barrier.
  3. Host-side blocking — a cudaStreamSynchronize or cudaDeviceSynchronize in the middle of your pipeline.
  4. Kernel filling all SMs — the copy engine is technically running, but at reduced bandwidth because the memory controller is saturated by the kernel.

Measuring Overlap Ratio

The overlap ratio quantifies how much concurrent execution you achieved:

\text{Overlap Ratio} = 1 - \frac{T_{\text{wall}}}{T_{\text{serial}}}

Where T_wall is the actual wall-clock time of the overlapped pipeline, and T_serial is the sum of all individual operation durations. An overlap ratio of 0 means no overlap. An overlap ratio of 0.5 means half the potential parallelism was realized.

In Nsight Systems, you can measure this by selecting the range of interest on the timeline and comparing the wall-clock extent against the sum of individual kernel and memcpy durations.

Identifying Synchronization Bottlenecks

Look for gaps in the timeline — periods where neither SMs nor copy engines are active. These gaps indicate synchronization points where the GPU is waiting. Common culprits:

  • cudaStreamSynchronize called too early
  • cudaDeviceSynchronize in a loop
  • An event wait that blocks an entire stream unnecessarily
  • Host-side computation between GPU launches (the host cannot enqueue the next kernel fast enough)
💡 Use NVTX markers for clarity

Annotate your code with NVTX ranges to make the timeline readable:

#include <nvToolsExt.h>
nvtxRangePush("H2D batch 3");
cudaMemcpyAsync(...);
nvtxRangePop();

These markers appear as colored bars on the CPU timeline, making it trivial to correlate host-side logic with device-side execution.

Advanced: The Nsight Systems CLI for Automated Analysis

For CI pipelines, you can extract metrics programmatically:

nsys stats --report cuda_gpu_trace my_profile.nsys-rep

This dumps a table of every GPU operation with start time, duration, stream ID, and correlation ID. You can script overlap analysis by looking for time-overlapping entries on different streams.

When Streams Don’t Help

Streams are not a universal accelerator. There are well-defined scenarios where adding streams provides zero benefit or even degrades performance.

A Single Large Kernel That Fills All SMs

If your workload is a single monolithic kernel that occupies every SM on the GPU for its entire duration, there is no room for concurrent execution. A second kernel on a different stream would simply queue behind the first. Similarly, the copy engines may run but at reduced bandwidth because the memory controller is contended.

This is common with:

  • Large matrix multiplications (cuBLAS GEMM on large dimensions)
  • Kernels launched with grid sizes equal to or exceeding the SM count with high occupancy
  • Persistent kernels that loop internally

In these cases, the optimization path is not “add more streams” but rather “split the kernel into smaller chunks” or “reduce the grid size to leave room for other work.”

Insufficient Parallelism

If your application processes a single small batch of data, there may simply not be enough independent work to fill multiple streams. Two streams each processing half the data might each underutilize the GPU, yielding worse performance than a single stream processing the full batch (due to kernel launch overhead and reduced occupancy per kernel).

The break-even point depends on your specific kernel, but as a rough guide: if your kernel takes less than 50 microseconds, the overhead of managing multiple streams (creation, synchronization, event recording) can exceed the benefit.

Explicit Synchronization Points

If your algorithm requires a global barrier between steps — for example, step $N+1$ reads the output of step $N$ across all tiles — then streams cannot overlap consecutive steps. The dependency is fundamental, not an artifact of the programming model.

Common examples:

  • Iterative solvers where each iteration reads the previous iteration’s output
  • Reduction operations that produce a single scalar consumed by the next phase
  • Host-side decision logic (branching based on a GPU-computed value)

Transfer-Bound Workloads with Short Kernels

If $T_{\text{H2D}} + T_{\text{D2H}} \gg T_{\text{kernel}}$, overlap can hide the kernel inside the transfer, but the total time is still dominated by the transfer. Adding streams does not reduce transfer time — PCIe bandwidth is fixed. The right solution here is to reduce the data transferred (compression, quantization, sparse representations) or to move the data less frequently.

Host-Side Bottleneck

If the CPU cannot enqueue work fast enough, the GPU starves regardless of how many streams are available. This happens when:

  • Host-side preprocessing between kernel launches is expensive
  • The CUDA driver overhead per launch is significant (many tiny kernels)
  • Python-based frameworks have GIL contention or interpreter overhead between CUDA calls
📊 When streams help vs. when they do not

| Scenario | Stream Benefit | Better Alternative |
|---|---|---|
| Large GEMM filling all SMs | None | Split into sub-problems or accept serial execution |
| Balanced copy + compute pipeline | Up to 2x | Use 2-3 streams with double buffering |
| Transfer-bound (tiny kernels) | Minimal | Reduce data movement, use compression |
| Iterative solver with global sync | None per iteration | Overlap across outer loop if possible |
| Multi-GPU data parallel | Near-linear scaling | One stream set per GPU; overlap H2D/compute/D2H |
| DNN training backward + allreduce | 20-40% wall-time savings | Per-layer overlap with NCCL on a high-priority stream |

Putting It All Together: A Production Checklist

Here is the step-by-step process for adding stream-based overlap to an existing CUDA application:

Step 1: Profile the baseline. Use Nsight Systems to measure the serial timeline. Identify the three durations: $T_{\text{H2D}}$, $T_{\text{kernel}}$, and $T_{\text{D2H}}$. If one dominates overwhelmingly, overlap of the others yields minimal benefit — focus on the dominant term first.

Step 2: Pin your host memory. Replace malloc with cudaHostAlloc for any buffer involved in async copies. Verify with the profiler that cudaMemcpyAsync appears as a true async operation (no synchronous staging).
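A minimal sketch of this step (buffer names and sizes are placeholders):

```cuda
float* h_in = nullptr;
// Page-locked (pinned) allocation: a prerequisite for truly async copies.
cudaHostAlloc(&h_in, bytes, cudaHostAllocDefault);
// (cudaMallocHost(&h_in, bytes) is equivalent.)

// ... fill h_in on the host ...

// With pinned memory, this returns immediately and the DMA engine
// performs the copy; with pageable memory it would silently block.
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);

// Pinned memory must be released with cudaFreeHost, not free().
cudaFreeHost(h_in);
```

Pin only the buffers that participate in async copies — pinned memory is a scarce resource, and over-pinning degrades overall system performance.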

Step 3: Chunk your workload. Divide the input into tiles small enough that the kernel for one tile leaves room for a copy of another tile. Too few tiles means not enough iterations to fill the pipeline. Too many tiles means excessive launch overhead.

Step 4: Create streams and double-buffer. Start with 2 streams and 2 sets of device buffers. Issue the copy-compute-copy sequence for each tile on alternating streams.
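A sketch of the double-buffered loop, assuming the host buffers are pinned and the tiles divide evenly (`processTile`, the tile sizes, and the buffer names are all placeholders):

```cuda
// Double-buffered pipeline: tile t's copy-in, kernel, and copy-out all go
// on stream t % 2, so the hardware can overlap tile t's kernel with the
// transfers of the neighboring tiles on the other stream.
const int NBUF = 2;
cudaStream_t streams[NBUF];
float *d_in[NBUF], *d_out[NBUF];
for (int i = 0; i < NBUF; i++) {
    cudaStreamCreate(&streams[i]);
    cudaMalloc(&d_in[i], tileBytes);
    cudaMalloc(&d_out[i], tileBytes);
}

for (int t = 0; t < numTiles; t++) {
    int b = t % NBUF;  // alternate buffer/stream pairs
    cudaStream_t s = streams[b];
    cudaMemcpyAsync(d_in[b], h_in + t * tileElems, tileBytes,
                    cudaMemcpyHostToDevice, s);
    processTile<<<grid, block, 0, s>>>(d_in[b], d_out[b], tileElems);
    cudaMemcpyAsync(h_out + t * tileElems, d_out[b], tileBytes,
                    cudaMemcpyDeviceToHost, s);
    // Reusing buffer b for tile t+2 is safe: operations on the same
    // stream serialize, so the earlier D2H completes before the new H2D.
}
for (int i = 0; i < NBUF; i++) cudaStreamSynchronize(streams[i]);
```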

Step 5: Profile again. Open the Nsight Systems timeline. Verify that kernel bars and memcpy bars overlap in time. Compute the overlap ratio. If overlap is poor, check for the failure modes described above.

Step 6: Tune the number of streams. Try 3 or 4 streams. More than 4 rarely helps because you run out of hardware parallelism (most GPUs expose only a handful of copy engines — often one per transfer direction — and a single SM array). Measure wall-clock time for each configuration.

Step 7: Add event-based dependencies if needed. If your pipeline has complex data dependencies (e.g., kernel on stream 2 needs results from streams 0 and 1), use cudaStreamWaitEvent to express them without unnecessary global synchronization.
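For instance, a two-producer join can be expressed like this (the kernels, buffers, and streams are hypothetical):

```cuda
// A combine kernel on stream2 consumes results produced on stream0 and stream1.
cudaEvent_t done0, done1;
// Timing is not needed for pure dependency events; disabling it is cheaper.
cudaEventCreateWithFlags(&done0, cudaEventDisableTiming);
cudaEventCreateWithFlags(&done1, cudaEventDisableTiming);

producerA<<<grid, block, 0, stream0>>>(d_a);
cudaEventRecord(done0, stream0);
producerB<<<grid, block, 0, stream1>>>(d_b);
cudaEventRecord(done1, stream1);

// stream2 waits on both events on the device side, without blocking the
// host thread or serializing stream0 against stream1.
cudaStreamWaitEvent(stream2, done0, 0);
cudaStreamWaitEvent(stream2, done1, 0);
combine<<<grid, block, 0, stream2>>>(d_a, d_b, d_c);
```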

Step 8: Integrate with NCCL (for distributed training). Put NCCL operations on a separate high-priority stream. Use events to express the “gradients ready” dependency. Leave 20-30% of SMs free for NCCL kernels by adjusting your compute kernel grid size.

Optimization progression: serial to fully overlapped

  • Serial baseline (H2D + kernel + D2H): 45 ms
  • + Pinned memory (faster copies but still serial): 44 ms
  • + 2 streams (compute-transfer overlap): 28 ms
  • + 3 streams (better pipeline filling): 24 ms
  • + NCCL overlap (communication hidden behind compute): 19 ms

Advanced Topics

Stream Callbacks

CUDA allows you to register a host-side callback function that fires when all preceding work in a stream completes:

void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void* userData) {
    printf("Stream %p finished, status: %d\n", (void*)stream, status);
    // Signal the host, trigger the next pipeline stage, etc.
}

cudaStreamAddCallback(stream, myCallback, nullptr, 0);

Callbacks execute on a CUDA runtime thread, not the thread that enqueued the work, and they must not make any CUDA API calls. They are useful for signaling completion to other parts of a host-side pipeline without polling.
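Note that cudaStreamAddCallback is deprecated in recent CUDA releases; cudaLaunchHostFunc (CUDA 10+) is the preferred replacement and takes a simpler function signature. A minimal sketch (the stream and stage id are hypothetical):

```cuda
// Modern replacement for cudaStreamAddCallback. Host functions, like
// callbacks, must not call CUDA APIs from their body.
void CUDART_CB onPipelineStageDone(void* userData) {
    int stage = *(int*)userData;
    printf("Pipeline stage %d finished\n", stage);
}

static int stage = 3;  // hypothetical stage id; must outlive the enqueued call
cudaLaunchHostFunc(stream, onPipelineStageDone, &stage);
```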

Stream Capture and CUDA Graphs

Starting with CUDA 10, you can capture a sequence of stream operations into a CUDA Graph, then replay that graph with a single launch call. This eliminates per-operation launch overhead and is especially beneficial for pipelines with many small kernels:

cudaGraph_t graph;
cudaGraphExec_t graphExec;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// Issue the pipeline operations normally
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
kernel<<<grid, block, 0, stream>>>(d_in, d_out);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Replay the captured graph multiple times
for (int i = 0; i < numIterations; i++) {
    cudaGraphLaunch(graphExec, stream);
}
cudaStreamSynchronize(stream);

Graphs preserve the stream-based concurrency of the captured operations while eliminating the CPU overhead of re-issuing them each iteration. For repetitive pipelines (e.g., inference on fixed-size batches), this can reduce launch overhead by 10-50x.

Cooperative Groups and Stream Interaction

CUDA Cooperative Groups (introduced in CUDA 9) allow kernels to synchronize across thread blocks within a grid. When using cooperative launches (cudaLaunchCooperativeKernel), the kernel occupies a fixed number of SMs for its entire duration and synchronizes internally. This interacts with streams in an important way: a cooperative kernel cannot be preempted, and no other kernel can run concurrently on the SMs it occupies. Plan your stream-based overlap accordingly.

Conclusion

CUDA streams are the fundamental mechanism for extracting pipeline parallelism from GPU hardware. They let you overlap data transfers with computation, overlap multiple independent kernels, and overlap communication with computation in distributed settings. But they are a low-level scheduling tool, not a magic performance switch.

The key principles:

  • Commands in the same stream execute in order. Commands in different streams may execute concurrently.
  • Async transfers require pinned host memory. Without it, the runtime silently falls back to synchronous copies.
  • The benefit of overlap scales with how balanced transfer and compute are: you can hide at most $\min(T_{\text{transfer}}, T_{\text{compute}})$ of the serial total. PCIe-connected GPUs benefit greatly; NVLink-connected GPUs may already have negligible transfer overhead.
  • NCCL communication can overlap with computation, but only if enough SMs are free for NCCL’s internal kernels.
  • Nsight Systems is the ground truth. Never trust overlap claims that have not been verified on a profiler timeline.
  • Streams cannot help when a single kernel fills all SMs, when explicit synchronization creates fundamental barriers, or when the host cannot enqueue work fast enough.

Start with a profiled baseline, pin your memory, chunk your workload, create two streams, and measure again. The gap between the serial total and the overlapped wall time tells you exactly how much concurrency you achieved — and how much remains on the table.