A recursive octree traversal implemented with host-side launches requires one CPU-GPU round-trip per tree level — roughly 8 microseconds per level, 80 microseconds for a depth-10 tree. Implement the same algorithm with Dynamic Parallelism, where the GPU launches child kernels directly from parent kernels without CPU involvement, and total latency drops to about 50 microseconds. The roughly 40% speedup comes from eliminating CPU-GPU synchronization points. But Dynamic Parallelism is not free: each device-side launch costs 20-50 microseconds versus roughly 8 microseconds for a host-side launch. It wins only when the alternative involves many round-trips, or when the work decomposition is data-dependent and cannot be precomputed on the host.
All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise.
Basic Device-Side Kernel Launch
#include <cuda_runtime.h>
#include <cstdio>
// Child kernel — runs on the GPU, launched by parent kernel
__global__ void child_kernel(float* data, int start, int count) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < count) {
data[start + idx] *= 2.0f;
}
}
// Parent kernel — launches child kernels from the GPU
__global__ void parent_kernel(float* data, int n, int num_chunks) {
int chunk_id = threadIdx.x + blockIdx.x * blockDim.x;
if (chunk_id >= num_chunks) return;
int chunk_size = n / num_chunks;
int start = chunk_id * chunk_size;
int count = (chunk_id == num_chunks - 1) ? (n - start) : chunk_size;
// Launch child kernel from device code
int block_size = 256;
int grid_size = (count + block_size - 1) / block_size;
child_kernel<<<grid_size, block_size>>>(data, start, count);
}
int main() {
int n = 1 << 20;
float* d_data;
cudaMalloc(&d_data, n * sizeof(float));
// Launch parent with 1 block of 16 threads -> 16 child kernels
parent_kernel<<<1, 16>>>(d_data, n, 16);
cudaDeviceSynchronize();
cudaFree(d_data);
return 0;
}
Compilation Requirements
# Dynamic parallelism requires:
# Compute capability >= 3.5 (sm_35+)
# Relocatable device code (-rdc=true)
# Link with cudadevrt library (-lcudadevrt)
nvcc -arch=sm_80 -rdc=true kernel.cu -o kernel -lcudadevrt
# Without -rdc=true, the compiler will reject device-side <<<>>> launches
Synchronization Semantics
Child kernels launched by a parent are asynchronous by default. The parent thread continues executing while child grids are scheduled. Explicit synchronization is required to ensure child completion:
__global__ void parent_with_sync(float* data, float* results, int n) {
int tid = threadIdx.x;
// Phase 1: Launch children
int chunk_size = n / blockDim.x;
int start = tid * chunk_size;
int count = chunk_size;
child_kernel<<<(count + 255) / 256, 256>>>(data, start, count);
// Phase 2: Wait for ALL child kernels launched by this block
// Device-side cudaDeviceSynchronize() waits for children launched by
// EVERY thread in the calling block, not just the calling thread
cudaDeviceSynchronize();
// Phase 3: Now safe to read results written by children
if (tid == 0) {
float sum = 0.0f;
for (int i = 0; i < n; i++) sum += data[i];
results[blockIdx.x] = sum;
}
}
cudaDeviceSynchronize() called from device code synchronizes ALL child grids launched by the calling thread block, not just those launched by the calling thread. If thread 0 launches kernel A and thread 1 launches kernel B, a cudaDeviceSynchronize() from any thread in the block waits for both A and B. There is no per-thread synchronization for child kernels. Note that device-side cudaDeviceSynchronize() was deprecated in CUDA 11.6 and removed from the CUDA 12 device runtime; on newer toolkits, waiting on children is expressed through tail launches or the implicit synchronization at parent exit, both covered later in this section.
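A minimal sketch of this block-wide behavior, assuming a CDP1 toolchain (CUDA 11.x) where device-side cudaDeviceSynchronize() is still available; `tick` and `block_wide_sync_demo` are illustrative names, not part of any API:

```cuda
#include <cstdio>

__global__ void tick(int* counter) {
    atomicAdd(counter, 1);
}

__global__ void block_wide_sync_demo(int* counter) {
    // Two DIFFERENT threads launch two different child grids
    if (threadIdx.x == 0) tick<<<1, 1>>>(counter);
    if (threadIdx.x == 1) tick<<<1, 1>>>(counter);
    // A single sync from ANY thread in the block waits for BOTH children
    cudaDeviceSynchronize();
    if (threadIdx.x == 0) {
        printf("counter = %d\n", *counter); // Both increments are visible here
    }
}
```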
Implicit Synchronization
When a parent kernel completes (all parent threads exit), the CUDA runtime implicitly waits for all child and descendant kernels to complete before signaling the parent’s stream event to the host:
__global__ void parent_no_sync(float* data, int n) {
// Launch child
child_kernel<<<(n + 255) / 256, 256>>>(data, 0, n);
// Parent exits WITHOUT calling cudaDeviceSynchronize()
// The child WILL complete before the host sees the parent's stream event
// But the parent thread itself does NOT wait — it exits immediately
}
// Host code:
// parent_no_sync<<<1, 1>>>(d_data, n);
// cudaDeviceSynchronize(); // Waits for parent AND all descendants
Memory Visibility Rules
Memory visibility between parent and child follows specific rules:
__global__ void memory_visibility_demo(float* global_data, int n) {
// Rule 1: Global memory written by parent is visible to child
// AFTER the child launch (launch acts as a memory fence)
global_data[threadIdx.x] = (float)threadIdx.x;
// The <<<>>> launch implies a memory fence for global memory
child_kernel<<<1, 256>>>(global_data, 0, 256);
// Rule 2: Global memory written by child is visible to parent
// ONLY after cudaDeviceSynchronize()
cudaDeviceSynchronize();
float val = global_data[0]; // Now safe to read child's output
// Rule 3: Shared memory is NOT visible to children
__shared__ float smem[256];
smem[threadIdx.x] = 42.0f;
// child_kernel CANNOT access smem — it has its own shared memory
// Rule 4: Local variables (registers) are NOT shared
float local_val = 3.14f;
// child_kernel cannot see local_val
// Rule 5: Constant memory is visible (read-only, set before parent launch)
}
Device-Side malloc and Free
// Device-side memory allocation (from a pre-allocated device heap)
__global__ void device_malloc_demo(float** results, int n) {
// One allocation per block — assumes a single thread per block (e.g. <<<N, 1>>>)
if (threadIdx.x != 0) return;
// Allocate memory from the device heap
float* local_buffer = (float*)malloc(n * sizeof(float));
if (local_buffer == nullptr) {
// Device heap exhausted — increase it with cudaDeviceSetLimit
return;
}
// Use the buffer
for (int i = 0; i < n; i++) {
local_buffer[i] = (float)i;
}
// Pass to child
child_kernel<<<(n + 255) / 256, 256>>>(local_buffer, 0, n);
cudaDeviceSynchronize();
// Store result pointer
results[blockIdx.x] = local_buffer;
// WARNING: device-heap memory persists across kernel launches and can only
// be released by device-side free() — host cudaFree() cannot release it
// free(local_buffer); // If not needed after this kernel
}
// Configure device heap size before launching
void configure_heap() {
// Default device heap is 8 MB
size_t heap_size = 256 * 1024 * 1024; // 256 MB
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_size);
}
Device-side malloc() and free() use a global heap with a mutex-like serialization mechanism. Throughput is orders of magnitude worse than host-side cudaMalloc. Use device malloc only for small, infrequent allocations. For performance-critical code, pre-allocate all memory from the host.
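One common way to follow that advice is a host-pre-allocated slab carved up by a device-side atomic bump allocator; `pool_alloc` and the parameter names here are illustrative, not a library API:

```cuda
// Host allocates one large pool with cudaMalloc and zeroes pool_offset;
// device threads then claim sub-ranges with a single atomicAdd each.
__device__ float* pool_alloc(float* pool, unsigned long long* pool_offset,
                             unsigned long long pool_capacity, int count) {
    unsigned long long start = atomicAdd(pool_offset, (unsigned long long)count);
    if (start + count > pool_capacity) return nullptr; // Pool exhausted
    return pool + start;
}

__global__ void uses_pool(float* pool, unsigned long long* pool_offset,
                          unsigned long long pool_capacity, int n) {
    float* buf = pool_alloc(pool, pool_offset, pool_capacity, n);
    if (buf == nullptr) return; // Out of pool space — skip, don't crash
    for (int i = 0; i < n; i++) buf[i] = (float)i;
}
```

Resetting `pool_offset` to zero between kernel launches "frees" everything at once, which is exactly the cheap lifetime model most GPU workloads want.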
Performance Characteristics of Device-Side Launches
The overhead of launching a kernel from device code is substantially higher than from the host:
#include <cuda_runtime.h>
#include <cstdio>
__global__ void empty_child() {
// Does nothing — measures pure launch overhead
}
__global__ void measure_launch_overhead(float* timings, int num_launches) {
// clock64() counts SM cycles; divide by the SM clock rate to convert to time
long long start, end;
for (int i = 0; i < num_launches; i++) {
start = clock64();
empty_child<<<1, 1>>>();
cudaDeviceSynchronize();
end = clock64();
timings[i] = (float)(end - start);
}
}
// Host-side measurement for comparison
void measure_host_launch_overhead(int num_launches) {
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < num_launches; i++) {
empty_child<<<1, 1>>>();
}
cudaEventRecord(stop);
cudaEventSynchronize(stop); // Waits for all launches and the stop event
float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("Host launch: %.1f us per launch\n", ms * 1000.0f / num_launches);
}
Kernel Launch Overhead: Host vs Device (A100)
| Launch Source | Overhead Per Launch | Notes |
|---|---|---|
| Host -> GPU | ~5-10 us | Includes PCIe/command buffer latency |
| Device -> GPU (no sync) | ~15-25 us | Scheduling + context switch on GPU |
| Device -> GPU (with sync) | ~25-50 us | Launch + child execution + sync |
| CUDA Graph (host) | ~3-5 us | Pre-recorded, minimal overhead |
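The CUDA Graph row deserves a sketch: when the same launch sequence repeats, capturing it once and replaying the instantiated graph avoids most per-launch overhead. A minimal stream-capture example reusing `child_kernel` from above (the 3-argument `cudaGraphInstantiate` is the CUDA 12 signature):

```cuda
void replay_with_graph(float* d_data, int n, int iterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    // Record the work once instead of re-issuing it every iteration
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    child_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 0, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0); // CUDA 12 signature
    for (int i = 0; i < iterations; i++) {
        cudaGraphLaunch(exec, stream); // Cheap replay of the captured work
    }
    cudaStreamSynchronize(stream);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```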
When Dynamic Parallelism Loses
If the total number of child launches is known at the host, launching them from the host (potentially in different streams) is almost always faster:
// BAD: Launch N independent kernels from device code
__global__ void parent_many_children(float** buffers, int* sizes, int num_tasks) {
int task_id = threadIdx.x;
if (task_id >= num_tasks) return;
child_kernel<<<(sizes[task_id] + 255) / 256, 256>>>(
buffers[task_id], 0, sizes[task_id]);
}
// GOOD: Launch N independent kernels from host, each in its own stream
void host_many_children(float** d_buffers, int* h_sizes, int num_tasks) {
cudaStream_t* streams = new cudaStream_t[num_tasks];
for (int i = 0; i < num_tasks; i++) {
cudaStreamCreate(&streams[i]);
child_kernel<<<(h_sizes[i] + 255) / 256, 256, 0, streams[i]>>>(
d_buffers[i], 0, h_sizes[i]);
}
cudaDeviceSynchronize();
for (int i = 0; i < num_tasks; i++) cudaStreamDestroy(streams[i]);
delete[] streams;
}
Where Dynamic Parallelism Helps: Adaptive Algorithms
The legitimate use case for DP is when the work decomposition depends on data that is only available on the GPU, and the decomposition is recursive or highly irregular.
Adaptive Mesh Refinement (AMR)
#include <cuda_runtime.h>
struct Cell {
float value;
float gradient;
int level; // Refinement level
int child_offset; // Index of first child in next level
};
__device__ float compute_gradient(const Cell* cells, int idx, int nx) {
float dx = cells[idx + 1].value - cells[idx - 1].value;
float dy = cells[idx + nx].value - cells[idx - nx].value;
return sqrtf(dx * dx + dy * dy);
}
// Refine cells with high gradient by launching child kernels
__global__ void adaptive_refine(Cell* cells, Cell* refined_cells,
int* num_refined, int n, int nx,
float threshold, int max_level) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx <= 0 || idx >= n - 1) return;
float grad = compute_gradient(cells, idx, nx);
cells[idx].gradient = grad;
if (grad > threshold && cells[idx].level < max_level) {
// This cell needs refinement — allocate child cells
int child_idx = atomicAdd(num_refined, 4); // 4 children per cell
cells[idx].child_offset = child_idx;
// Initialize children at higher resolution
float base_val = cells[idx].value;
for (int c = 0; c < 4; c++) {
refined_cells[child_idx + c].value = base_val;
refined_cells[child_idx + c].level = cells[idx].level + 1;
refined_cells[child_idx + c].child_offset = -1;
}
// Make the children visible to the child grid before it launches
__threadfence();
}
// Recursively refine the next level (this is where DP shines).
// Note: a __syncthreads() inside the divergent branch above would be
// undefined behavior, so a fence plus a single launching thread is used.
// This launch may miss refinements still in flight in other blocks; a
// production version would launch the next level from a separate kernel
// or a tail launch after a grid-wide pass.
if (threadIdx.x == 0 && blockIdx.x == 0 && *num_refined > 0) {
int child_count = *num_refined;
int block_size = 256;
int grid_size = (child_count + block_size - 1) / block_size;
adaptive_refine<<<grid_size, block_size>>>(
refined_cells, refined_cells + child_count,
num_refined, child_count, nx,
threshold * 0.5f, max_level); // Tighter threshold at finer levels
}
}
Recursive Quicksort
__global__ void quicksort_device(float* data, int left, int right) {
if (right - left <= 1024) {
// Base case: sort small arrays with a simple kernel
// (insertion sort or bitonic sort for small sizes)
if (threadIdx.x == 0) {
// Simple insertion sort for small partitions
for (int i = left + 1; i <= right; i++) {
float key = data[i];
int j = i - 1;
while (j >= left && data[j] > key) {
data[j + 1] = data[j];
j--;
}
data[j + 1] = key;
}
}
return;
}
// Partition step: parallel partition around pivot
__shared__ float pivot;
__shared__ int partition_idx;
if (threadIdx.x == 0) {
// Median-of-three pivot selection
float a = data[left], b = data[(left + right) / 2], c = data[right];
pivot = (a <= b) ? ((b <= c) ? b : ((a <= c) ? c : a))
: ((a <= c) ? a : ((b <= c) ? c : b));
}
__syncthreads();
// Parallel partition (simplified — production code uses scan-based partition)
if (threadIdx.x == 0) {
int store_idx = left;
for (int i = left; i <= right; i++) {
if (data[i] < pivot) {
float tmp = data[store_idx];
data[store_idx] = data[i];
data[i] = tmp;
store_idx++;
}
}
if (store_idx == left) {
// No element is < pivot, so pivot equals the minimum. Place one copy
// of it at 'left' so the recursion is guaranteed to shrink — without
// this, an all-equal partition would recurse forever.
for (int i = left; i <= right; i++) {
if (data[i] == pivot) {
float tmp = data[left];
data[left] = data[i];
data[i] = tmp;
break;
}
}
store_idx = left + 1;
}
partition_idx = store_idx;
}
__syncthreads();
// Recursive calls — only thread 0 launches children
if (threadIdx.x == 0) {
int mid = partition_idx;
if (mid - left > 0) {
quicksort_device<<<1, 256>>>(data, left, mid - 1);
}
if (right - mid > 0) {
quicksort_device<<<1, 256>>>(data, mid, right);
}
}
}
Device-side quicksort with dynamic parallelism is conceptually clean but rarely beats CUB RadixSort or Thrust sort in practice. The launch overhead accumulates over O(log n) recursion levels, and the serial partition step underutilizes the GPU. Use DP quicksort for educational purposes; use CUB/Thrust for production sorting.
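For reference, the production path is a one-liner: Thrust dispatches primitive-key sorts to CUB's radix sort under the hood.

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Sorts n floats already resident in device memory — no recursion,
// no device-side launches, no pending-launch limits to configure.
void sort_on_device(float* d_data, int n) {
    thrust::sort(thrust::device, d_data, d_data + n);
}
```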
Nesting Depth and Resource Limits
// Hardware supports up to 24 levels of nesting (device dependent). Each
// level that calls cudaDeviceSynchronize() consumes extra memory for saved
// parent state; the allowed synchronization depth is a configurable limit.
void check_nesting_limit() {
size_t sync_depth;
cudaDeviceGetLimit(&sync_depth, cudaLimitDevRuntimeSyncDepth);
printf("Device runtime sync depth: %zu\n", sync_depth); // Default is 2
// Raise it if deeply nested parents need to synchronize on children
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);
// Configure the pending launch count (buffer for queued child launches)
size_t pending_limit = 2048; // Default is 2048 pending launches
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, pending_limit);
// If too many children are launched without sync, launches will fail
// with cudaErrorLaunchPendingCountExceeded
}
Checking for Device-Side Errors
__global__ void parent_with_error_check(float* data, int n) {
child_kernel<<<(n + 255) / 256, 256>>>(data, 0, n);
// Check for launch errors
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
// Launch failed — could be:
// cudaErrorLaunchPendingCountExceeded
// cudaErrorInvalidConfiguration
// cudaErrorMemoryAllocation
printf("Child launch failed: %d\n", (int)err);
return;
}
cudaDeviceSynchronize();
// Check for execution errors
err = cudaGetLastError();
if (err != cudaSuccess) {
printf("Child execution failed: %d\n", (int)err);
}
}
Stream Management on Device
Device code can create and use streams for concurrent child execution:
__global__ void parent_with_streams(float* data, int n) {
// Create device-side streams
cudaStream_t stream1, stream2;
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);
int half = n / 2;
// Launch children in separate streams for concurrency
child_kernel<<<(half + 255) / 256, 256, 0, stream1>>>(data, 0, half);
child_kernel<<<(half + 255) / 256, 256, 0, stream2>>>(data, half, half);
// Wait for both
cudaDeviceSynchronize(); // Waits for all streams
// Clean up
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
}
Creating and destroying streams on the device is expensive (~10-20 us each). If you need device-side streams, create them once at the beginning of the parent kernel and reuse them across multiple launches. Alternatively, use the default stream (stream 0) and accept serial execution of children.
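A sketch of the create-once-reuse pattern, again assuming a CDP1 toolchain where device-side cudaDeviceSynchronize() is available (`parent_reuse_streams` and `num_waves` are illustrative names):

```cuda
__global__ void parent_reuse_streams(float* data, int n, int num_waves) {
    // Pay the stream creation cost ONCE, not once per wave of children
    cudaStream_t s0, s1;
    cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    int half = n / 2;
    for (int wave = 0; wave < num_waves; wave++) {
        // The same two streams carry every wave of child launches
        child_kernel<<<(half + 255) / 256, 256, 0, s0>>>(data, 0, half);
        child_kernel<<<(half + 255) / 256, 256, 0, s1>>>(data, half, n - half);
        cudaDeviceSynchronize(); // One sync per wave
    }
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```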
Practical Example: Sparse Matrix Operations
Dynamic parallelism is well-suited for sparse data structures where the work per element varies dramatically:
#include <cuda_runtime.h>
// CSR format sparse matrix
struct CSRMatrix {
float* values;
int* col_indices;
int* row_offsets;
int num_rows;
int nnz;
};
// Child: multiply one row by a dense vector
__global__ void spmv_row(const float* values, const int* col_indices,
const float* x, float* y, int row,
int row_start, int row_end) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int nnz_in_row = row_end - row_start;
__shared__ float partial[256];
float sum = 0.0f;
for (int i = idx; i < nnz_in_row; i += blockDim.x * gridDim.x) {
int col = col_indices[row_start + i];
sum += values[row_start + i] * x[col];
}
// Reduce within block
partial[threadIdx.x] = sum;
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
__syncthreads();
}
// Each block adds its partial sum to this row's output element
if (threadIdx.x == 0) atomicAdd(&y[row], partial[0]);
}
// Parent: launch appropriately-sized child for each row
__global__ void spmv_adaptive(const CSRMatrix mat, const float* x, float* y) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row >= mat.num_rows) return;
int row_start = mat.row_offsets[row];
int row_end = mat.row_offsets[row + 1];
int nnz_in_row = row_end - row_start;
// Adaptive: choose child grid size based on row density
if (nnz_in_row == 0) {
y[row] = 0.0f;
} else if (nnz_in_row <= 32) {
// Very sparse row: one warp is enough
// Compute directly, no child launch needed
float sum = 0.0f;
for (int i = row_start; i < row_end; i++) {
sum += mat.values[i] * x[mat.col_indices[i]];
}
y[row] = sum;
} else {
// Dense row: launch child kernel
int block_size = 256;
int grid_size = (nnz_in_row + block_size - 1) / block_size;
if (grid_size > 32) grid_size = 32;
y[row] = 0.0f; // Initialize before the child accumulates into it
spmv_row<<<grid_size, block_size>>>(
mat.values, mat.col_indices, x, y, row, row_start, row_end);
}
}
Sparse Matrix-Vector: Flat vs Adaptive DP (A100)
| Matrix | Rows | NNZ/Row Range | Flat (ms) | Adaptive DP (ms) | Speedup |
|---|---|---|---|---|---|
| Uniform sparse | 100K | 10-12 | 0.8 | 1.2 | 0.67x (slower) |
| Power-law | 100K | 1-50000 | 3.2 | 1.8 | 1.78x |
| Block diagonal | 100K | 0-1024 | 2.1 | 1.4 | 1.50x |
Alternatives to Dynamic Parallelism
Before reaching for DP, consider these alternatives that often perform better:
Alternative 1: Persistent Threads
// Instead of launching child kernels, use persistent threads with a work queue
__global__ void persistent_workers(float* data, const int* task_queue,
const int* task_sizes, int num_tasks,
int* next_task) {
while (true) {
// Atomically grab the next task (host initializes *next_task to 0)
int task_id = atomicAdd(next_task, 1);
if (task_id >= num_tasks) break;
// Process task
int start = task_queue[task_id];
int size = task_sizes[task_id];
for (int i = 0; i < size; i++) {
data[start + i] *= 2.0f;
}
}
}
Alternative 2: Pre-Computed Launch Configurations
// Kernel 1: Analyze data and compute launch configurations
#define CHUNK_SIZE 4096 // Elements examined per chunk (tunable)
__global__ void analyze_work(const float* data, int* grid_sizes,
int* offsets, int n, float threshold) {
// Determine how much work each chunk needs
int chunk = blockIdx.x;
int chunk_start = chunk * CHUNK_SIZE;
int count = 0;
for (int i = chunk_start; i < chunk_start + CHUNK_SIZE && i < n; i++) {
if (data[i] > threshold) count++;
}
grid_sizes[chunk] = (count + 255) / 256;
offsets[chunk] = chunk_start;
}
// Host reads grid_sizes, launches appropriately sized kernels
// This avoids DP entirely at the cost of one extra kernel + host readback
Alternative 3: Cooperative Groups Grid Sync
// For algorithms that need "phases" but not recursive decomposition,
// grid.sync() is cheaper than child launches. Note: the kernel must be
// launched with cudaLaunchCooperativeKernel(), and process_phase1 /
// process_phase2 stand in for application-specific __device__ functions.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void multi_phase_kernel(float* data, int n) {
cg::grid_group grid = cg::this_grid();
// Phase 1
int idx = grid.thread_rank();
if (idx < n) data[idx] = process_phase1(data[idx]);
grid.sync();
// Phase 2 (depends on Phase 1 results)
if (idx < n) data[idx] = process_phase2(data, idx, n);
grid.sync();
// Phase 3 ...
}
Work Decomposition Approaches: Overhead Comparison
(figure: microseconds per dispatch for each approach)
Tail Launch Optimization (CUDA 12+)
CUDA 12 introduced tail launch as a more efficient alternative to recursive DP:
// Tail launch: instead of parent + child running concurrently,
// the parent's resources are freed when the child launches.
// This reduces memory pressure and avoids the nesting stack.
// In CUDA 12, a tail launch is expressed by launching into the named
// stream cudaStreamTailLaunch; the new grid begins only after the current
// grid (and its other children) complete:
__global__ void iterative_with_tail(float* data, int n, int iteration) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
data[idx] = process(data[idx], iteration); // process(): app-specific __device__ fn
}
// Instead of a recursive launch + sync, use a tail launch
if (threadIdx.x == 0 && blockIdx.x == 0 && iteration < MAX_ITER) {
// Tail launch: current grid retires, then the new grid takes over
iterative_with_tail<<<gridDim, blockDim, 0, cudaStreamTailLaunch>>>(
data, n, iteration + 1);
}
}
Complete Example: Recursive Octree Traversal
A practical DP application for spatial queries:
#include <cuda_runtime.h>
struct OctreeNode {
float center_x, center_y, center_z;
float half_size;
int children[8]; // -1 if no child
int data_start; // Index into point array
int data_count; // Number of points in this node
bool is_leaf;
};
struct QueryResult {
int point_idx;
float distance;
};
__global__ void octree_range_query(const OctreeNode* nodes,
const float* points_x,
const float* points_y,
const float* points_z,
float query_x, float query_y, float query_z,
float radius,
QueryResult* results, int* result_count,
int node_idx, int max_results) {
OctreeNode node = nodes[node_idx];
// Check if query sphere intersects this node's bounding box
float dx = fmaxf(0.0f, fabsf(query_x - node.center_x) - node.half_size);
float dy = fmaxf(0.0f, fabsf(query_y - node.center_y) - node.half_size);
float dz = fmaxf(0.0f, fabsf(query_z - node.center_z) - node.half_size);
float dist_sq = dx * dx + dy * dy + dz * dz;
if (dist_sq > radius * radius) return; // No intersection
if (node.is_leaf) {
// Check actual points in this leaf
int tid = threadIdx.x;
for (int i = tid; i < node.data_count; i += blockDim.x) {
int pidx = node.data_start + i;
float px = points_x[pidx] - query_x;
float py = points_y[pidx] - query_y;
float pz = points_z[pidx] - query_z;
float d = sqrtf(px * px + py * py + pz * pz);
if (d <= radius) {
int slot = atomicAdd(result_count, 1);
if (slot < max_results) {
results[slot].point_idx = pidx;
results[slot].distance = d;
}
}
}
} else {
// Recurse into children — this is where DP is natural
if (threadIdx.x == 0) {
for (int c = 0; c < 8; c++) {
if (node.children[c] >= 0) {
octree_range_query<<<1, 64>>>(
nodes, points_x, points_y, points_z,
query_x, query_y, query_z, radius,
results, result_count,
node.children[c], max_results);
}
}
}
}
}
Octree Range Query: DP vs Host-Driven BFS (A100, 10M points)
| Approach | Latency (ms) | Notes |
|---|---|---|
| Dynamic Parallelism | 2.4 | Natural recursion, high launch overhead |
| Host-driven BFS | 3.8 | Multiple kernel launches, host round-trips |
| Persistent thread + stack | 1.6 | Best performance, complex code |
| Flattened BFS (single kernel) | 1.9 | Pre-flatten tree to array |
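For comparison, the winning "persistent thread + stack" row replaces recursion with an explicit traversal stack. A minimal single-thread-per-query sketch using the same OctreeNode and QueryResult layouts as above (the fixed stack bound of 128 is an assumption about tree depth):

```cuda
__global__ void octree_query_stack(const OctreeNode* nodes,
                                   const float* points_x, const float* points_y,
                                   const float* points_z,
                                   float qx, float qy, float qz, float radius,
                                   QueryResult* results, int* result_count,
                                   int root, int max_results) {
    // Explicit DFS stack replaces child kernel launches entirely
    int stack[128]; // Assumed bound: node fan-out * depth stays under 128
    int top = 0;
    stack[top++] = root;
    while (top > 0) {
        OctreeNode node = nodes[stack[--top]];
        // Same sphere-vs-box rejection test as the DP version
        float dx = fmaxf(0.0f, fabsf(qx - node.center_x) - node.half_size);
        float dy = fmaxf(0.0f, fabsf(qy - node.center_y) - node.half_size);
        float dz = fmaxf(0.0f, fabsf(qz - node.center_z) - node.half_size);
        if (dx * dx + dy * dy + dz * dz > radius * radius) continue;
        if (node.is_leaf) {
            for (int i = 0; i < node.data_count; i++) {
                int p = node.data_start + i;
                float ex = points_x[p] - qx;
                float ey = points_y[p] - qy;
                float ez = points_z[p] - qz;
                float d = sqrtf(ex * ex + ey * ey + ez * ez);
                if (d <= radius) {
                    int slot = atomicAdd(result_count, 1);
                    if (slot < max_results) {
                        results[slot].point_idx = p;
                        results[slot].distance = d;
                    }
                }
            }
        } else {
            // Push children instead of launching kernels for them
            for (int c = 0; c < 8; c++) {
                if (node.children[c] >= 0 && top < 128) {
                    stack[top++] = node.children[c];
                }
            }
        }
    }
}
```

One thread handles one query here; batching many queries across threads is what makes this formulation fast in practice.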
Decision Framework: When to Use Dynamic Parallelism
Use dynamic parallelism when ALL of the following are true:
- The work decomposition is data-dependent — you cannot determine the number or size of sub-tasks from the host
- The recursion depth is bounded and shallow (ideally 2-4 levels)
- Each child kernel does substantial work (thousands of threads, microseconds of execution)
- The alternative requires many host-GPU round-trips that dominate total execution time
- Code clarity matters — DP can make recursive algorithms dramatically simpler to implement
Do NOT use dynamic parallelism when:
- The work decomposition is known at host launch time (use streams/multi-kernel)
- Child kernels are small (launch overhead dominates)
- You need maximum throughput (persistent threads or cooperative groups are faster)
- The recursion depth is unbounded (will exhaust pending launch buffers)
DP Decision: Problem Size vs Launch Overhead Break-Even
(figure: break-even point in microseconds)
Summary
Dynamic Parallelism enables device-side kernel launches with kernel<<<grid, block>>>() syntax in device code. Child kernels execute asynchronously and can be synchronized with cudaDeviceSynchronize(). Memory visibility follows launch-as-fence semantics for global memory; shared memory and registers are not shared between parent and child. The primary cost is 20-50 microseconds per device-side launch, which means DP is only beneficial when child kernels are large enough to amortize the overhead and when the work decomposition is truly data-dependent. For uniform or host-predictable workloads, multi-stream host launches, CUDA Graphs, or cooperative groups with grid sync are faster alternatives. The sweet spot for DP is recursive tree traversals, adaptive mesh refinement, and irregular sparse computations where the decomposition cannot be determined without first inspecting the data on the GPU.