A recursive octree traversal implemented with host-side launches requires one CPU-GPU round-trip per tree level — roughly 8 microseconds per level, 80 microseconds for a depth-10 tree. Implement the same algorithm with Dynamic Parallelism, where the GPU launches child kernels directly from parent kernels without CPU involvement, and total latency drops to about 50 microseconds. The roughly 40% speedup comes from eliminating CPU-GPU synchronization points. But Dynamic Parallelism is not free: each device-side launch costs 20-50 microseconds versus roughly 8 microseconds for a host-side launch. It wins only when the alternative involves many round-trips, or when the work decomposition is data-dependent and cannot be precomputed on the host.
All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise.
Basic Device-Side Kernel Launch
#include <cuda_runtime.h>
#include <cstdio>
// Child kernel — runs on the GPU, launched by parent kernel
__global__ void child_kernel(float* data, int start, int count) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < count) {
data[start + idx] *= 2.0f;
}
}
// Parent kernel — launches child kernels from the GPU
__global__ void parent_kernel(float* data, int n, int num_chunks) {
int chunk_id = threadIdx.x + blockIdx.x * blockDim.x;
if (chunk_id >= num_chunks) return;
int chunk_size = n / num_chunks;
int start = chunk_id * chunk_size;
int count = (chunk_id == num_chunks - 1) ? (n - start) : chunk_size;
// Launch child kernel from device code
int block_size = 256;
int grid_size = (count + block_size - 1) / block_size;
child_kernel<<<grid_size, block_size>>>(data, start, count);
}
int main() {
int n = 1 << 20;
float* d_data;
cudaMalloc(&d_data, n * sizeof(float));
// Launch parent with 1 block of 16 threads -> 16 child kernels
parent_kernel<<<1, 16>>>(d_data, n, 16);
cudaDeviceSynchronize();
cudaFree(d_data);
return 0;
}
Compilation Requirements
# Dynamic parallelism requires:
# Compute capability >= 3.5 (sm_35+)
# Relocatable device code (-rdc=true)
# Link with cudadevrt library (-lcudadevrt)
nvcc -arch=sm_80 -rdc=true kernel.cu -o kernel -lcudadevrt
# Without -rdc=true, the compiler will reject device-side <<<>>> launches
Synchronization Semantics
Child kernels launched by a parent are asynchronous by default. The parent thread continues executing while child grids are scheduled. Explicit synchronization is required to ensure child completion:
__global__ void parent_with_sync(float* data, float* results, int n) {
int tid = threadIdx.x;
// Phase 1: Launch children
int chunk_size = n / blockDim.x;
int start = tid * chunk_size;
int count = chunk_size;
child_kernel<<<(count + 255) / 256, 256>>>(data, start, count);
// Phase 2: Wait for ALL child kernels launched by this block
// Device-side cudaDeviceSynchronize() waits for children launched by
// EVERY thread in the calling block, not just the calling thread
cudaDeviceSynchronize();
// Phase 3: Now safe to read results written by children
if (tid == 0) {
float sum = 0.0f;
for (int i = 0; i < n; i++) sum += data[i];
results[blockIdx.x] = sum;
}
}
cudaDeviceSynchronize() called from device code synchronizes ALL child grids launched by the calling thread block, not just those launched by the calling thread. If thread 0 launches kernel A and thread 1 launches kernel B, a cudaDeviceSynchronize() from any thread in the block waits for both A and B. There is no per-thread synchronization for child kernels. Note that device-side cudaDeviceSynchronize() was deprecated in CUDA 11.6 and removed from the CUDA 12 device runtime; on newer toolkits, waiting on children is expressed through tail launches or the implicit synchronization at parent exit, both covered later in this section.
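A minimal sketch of this block-wide behavior, assuming a CDP1 toolchain (CUDA 11.x) where device-side cudaDeviceSynchronize() is still available; `tick` and `block_wide_sync_demo` are illustrative names, not part of any API:

```cuda
#include <cstdio>

__global__ void tick(int* counter) {
    atomicAdd(counter, 1);
}

__global__ void block_wide_sync_demo(int* counter) {
    // Two DIFFERENT threads launch two different child grids
    if (threadIdx.x == 0) tick<<<1, 1>>>(counter);
    if (threadIdx.x == 1) tick<<<1, 1>>>(counter);
    // A single sync from ANY thread in the block waits for BOTH children
    cudaDeviceSynchronize();
    if (threadIdx.x == 0) {
        printf("counter = %d\n", *counter); // Both increments are visible here
    }
}
```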
Implicit Synchronization
When a parent kernel completes (all parent threads exit), the CUDA runtime implicitly waits for all child and descendant kernels to complete before signaling the parent’s stream event to the host:
__global__ void parent_no_sync(float* data, int n) {
// Launch child
child_kernel<<<(n + 255) / 256, 256>>>(data, 0, n);
// Parent exits WITHOUT calling cudaDeviceSynchronize()
// The child WILL complete before the host sees the parent's stream event
// But the parent thread itself does NOT wait — it exits immediately
}
// Host code:
// parent_no_sync<<<1, 1>>>(d_data, n);
// cudaDeviceSynchronize(); // Waits for parent AND all descendants
Memory Visibility Rules
Memory visibility between parent and child follows specific rules:
__global__ void memory_visibility_demo(float* global_data, int n) {
// Rule 1: Global memory written by parent is visible to child
// AFTER the child launch (launch acts as a memory fence)
global_data[threadIdx.x] = (float)threadIdx.x;
// The <<<>>> launch implies a memory fence for global memory
child_kernel<<<1, 256>>>(global_data, 0, 256);
// Rule 2: Global memory written by child is visible to parent
// ONLY after cudaDeviceSynchronize()
cudaDeviceSynchronize();
float val = global_data[0]; // Now safe to read child's output
// Rule 3: Shared memory is NOT visible to children
__shared__ float smem[256];
smem[threadIdx.x] = 42.0f;
// child_kernel CANNOT access smem — it has its own shared memory
// Rule 4: Local variables (registers) are NOT shared
float local_val = 3.14f;
// child_kernel cannot see local_val
// Rule 5: Constant memory is visible (read-only, set before parent launch)
}
Device-Side malloc and Free
// Device-side memory allocation (from a pre-allocated device heap)
__global__ void device_malloc_demo(float** results, int n) {
// One allocation per block — assumes a single thread per block (e.g. <<<N, 1>>>)
if (threadIdx.x != 0) return;
// Allocate memory from the device heap
float* local_buffer = (float*)malloc(n * sizeof(float));
if (local_buffer == nullptr) {
// Device heap exhausted — increase it with cudaDeviceSetLimit
return;
}
// Use the buffer
for (int i = 0; i < n; i++) {
local_buffer[i] = (float)i;
}
// Pass to child
child_kernel<<<(n + 255) / 256, 256>>>(local_buffer, 0, n);
cudaDeviceSynchronize();
// Store result pointer
results[blockIdx.x] = local_buffer;
// WARNING: device-heap memory persists across kernel launches and can only
// be released by device-side free() — host cudaFree() cannot release it
// free(local_buffer); // If not needed after this kernel
}
// Configure device heap size before launching
void configure_heap() {
// Default device heap is 8 MB
size_t heap_size = 256 * 1024 * 1024; // 256 MB
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_size);
}
Device-side malloc() and free() use a global heap with a mutex-like serialization mechanism. Throughput is orders of magnitude worse than host-side cudaMalloc. Use device malloc only for small, infrequent allocations. For performance-critical code, pre-allocate all memory from the host.
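One common way to follow that advice is a host-pre-allocated slab carved up by a device-side atomic bump allocator; `pool_alloc` and the parameter names here are illustrative, not a library API:

```cuda
// Host allocates one large pool with cudaMalloc and zeroes pool_offset;
// device threads then claim sub-ranges with a single atomicAdd each.
__device__ float* pool_alloc(float* pool, unsigned long long* pool_offset,
                             unsigned long long pool_capacity, int count) {
    unsigned long long start = atomicAdd(pool_offset, (unsigned long long)count);
    if (start + count > pool_capacity) return nullptr; // Pool exhausted
    return pool + start;
}

__global__ void uses_pool(float* pool, unsigned long long* pool_offset,
                          unsigned long long pool_capacity, int n) {
    float* buf = pool_alloc(pool, pool_offset, pool_capacity, n);
    if (buf == nullptr) return; // Out of pool space — skip, don't crash
    for (int i = 0; i < n; i++) buf[i] = (float)i;
}
```

Resetting `pool_offset` to zero between kernel launches "frees" everything at once, which is exactly the cheap lifetime model most GPU workloads want.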
Performance Characteristics of Device-Side Launches
The overhead of launching a kernel from device code is substantially higher than from the host:
#include <cuda_runtime.h>
#include <cstdio>
__global__ void empty_child() {
// Does nothing — measures pure launch overhead
}
__global__ void measure_launch_overhead(float* timings, int num_launches) {
// clock64() counts SM cycles; divide by the SM clock rate to convert to time
long long start, end;
for (int i = 0; i < num_launches; i++) {
start = clock64();
empty_child<<<1, 1>>>();
cudaDeviceSynchronize();
end = clock64();
timings[i] = (float)(end - start);
}
}
// Host-side measurement for comparison
void measure_host_launch_overhead(int num_launches) {
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < num_launches; i++) {
empty_child<<<1, 1>>>();
}
cudaEventRecord(stop);
cudaEventSynchronize(stop); // Waits for all launches and the stop event
float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("Host launch: %.1f us per launch\n", ms * 1000.0f / num_launches);
}
Kernel Launch Overhead: Host vs Device (A100)
| Launch Source | Overhead Per Launch | Notes |
|---|---|---|
| Host -> GPU | ~5-10 us | Includes PCIe/command buffer latency |
| Device -> GPU (no sync) | ~15-25 us | Scheduling + context switch on GPU |
| Device -> GPU (with sync) | ~25-50 us | Launch + child execution + sync |
| CUDA Graph (host) | ~3-5 us | Pre-recorded, minimal overhead |
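The CUDA Graph row deserves a sketch: when the same launch sequence repeats, capturing it once and replaying the instantiated graph avoids most per-launch overhead. A minimal stream-capture example reusing `child_kernel` from above (the 3-argument `cudaGraphInstantiate` is the CUDA 12 signature):

```cuda
void replay_with_graph(float* d_data, int n, int iterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    // Record the work once instead of re-issuing it every iteration
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    child_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 0, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0); // CUDA 12 signature
    for (int i = 0; i < iterations; i++) {
        cudaGraphLaunch(exec, stream); // Cheap replay of the captured work
    }
    cudaStreamSynchronize(stream);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```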
When Dynamic Parallelism Loses
If the total number of child launches is known at the host, launching them from the host (potentially in different streams) is almost always faster:
// BAD: Launch N independent kernels from device code
__global__ void parent_many_children(float** buffers, int* sizes, int num_tasks) {
int task_id = threadIdx.x;
if (task_id >= num_tasks) return;
child_kernel<<<(sizes[task_id] + 255) / 256, 256>>>(
buffers[task_id], 0, sizes[task_id]);
}
// GOOD: Launch N independent kernels from host, each in its own stream
void host_many_children(float** d_buffers, int* h_sizes, int num_tasks) {
cudaStream_t* streams = new cudaStream_t[num_tasks];
for (int i = 0; i < num_tasks; i++) {
cudaStreamCreate(&streams[i]);
child_kernel<<<(h_sizes[i] + 255) / 256, 256, 0, streams[i]>>>(
d_buffers[i], 0, h_sizes[i]);
}
cudaDeviceSynchronize();
for (int i = 0; i < num_tasks; i++) cudaStreamDestroy(streams[i]);
delete[] streams;
}
Where Dynamic Parallelism Helps: Adaptive Algorithms
The legitimate use case for DP is when the work decomposition depends on data that is only available on the GPU, and the decomposition is recursive or highly irregular.
Adaptive Mesh Refinement (AMR)
#include <cuda_runtime.h>
struct Cell {
float value;
float gradient;
int level; // Refinement level
int child_offset; // Index of first child in next level
};
__device__ float compute_gradient(const Cell* cells, int idx, int nx) {
float dx = cells[idx + 1].value - cells[idx - 1].value;
float dy = cells[idx + nx].value - cells[idx - nx].value;
return sqrtf(dx * dx + dy * dy);
}
// Refine cells with high gradient by launching child kernels
__global__ void adaptive_refine(Cell* cells, Cell* refined_cells,
int* num_refined, int n, int nx,
float threshold, int max_level) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx <= 0 || idx >= n - 1) return;
float grad = compute_gradient(cells, idx, nx);
cells[idx].gradient = grad;
if (grad > threshold && cells[idx].level < max_level) {
// This cell needs refinement — allocate child cells
int child_idx = atomicAdd(num_refined, 4); // 4 children per cell
cells[idx].child_offset = child_idx;
// Initialize children at higher resolution
float base_val = cells[idx].value;
for (int c = 0; c < 4; c++) {
refined_cells[child_idx + c].value = base_val;
refined_cells[child_idx + c].level = cells[idx].level + 1;
refined_cells[child_idx + c].child_offset = -1;
}
// Make the children visible to the child grid before it launches
__threadfence();
}
// Recursively refine the next level (this is where DP shines).
// Note: a __syncthreads() inside the divergent branch above would be
// undefined behavior, so a fence plus a single launching thread is used.
// This launch may miss refinements still in flight in other blocks; a
// production version would launch the next level from a separate kernel
// or a tail launch after a grid-wide pass.
if (threadIdx.x == 0 && blockIdx.x == 0 && *num_refined > 0) {
int child_count = *num_refined;
int block_size = 256;
int grid_size = (child_count + block_size - 1) / block_size;
adaptive_refine<<<grid_size, block_size>>>(
refined_cells, refined_cells + child_count,
num_refined, child_count, nx,
threshold * 0.5f, max_level); // Tighter threshold at finer levels
}
}
Recursive Quicksort
__global__ void quicksort_device(float* data, int left, int right) {
if (right - left <= 1024) {
// Base case: sort small arrays with a simple kernel
// (insertion sort or bitonic sort for small sizes)
if (threadIdx.x == 0) {
// Simple insertion sort for small partitions
for (int i = left + 1; i <= right; i++) {
float key = data[i];
int j = i - 1;
while (j >= left && data[j] > key) {
data[j + 1] = data[j];
j--;
}
data[j + 1] = key;
}
}
return;
}
// Partition step: parallel partition around pivot
__shared__ float pivot;
__shared__ int partition_idx;
if (threadIdx.x == 0) {
// Median-of-three pivot selection
float a = data[left], b = data[(left + right) / 2], c = data[right];
pivot = (a <= b) ? ((b <= c) ? b : ((a <= c) ? c : a))
: ((a <= c) ? a : ((b <= c) ? c : b));
}
__syncthreads();
// Parallel partition (simplified — production code uses scan-based partition)
if (threadIdx.x == 0) {
int store_idx = left;
for (int i = left; i <= right; i++) {
if (data[i] < pivot) {
float tmp = data[store_idx];
data[store_idx] = data[i];
data[i] = tmp;
store_idx++;
}
}
if (store_idx == left) {
// No element is < pivot, so pivot equals the minimum. Place one copy
// of it at 'left' so the recursion is guaranteed to shrink — without
// this, an all-equal partition would recurse forever.
for (int i = left; i <= right; i++) {
if (data[i] == pivot) {
float tmp = data[left];
data[left] = data[i];
data[i] = tmp;
break;
}
}
store_idx = left + 1;
}
partition_idx = store_idx;
}
__syncthreads();
// Recursive calls — only thread 0 launches children
if (threadIdx.x == 0) {
int mid = partition_idx;
if (mid - left > 0) {
quicksort_device<<<1, 256>>>(data, left, mid - 1);
}
if (right - mid > 0) {
quicksort_device<<<1, 256>>>(data, mid, right);
}
}
}
Device-side quicksort with dynamic parallelism is conceptually clean but rarely beats CUB RadixSort or Thrust sort in practice. The launch overhead accumulates over O(log n) recursion levels, and the serial partition step underutilizes the GPU. Use DP quicksort for educational purposes; use CUB/Thrust for production sorting.
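For reference, the production path is a one-liner: Thrust dispatches primitive-key sorts to CUB's radix sort under the hood.

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Sorts n floats already resident in device memory — no recursion,
// no device-side launches, no pending-launch limits to configure.
void sort_on_device(float* d_data, int n) {
    thrust::sort(thrust::device, d_data, d_data + n);
}
```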
Nesting Depth and Resource Limits
// Hardware supports up to 24 levels of nesting (device dependent). Each
// level that calls cudaDeviceSynchronize() consumes extra memory for saved
// parent state; the allowed synchronization depth is a configurable limit.
void check_nesting_limit() {
size_t sync_depth;
cudaDeviceGetLimit(&sync_depth, cudaLimitDevRuntimeSyncDepth);
printf("Device runtime sync depth: %zu\n", sync_depth); // Default is 2
// Raise it if deeply nested parents need to synchronize on children
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);
// Configure the pending launch count (buffer for queued child launches)
size_t pending_limit = 2048; // Default is 2048 pending launches
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, pending_limit);
// If too many children are launched without sync, launches will fail
// with cudaErrorLaunchPendingCountExceeded
}
Checking for Device-Side Errors
__global__ void parent_with_error_check(float* data, int n) {
child_kernel<<<(n + 255) / 256, 256>>>(data, 0, n);
// Check for launch errors
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
// Launch failed — could be:
// cudaErrorLaunchPendingCountExceeded
// cudaErrorInvalidConfiguration
// cudaErrorMemoryAllocation
printf("Child launch failed: %d\n", (int)err);
return;
}
cudaDeviceSynchronize();
// Check for execution errors
err = cudaGetLastError();
if (err != cudaSuccess) {
printf("Child execution failed: %d\n", (int)err);
}
}
Stream Management on Device
Device code can create and use streams for concurrent child execution:
__global__ void parent_with_streams(float* data, int n) {
// Create device-side streams
cudaStream_t stream1, stream2;
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);
int half = n / 2;
// Launch children in separate streams for concurrency
child_kernel<<<(half + 255) / 256, 256, 0, stream1>>>(data, 0, half);
child_kernel<<<(half + 255) / 256, 256, 0, stream2>>>(data, half, half);
// Wait for both
cudaDeviceSynchronize(); // Waits for all streams
// Clean up
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
}
Creating and destroying streams on the device is expensive (~10-20 us each). If you need device-side streams, create them once at the beginning of the parent kernel and reuse them across multiple launches. Alternatively, use the default stream (stream 0) and accept serial execution of children.
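A sketch of the create-once-reuse pattern, again assuming a CDP1 toolchain where device-side cudaDeviceSynchronize() is available (`parent_reuse_streams` and `num_waves` are illustrative names):

```cuda
__global__ void parent_reuse_streams(float* data, int n, int num_waves) {
    // Pay the stream creation cost ONCE, not once per wave of children
    cudaStream_t s0, s1;
    cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    int half = n / 2;
    for (int wave = 0; wave < num_waves; wave++) {
        // The same two streams carry every wave of child launches
        child_kernel<<<(half + 255) / 256, 256, 0, s0>>>(data, 0, half);
        child_kernel<<<(half + 255) / 256, 256, 0, s1>>>(data, half, n - half);
        cudaDeviceSynchronize(); // One sync per wave
    }
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```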
Practical Example: Sparse Matrix Operations
Dynamic parallelism is well-suited for sparse data structures where the work per element varies dramatically:
#include <cuda_runtime.h>
// CSR format sparse matrix
struct CSRMatrix {
float* values;
int* col_indices;
int* row_offsets;
int num_rows;
int nnz;
};
// Child: multiply one row by a dense vector
__global__ void spmv_row(const float* values, const int* col_indices,
const float* x, float* y, int row,
int row_start, int row_end) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int nnz_in_row = row_end - row_start;
__shared__ float partial[256];
float sum = 0.0f;
for (int i = idx; i < nnz_in_row; i += blockDim.x * gridDim.x) {
int col = col_indices[row_start + i];
sum += values[row_start + i] * x[col];
}
// Reduce within block
partial[threadIdx.x] = sum;
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
__syncthreads();
}
// Each block adds its partial sum to this row's output element
if (threadIdx.x == 0) atomicAdd(&y[row], partial[0]);
}
// Parent: launch appropriately-sized child for each row
__global__ void spmv_adaptive(const CSRMatrix mat, const float* x, float* y) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row >= mat.num_rows) return;
int row_start = mat.row_offsets[row];
int row_end = mat.row_offsets[row + 1];
int nnz_in_row = row_end - row_start;
// Adaptive: choose child grid size based on row density
if (nnz_in_row == 0) {
y[row] = 0.0f;
} else if (nnz_in_row <= 32) {
// Very sparse row: one warp is enough
// Compute directly, no child launch needed
float sum = 0.0f;
for (int i = row_start; i < row_end; i++) {
sum += mat.values[i] * x[mat.col_indices[i]];
}
y[row] = sum;
} else {
// Dense row: launch child kernel
int block_size = 256;
int grid_size = (nnz_in_row + block_size - 1) / block_size;
if (grid_size > 32) grid_size = 32;
y[row] = 0.0f; // Initialize before the child accumulates into it
spmv_row<<<grid_size, block_size>>>(
mat.values, mat.col_indices, x, y, row, row_start, row_end);
}
}
Sparse Matrix-Vector: Flat vs Adaptive DP (A100)
| Matrix | Rows | NNZ/Row Range | Flat (ms) | Adaptive DP (ms) | Speedup |
|---|---|---|---|---|---|
| Uniform sparse | 100K | 10-12 | 0.8 | 1.2 | 0.67x (slower) |
| Power-law | 100K | 1-50000 | 3.2 | 1.8 | 1.78x |
| Block diagonal | 100K | 0-1024 | 2.1 | 1.4 | 1.50x |
Alternatives to Dynamic Parallelism
Before reaching for DP, consider these alternatives that often perform better:
Alternative 1: Persistent Threads
// Instead of launching child kernels, use persistent threads with a work queue
__global__ void persistent_workers(float* data, const int* task_queue,
const int* task_sizes, int num_tasks,
int* next_task) {
while (true) {
// Atomically grab the next task (host initializes *next_task to 0)
int task_id = atomicAdd(next_task, 1);
if (task_id >= num_tasks) break;
// Process task
int start = task_queue[task_id];
int size = task_sizes[task_id];
for (int i = 0; i < size; i++) {
data[start + i] *= 2.0f;
}
}
}
Alternative 2: Pre-Computed Launch Configurations
// Kernel 1: Analyze data and compute launch configurations
#define CHUNK_SIZE 4096 // Elements examined per chunk (tunable)
__global__ void analyze_work(const float* data, int* grid_sizes,
int* offsets, int n, float threshold) {
// Determine how much work each chunk needs
int chunk = blockIdx.x;
int chunk_start = chunk * CHUNK_SIZE;
int count = 0;
for (int i = chunk_start; i < chunk_start + CHUNK_SIZE && i < n; i++) {
if (data[i] > threshold) count++;
}
grid_sizes[chunk] = (count + 255) / 256;
offsets[chunk] = chunk_start;
}
// Host reads grid_sizes, launches appropriately sized kernels
// This avoids DP entirely at the cost of one extra kernel + host readback
Alternative 3: Cooperative Groups Grid Sync
// For algorithms that need "phases" but not recursive decomposition,
// grid.sync() is cheaper than child launches. Note: the kernel must be
// launched with cudaLaunchCooperativeKernel(), and process_phase1 /
// process_phase2 stand in for application-specific __device__ functions.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void multi_phase_kernel(float* data, int n) {
cg::grid_group grid = cg::this_grid();
// Phase 1
int idx = grid.thread_rank();
if (idx < n) data[idx] = process_phase1(data[idx]);
grid.sync();
// Phase 2 (depends on Phase 1 results)
if (idx < n) data[idx] = process_phase2(data, idx, n);
grid.sync();
// Phase 3 ...
}
Work Decomposition Approaches: Overhead Comparison
(figure: microseconds per dispatch for each approach)
Tail Launch Optimization (CUDA 12+)
CUDA 12 introduced tail launch as a more efficient alternative to recursive DP:
// Tail launch: instead of parent + child running concurrently,
// the parent's resources are freed when the child launches.
// This reduces memory pressure and avoids the nesting stack.
// In CUDA 12, a tail launch is expressed by launching into the named
// stream cudaStreamTailLaunch; the new grid begins only after the current
// grid (and its other children) complete:
__global__ void iterative_with_tail(float* data, int n, int iteration) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
data[idx] = process(data[idx], iteration); // process(): app-specific __device__ fn
}
// Instead of a recursive launch + sync, use a tail launch
if (threadIdx.x == 0 && blockIdx.x == 0 && iteration < MAX_ITER) {
// Tail launch: current grid retires, then the new grid takes over
iterative_with_tail<<<gridDim, blockDim, 0, cudaStreamTailLaunch>>>(
data, n, iteration + 1);
}
}
Complete Example: Recursive Octree Traversal
A practical DP application for spatial queries:
#include <cuda_runtime.h>
struct OctreeNode {
float center_x, center_y, center_z;
float half_size;
int children[8]; // -1 if no child
int data_start; // Index into point array
int data_count; // Number of points in this node
bool is_leaf;
};
struct QueryResult {
int point_idx;
float distance;
};
__global__ void octree_range_query(const OctreeNode* nodes,
const float* points_x,
const float* points_y,
const float* points_z,
float query_x, float query_y, float query_z,
float radius,
QueryResult* results, int* result_count,
int node_idx, int max_results) {
OctreeNode node = nodes[node_idx];
// Check if query sphere intersects this node's bounding box
float dx = fmaxf(0.0f, fabsf(query_x - node.center_x) - node.half_size);
float dy = fmaxf(0.0f, fabsf(query_y - node.center_y) - node.half_size);
float dz = fmaxf(0.0f, fabsf(query_z - node.center_z) - node.half_size);
float dist_sq = dx * dx + dy * dy + dz * dz;
if (dist_sq > radius * radius) return; // No intersection
if (node.is_leaf) {
// Check actual points in this leaf
int tid = threadIdx.x;
for (int i = tid; i < node.data_count; i += blockDim.x) {
int pidx = node.data_start + i;
float px = points_x[pidx] - query_x;
float py = points_y[pidx] - query_y;
float pz = points_z[pidx] - query_z;
float d = sqrtf(px * px + py * py + pz * pz);
if (d <= radius) {
int slot = atomicAdd(result_count, 1);
if (slot < max_results) {
results[slot].point_idx = pidx;
results[slot].distance = d;
}
}
}
} else {
// Recurse into children — this is where DP is natural
if (threadIdx.x == 0) {
for (int c = 0; c < 8; c++) {
if (node.children[c] >= 0) {
octree_range_query<<<1, 64>>>(
nodes, points_x, points_y, points_z,
query_x, query_y, query_z, radius,
results, result_count,
node.children[c], max_results);
}
}
}
}
}
Octree Range Query: DP vs Host-Driven BFS (A100, 10M points)
| Approach | Latency (ms) | Notes |
|---|---|---|
| Dynamic Parallelism | 2.4 | Natural recursion, high launch overhead |
| Host-driven BFS | 3.8 | Multiple kernel launches, host round-trips |
| Persistent thread + stack | 1.6 | Best performance, complex code |
| Flattened BFS (single kernel) | 1.9 | Pre-flatten tree to array |
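For comparison, the winning "persistent thread + stack" row replaces recursion with an explicit traversal stack. A minimal single-thread-per-query sketch using the same OctreeNode and QueryResult layouts as above (the fixed stack bound of 128 is an assumption about tree depth):

```cuda
__global__ void octree_query_stack(const OctreeNode* nodes,
                                   const float* points_x, const float* points_y,
                                   const float* points_z,
                                   float qx, float qy, float qz, float radius,
                                   QueryResult* results, int* result_count,
                                   int root, int max_results) {
    // Explicit DFS stack replaces child kernel launches entirely
    int stack[128]; // Assumed bound: node fan-out * depth stays under 128
    int top = 0;
    stack[top++] = root;
    while (top > 0) {
        OctreeNode node = nodes[stack[--top]];
        // Same sphere-vs-box rejection test as the DP version
        float dx = fmaxf(0.0f, fabsf(qx - node.center_x) - node.half_size);
        float dy = fmaxf(0.0f, fabsf(qy - node.center_y) - node.half_size);
        float dz = fmaxf(0.0f, fabsf(qz - node.center_z) - node.half_size);
        if (dx * dx + dy * dy + dz * dz > radius * radius) continue;
        if (node.is_leaf) {
            for (int i = 0; i < node.data_count; i++) {
                int p = node.data_start + i;
                float ex = points_x[p] - qx;
                float ey = points_y[p] - qy;
                float ez = points_z[p] - qz;
                float d = sqrtf(ex * ex + ey * ey + ez * ez);
                if (d <= radius) {
                    int slot = atomicAdd(result_count, 1);
                    if (slot < max_results) {
                        results[slot].point_idx = p;
                        results[slot].distance = d;
                    }
                }
            }
        } else {
            // Push children instead of launching kernels for them
            for (int c = 0; c < 8; c++) {
                if (node.children[c] >= 0 && top < 128) {
                    stack[top++] = node.children[c];
                }
            }
        }
    }
}
```

One thread handles one query here; batching many queries across threads is what makes this formulation fast in practice.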
Decision Framework: When to Use Dynamic Parallelism
Use dynamic parallelism when ALL of the following are true:
- The work decomposition is data-dependent — you cannot determine the number or size of sub-tasks from the host
- The recursion depth is bounded and shallow (ideally 2-4 levels)
- Each child kernel does substantial work (thousands of threads, microseconds of execution)
- The alternative requires many host-GPU round-trips that dominate total execution time
- Code clarity matters — DP can make recursive algorithms dramatically simpler to implement
Do NOT use dynamic parallelism when:
- The work decomposition is known at host launch time (use streams/multi-kernel)
- Child kernels are small (launch overhead dominates)
- You need maximum throughput (persistent threads or cooperative groups are faster)
- The recursion depth is unbounded (will exhaust pending launch buffers)
DP Decision: Problem Size vs Launch Overhead Break-Even
(figure: break-even point in microseconds)
Summary
Dynamic Parallelism enables device-side kernel launches with kernel<<<grid, block>>>() syntax in device code. Child kernels execute asynchronously and can be synchronized with cudaDeviceSynchronize(). Memory visibility follows launch-as-fence semantics for global memory; shared memory and registers are not shared between parent and child. The primary cost is 20-50 microseconds per device-side launch, which means DP is only beneficial when child kernels are large enough to amortize the overhead and when the work decomposition is truly data-dependent. For uniform or host-predictable workloads, multi-stream host launches, CUDA Graphs, or cooperative groups with grid sync are faster alternatives. The sweet spot for DP is recursive tree traversals, adaptive mesh refinement, and irregular sparse computations where the decomposition cannot be determined without first inspecting the data on the GPU.