Part 20 of 32 in the series: CUDA Kernel Engineering

A CUDA kernel corrupts memory, but the error manifests three kernels later in an unrelated LayerNorm operation. The crash report blames LayerNorm. You spend four hours debugging LayerNorm before discovering the real culprit was an out-of-bounds write in an attention kernel. This delayed error reporting is CUDA’s default behavior: the GPU executes asynchronously, errors are reported at the next synchronization point, and the stack trace points to wherever that synchronization happened. compute-sanitizer solves this by instrumenting every memory access and reporting the exact line where the out-of-bounds write occurred. It adds 50-100x overhead, but it turns “random memory corruption” into “line 47 writes 1 byte past the end of array.”

All examples target NVIDIA Ampere (A100-80GB SXM, SM 8.0) with CUDA 12.x toolkit.

Error Checking: The Foundation

Every CUDA API call returns cudaError_t. Every kernel launch can fail. Check both:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Macro for API call checking
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error at %s:%d: %s (%s)\n",            \
                    __FILE__, __LINE__,                                     \
                    cudaGetErrorString(err), cudaGetErrorName(err));       \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Macro for kernel launch checking
#define CUDA_KERNEL_CHECK()                                                \
    do {                                                                    \
        cudaError_t err = cudaGetLastError();                             \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "Kernel launch error at %s:%d: %s (%s)\n",   \
                    __FILE__, __LINE__,                                     \
                    cudaGetErrorString(err), cudaGetErrorName(err));       \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
        err = cudaDeviceSynchronize();                                     \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "Kernel execution error at %s:%d: %s (%s)\n",\
                    __FILE__, __LINE__,                                     \
                    cudaGetErrorString(err), cudaGetErrorName(err));       \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage:
void example() {
    float* d_ptr;
    CUDA_CHECK(cudaMalloc(&d_ptr, 1024 * sizeof(float)));

    dim3 block(256);
    dim3 grid((1024 + block.x - 1) / block.x);
    my_kernel<<<grid, block>>>(d_ptr, 1024);
    CUDA_KERNEL_CHECK();

    CUDA_CHECK(cudaFree(d_ptr));
}
🚨 Asynchronous Error Reporting

Kernel errors are asynchronous. A kernel launch returns immediately; the error surfaces at the next synchronizing CUDA call (cudaDeviceSynchronize, cudaMemcpy, cudaStreamSynchronize, or any other call that synchronizes). Without explicit checks, a bug in kernel A may appear as an error from cudaMemcpy three lines later. Use CUDA_KERNEL_CHECK() after every kernel during debugging to pinpoint the failing kernel.

Common CUDA Error Codes

📊 Common CUDA Errors and Root Causes

Error Code | Name                           | Typical Cause
700        | cudaErrorIllegalAddress        | Out-of-bounds global/local memory access
701        | cudaErrorLaunchOutOfResources  | Too many registers or too much smem for launch config
702        | cudaErrorLaunchTimeout         | Kernel exceeded watchdog timer (display GPU)
710        | cudaErrorMisalignedAddress     | Misaligned memory access (e.g., unaligned float4)
719        | cudaErrorLaunchFailure         | Unspecified kernel failure (check sanitizer)
2          | cudaErrorMemoryAllocation      | cudaMalloc failed (out of GPU memory)
9          | cudaErrorInvalidConfiguration  | Invalid grid/block dimensions
400        | cudaErrorInvalidResourceHandle | Using a destroyed stream/event
Note: Error 700 (illegal address) is the most common kernel bug. Error 701 occurs when launch config exceeds SM resources. Both can be intermittent if they depend on thread scheduling.

Error 700: Illegal Address — Debugging Workflow

// Bug: out-of-bounds access
__global__ void buggy_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // BUG: missing bounds check
    data[idx] = data[idx] * 2.0f;  // Crashes if idx >= n
}

// Fix: add bounds check
__global__ void fixed_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;
    }
}

// Another common cause: using freed memory
void use_after_free_bug() {
    float* d_ptr;
    cudaMalloc(&d_ptr, 1024 * sizeof(float));
    cudaFree(d_ptr);

    // BUG: kernel uses freed memory
    my_kernel<<<4, 256>>>(d_ptr, 1024);  // Error 700
}

Error 701: Launch Out of Resources

// This launch configuration exceeds SM limits
__global__ void too_many_regs() {
    // Kernel uses 128 registers per thread
    float a[32];  // These may be in registers
    // ...
}

void launch_error_701() {
    // 1024 threads * 128 registers = 131072 > 65536 per SM
    // Cannot schedule even 1 block
    too_many_regs<<<1, 1024>>>();  // Error 701
    cudaDeviceSynchronize();

    // Fix: reduce block size or register usage
    too_many_regs<<<1, 256>>>();  // 256 * 128 = 32768 < 65536
}

compute-sanitizer: Memory Error Detection

compute-sanitizer (the replacement for cuda-memcheck from older toolkits) instruments kernel execution to detect memory errors at the exact faulting instruction:

# Basic memory check (detects out-of-bounds, misaligned access, use-after-free)
compute-sanitizer ./my_program

# Include host backtraces; compile with -lineinfo to map errors to source lines
compute-sanitizer --show-backtrace yes ./my_program

# Save report to file
compute-sanitizer --log-file report.txt ./my_program

# Select a tool explicitly (memcheck is the default)
compute-sanitizer --tool memcheck ./my_program

Example: Detecting Out-of-Bounds Access

#include <cuda_runtime.h>

__global__ void oob_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Intentional bug: no bounds check
    data[idx] = idx * 1.0f;
}

int main() {
    float* d_data;
    cudaMalloc(&d_data, 100 * sizeof(float));  // 100 elements

    // Launch with 256 threads — 156 threads go out of bounds
    oob_kernel<<<1, 256>>>(d_data, 100);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
$ compute-sanitizer ./oob_test
========= Invalid __global__ write of size 4 bytes
=========     at 0x0000000000000170 in oob_kernel(float*, int)
=========     by thread (100,0,0) in block (0,0,0)
=========     Address 0x7f1234567890 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x12345] in main at oob_test.cu:12
=========
========= ERROR SUMMARY: 156 errors

The output tells you exactly which thread, which instruction, and which address caused the error.

Detecting Misaligned Access

__global__ void misalign_kernel(float* data) {
    // BUG: cast unaligned address to float4*
    char* base = (char*)data;
    float4* bad_ptr = (float4*)(base + 3);  // 3-byte offset, not 16-byte aligned
    *bad_ptr = make_float4(1.0f, 2.0f, 3.0f, 4.0f);  // Misaligned write
}
$ compute-sanitizer ./misalign_test
========= Invalid __global__ write of size 16 bytes
=========     at 0x0000000000000120 in misalign_kernel(float*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f1234567893 is misaligned
=========     Expected alignment: 16 bytes

compute-sanitizer --tool racecheck: Detecting Race Conditions

The racecheck tool detects data races in shared memory:

compute-sanitizer --tool racecheck ./my_program
compute-sanitizer --tool racecheck --racecheck-report all ./my_program

Example: Missing __syncthreads()

__global__ void race_kernel(float* output, int n) {
    __shared__ float smem[256];

    int tid = threadIdx.x;

    // Write to shared memory
    smem[tid] = (float)tid;

    // BUG: missing __syncthreads() here
    // Some threads may read before others have written

    // Read from shared memory (depends on other threads' writes)
    if (tid > 0) {
        output[tid] = smem[tid] + smem[tid - 1];  // Race condition!
    }
}
$ compute-sanitizer --tool racecheck ./race_test
========= RACECHECK SUMMARY: 1 hazard displayed (1 error, 0 warnings)
=========
========= Race reported between Write access at 0x120 by thread (5,0,0)
========= and Read access at 0x140 by thread (6,0,0)
=========     in race_kernel(float*, int)
=========     Shared memory address: 0x14
=========     (WAR hazard)

Example: WAW (Write-After-Write) Race

__global__ void waw_race(float* output) {
    __shared__ float smem[32];
    int tid = threadIdx.x;

    // Multiple threads write to the same location
    smem[tid % 8] = (float)tid;  // Threads 0,8,16,24 all write to smem[0]

    __syncthreads();
    if (tid < 8) output[tid] = smem[tid];  // Which thread's value do we get?
}
⚠️ racecheck Performance Impact

The racecheck tool instruments every shared memory access, slowing execution by 10-100x. Use it for debugging, not profiling. Enable it for specific kernels using --kernel-name to reduce overhead.

compute-sanitizer --tool synccheck: Synchronization Errors

The synccheck tool detects invalid synchronization patterns:

compute-sanitizer --tool synccheck ./my_program

Example: Divergent __syncthreads()

__global__ void bad_sync(float* data, int n) {
    int tid = threadIdx.x;

    if (tid < 128) {
        // Only half the block reaches this syncthreads
        __syncthreads();  // BUG: not all threads in block participate
        data[tid] = 42.0f;
    }
}
$ compute-sanitizer --tool synccheck ./sync_test
========= Barrier error detected. Divergent __syncthreads()
=========     at 0x100 in bad_sync(float*, int)
=========     Divergent thread (128,0,0) in block (0,0,0) is not at barrier
=========     while thread (0,0,0) in block (0,0,0) is at barrier

Example: Missing __syncwarp() After Warp-Level Operations

__global__ void missing_syncwarp(float* data) {
    __shared__ float smem[256];
    int tid = threadIdx.x;
    int lane = tid & 31;

    smem[tid] = (float)tid;
    __syncthreads();

    // Warp-divergent read-modify-write without proper sync
    if (lane < 16) {
        smem[tid] += smem[tid + 16];
        // BUG: on Volta+, need __syncwarp() before next access
        // Threads in the same warp may not be in lockstep
    }
    __syncwarp();  // Fix: add this

    if (lane < 8) {
        smem[tid] += smem[tid + 8];
    }
    __syncwarp();
}

cuda-gdb: Interactive Debugging

cuda-gdb extends GDB for GPU debugging. Compile with -g (host debug info) and -G (device debug info):

# Compile with debug symbols
nvcc -g -G -arch=sm_80 -o debug_kernel kernel.cu

# Launch debugger
cuda-gdb ./debug_kernel

Basic cuda-gdb Commands

# Standard GDB commands work:
(cuda-gdb) break main
(cuda-gdb) run
(cuda-gdb) next
(cuda-gdb) print variable

# GPU-specific commands:
(cuda-gdb) info cuda threads          # List all GPU threads
(cuda-gdb) info cuda blocks           # List all GPU blocks
(cuda-gdb) info cuda kernels          # List active kernels
(cuda-gdb) info cuda lanes            # List lanes in current warp

# Switch to a specific GPU thread:
(cuda-gdb) cuda thread (0,0,0)        # Switch to threadIdx=(0,0,0)
(cuda-gdb) cuda block (2,0,0)         # Switch to blockIdx=(2,0,0)
(cuda-gdb) cuda kernel 0              # Switch to kernel 0

# Set breakpoints on device code:
(cuda-gdb) break my_kernel            # Break at kernel entry
(cuda-gdb) break kernel.cu:42         # Break at line 42

# Conditional breakpoints:
(cuda-gdb) break my_kernel if threadIdx.x == 5 && blockIdx.x == 0

# Inspect GPU state:
(cuda-gdb) print threadIdx
(cuda-gdb) print blockIdx
(cuda-gdb) print blockDim
(cuda-gdb) print gridDim

# Inspect shared memory (via the @shared storage specifier):
(cuda-gdb) print *(@shared float*)0x0@10   # Print 10 floats from shared memory
(cuda-gdb) x/10f (@shared float*)0x0       # Examine shared memory as floats

# Inspect registers:
(cuda-gdb) info registers             # All registers for current thread

Debugging Workflow Example

$ cuda-gdb ./buggy_app
(cuda-gdb) set cuda break_on_launch application
(cuda-gdb) run
# Stops at first kernel launch

(cuda-gdb) info cuda kernels
# Kernel 0: my_kernel<<<(128,1,1),(256,1,1)>>>

(cuda-gdb) break my_kernel
(cuda-gdb) continue
# Stops at kernel entry

(cuda-gdb) cuda block (0,0,0) thread (100,0,0)
# Switch to thread 100 in block 0

(cuda-gdb) next
(cuda-gdb) print idx
$1 = 100

(cuda-gdb) next
# Step through the kernel instruction by instruction
(cuda-gdb) print data[idx]
$2 = 3.14
ℹ️ -G Disables Optimizations

Compiling with -G disables all device code optimizations (no inlining, no register optimization, no instruction scheduling). The kernel will run 10-100x slower and may use more registers. Bugs that depend on optimization behavior (race conditions, register spilling) may not reproduce with -G. Use -lineinfo (without -G) for source-level profiling without disabling optimizations.

printf Debugging on the GPU

CUDA supports printf() from device code:

#include <cstdio>

__global__ void debug_with_printf(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx < n) {
        float val = data[idx];

        // Print only for specific threads to avoid output flood
        if (idx < 5 || idx == n - 1) {
            printf("Thread %d: val = %f\n", idx, val);
        }

        // Conditional debug print
        if (isnan(val) || isinf(val)) {
            printf("ERROR: Thread %d has NaN/Inf at data[%d]\n", idx, idx);
        }

        data[idx] = val * 2.0f;
    }
}
⚠️ GPU printf Buffer

GPU printf output is buffered (default 1 MB). If the buffer fills, output is silently dropped. Increase the buffer with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size). Output appears only after the kernel completes (or after cudaDeviceSynchronize()). Printf from every thread of a large kernel will overwhelm the buffer and your terminal — always gate prints to specific threads.

// Increase printf buffer
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64 * 1024 * 1024);  // 64 MB

// Gated printf: only print from thread 0 of block 0
__global__ void gated_printf(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    __shared__ float debug_val;
    if (threadIdx.x == 0) {
        debug_val = data[blockIdx.x * blockDim.x];
    }
    __syncthreads();

    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Block 0 first element: %f\n", debug_val);
    }
}

Detecting NaN and Inf

NaN propagation is a common bug in numerical kernels. Detect it early:

// NaN/Inf detection kernel — run after suspicious operations
__global__ void check_nan_inf(const float* data, int n, int* error_flag,
                               int* error_index) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        float val = data[idx];
        if (isnan(val) || isinf(val)) {
            // Atomic so only first error is recorded
            int old = atomicCAS(error_flag, 0, 1);
            if (old == 0) {
                *error_index = idx;
            }
        }
    }
}

// Wrapper function
bool check_tensor_valid(const float* d_data, int n) {
    int *d_flag, *d_index;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_index, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    int block = 256;
    int grid = (n + block - 1) / block;
    check_nan_inf<<<grid, block>>>(d_data, n, d_flag, d_index);

    int h_flag, h_index;
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_index, d_index, sizeof(int), cudaMemcpyDeviceToHost);

    if (h_flag) {
        fprintf(stderr, "NaN/Inf detected at index %d\n", h_index);
    }

    cudaFree(d_flag);
    cudaFree(d_index);
    return h_flag == 0;
}

Common Sources of NaN in CUDA Kernels

// 1. Division by zero
float inv = 1.0f / sum;  // NaN if sum == 0.0f
// Fix: add epsilon
float inv = 1.0f / (sum + 1e-8f);

// 2. sqrt of negative
float val = sqrtf(x);  // NaN if x < 0
// Fix:
float val = sqrtf(fmaxf(x, 0.0f));

// 3. log of zero or negative
float val = logf(x);  // -Inf if x == 0, NaN if x < 0
// Fix:
float val = logf(fmaxf(x, 1e-8f));

// 4. exp overflow
float val = expf(x);  // Inf if x > ~88.7 (FP32)
// Fix: clamp input
float val = expf(fminf(x, 88.0f));

// 5. Softmax overflow (use log-sum-exp trick)
// BAD:
float sum = 0.0f;
for (int i = 0; i < n; i++) sum += expf(data[i]);  // Overflow if max > 88
// GOOD:
float max_val = /* ... reduce to find max ... */;
float sum = 0.0f;
for (int i = 0; i < n; i++) sum += expf(data[i] - max_val);  // Stable

Debugging Race Conditions Systematically

Race conditions are the hardest GPU bugs to debug because they are non-deterministic. A systematic approach:

Step 1: Reproduce Deterministically

// Force deterministic execution by serializing blocks
// WARNING: This changes timing and may not reproduce all races
__global__ void debug_serialized(float* data, int n) {
    // Process one block at a time (debug only)
    if (blockIdx.x > 0) return;

    int tid = threadIdx.x;
    // ... normal kernel code ...
}

// Alternative: run with 1 block, 1 warp
debug_kernel<<<1, 32>>>(d_data, n);

Step 2: Use racecheck

compute-sanitizer --tool racecheck --racecheck-report all ./program 2>&1 | head -100

Step 3: Insert Assertions

#include <cassert>

__global__ void kernel_with_assertions(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Bounds assertion
    assert(idx >= 0);  // Always true, but documents intent
    if (idx >= n) return;

    // Value range assertion
    float val = data[idx];
    assert(!isnan(val));
    assert(val >= -1e6f && val <= 1e6f);

    // Index assertion (catch off-by-one)
    int neighbor = idx + 1;
    assert(neighbor < n);  // Will fire if idx == n-1

    data[idx] = val + data[neighbor];
}

Step 4: Verify with Known Inputs

// Generate known-good test case
void test_kernel_correctness() {
    int n = 1024;
    float *h_data = new float[n];
    float *h_expected = new float[n];

    // Initialize with simple pattern
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    // Compute expected output on CPU
    for (int i = 0; i < n; i++) {
        h_expected[i] = h_data[i] * 2.0f;  // Expected behavior
    }

    // Run GPU kernel
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    float *h_result = new float[n];
    cudaMemcpy(h_result, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare
    float max_diff = 0.0f;
    int errors = 0;
    for (int i = 0; i < n; i++) {
        float diff = fabsf(h_result[i] - h_expected[i]);
        if (diff > 1e-5f) {
            if (errors < 10) {
                printf("Mismatch at %d: expected %f, got %f (diff %f)\n",
                       i, h_expected[i], h_result[i], diff);
            }
            errors++;
        }
        if (diff > max_diff) max_diff = diff;
    }
    printf("Total errors: %d, max diff: %e\n", errors, max_diff);

    delete[] h_data;
    delete[] h_expected;
    delete[] h_result;
    cudaFree(d_data);
}

initcheck: Uninitialized Memory Detection

compute-sanitizer --tool initcheck ./my_program
__global__ void uninit_bug(float* output) {
    __shared__ float smem[256];

    int tid = threadIdx.x;

    // BUG: only some threads initialize shared memory
    if (tid < 128) {
        smem[tid] = (float)tid;
    }
    __syncthreads();

    // Threads 128-255 read smem entries that were never written
    output[tid] = smem[tid];  // Uninitialized read for tid >= 128
}
$ compute-sanitizer --tool initcheck ./uninit_test
========= Uninitialized __shared__ memory read of size 4 bytes
=========     at 0x160 in uninit_bug(float*)
=========     by thread (128,0,0) in block (0,0,0)
=========     Address 0x200

memcheck: Leak Checking and Other Options

# Detect device-side memory leaks (malloc from device code without free)
compute-sanitizer --tool memcheck --leak-check full ./program

# Return a nonzero exit code when errors are found (useful in CI)
compute-sanitizer --tool memcheck --error-exitcode 1 ./program

# Instrument only specific kernels to reduce overhead
compute-sanitizer --tool memcheck --kernel-name "my_kernel" ./program

Memory Leak Detection

__global__ void leaky_kernel() {
    // Device-side malloc without free
    float* ptr = (float*)malloc(1024 * sizeof(float));
    ptr[0] = 42.0f;
    // BUG: never freed
}
$ compute-sanitizer --tool memcheck --leak-check full ./leak_test
========= Leaked 4096 bytes at 0x7f0000001000
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Leaked in leaky_kernel() at leak_test.cu:3

Debugging Workflow: Complete Example

A buggy reduction kernel and the process of finding the bugs:

#include <cuda_runtime.h>
#include <cstdio>

// This kernel has multiple bugs — find them all
__global__ void buggy_reduce(const float* input, float* output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Bug 1: no bounds check on load
    sdata[tid] = input[idx];
    __syncthreads();

    // Bug 2: loop starts one step too high (blockDim.x instead of blockDim.x / 2)
    for (int s = blockDim.x; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];  // Bug 3: out-of-bounds read when s == blockDim.x
        }
        // Note: this __syncthreads() is correctly placed outside the divergent if
        __syncthreads();
    }

    // Bug 5: race condition — multiple threads write to output[blockIdx.x]
    output[blockIdx.x] = sdata[0];  // Every thread writes, not just thread 0
}

Debugging Steps

# Step 1: Run with compute-sanitizer
$ compute-sanitizer --show-backtrace yes ./buggy_reduce
========= Invalid __shared__ read of size 4 bytes
=========     at 0x180 in buggy_reduce
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x400 is out of bounds
# This catches Bug 3: sdata[tid + s] when s = 256, tid = 0 -> sdata[256] out of bounds

# Step 2: Run racecheck
$ compute-sanitizer --tool racecheck ./buggy_reduce
# Detects Bug 5: multiple threads writing to output[blockIdx.x]

# Step 3: Run with sanitizer after fixing obvious bugs
# Fix Bug 1: add bounds check
# Fix Bug 2: start loop at blockDim.x / 2
# Fix Bug 3: fixed by Bug 2 fix
# Fix Bug 5: gate output write with if (tid == 0)

The Fixed Kernel

__global__ void fixed_reduce(const float* input, float* output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Fix 1: bounds check
    sdata[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();

    // Fix 2: start at blockDim.x / 2
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Fix 5: only thread 0 writes
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

Debugging Tool Overhead

Tool / Build Flag    | Slowdown (x)
No tools (baseline)  | 1
-lineinfo            | 1.05
-G (device debug)    | 15
memcheck             | 30
initcheck            | 50
racecheck            | 80

Production Error Handling Pattern

For production code, use a lightweight error handler that logs errors without crashing:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <string>

class CudaErrorHandler {
public:
    static void check(cudaError_t err, const char* file, int line) {
        if (err != cudaSuccess) {
            fprintf(stderr, "[CUDA ERROR] %s (%d) at %s:%d\n",
                    cudaGetErrorString(err), (int)err, file, line);

            // In production: log, maybe retry, maybe gracefully degrade
            // For debugging: abort
            #ifdef DEBUG
            abort();
            #endif
        }
    }

    static void checkKernel(const char* kernel_name,
                             const char* file, int line) {
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "[CUDA LAUNCH ERROR] %s: %s at %s:%d\n",
                    kernel_name, cudaGetErrorString(err), file, line);
        }

        #ifdef DEBUG
        // Synchronize after every kernel in debug builds
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "[CUDA EXEC ERROR] %s: %s at %s:%d\n",
                    kernel_name, cudaGetErrorString(err), file, line);
            abort();
        }
        #endif
    }

    // Distinguish recoverable errors from sticky ones: sticky errors
    // (e.g. cudaErrorIllegalAddress) corrupt the context and persist
    // until cudaDeviceReset() or process exit.
    static bool hasStickyError() {
        cudaError_t err = cudaPeekAtLastError();
        return err != cudaSuccess;
    }

    static void clearError() {
        cudaGetLastError();  // Clears recoverable (non-sticky) errors only
    }
};

#define CUDA_CHECK(call) CudaErrorHandler::check((call), __FILE__, __LINE__)
#define CUDA_KERNEL_CHECK(name) CudaErrorHandler::checkKernel(name, __FILE__, __LINE__)
💡 Sticky Errors

CUDA distinguishes recoverable errors from “sticky” ones. Recoverable errors (for example, a failed cudaMalloc) are cleared by the next cudaGetLastError() call. Sticky errors, such as cudaErrorIllegalAddress, corrupt the CUDA context and cannot be cleared: every subsequent CUDA call in the process returns the same error until the context is destroyed (cudaDeviceReset()) or the process exits. Check cudaPeekAtLastError() before attempting recovery, and do not assume a kernel fault can simply be cleared away.

Summary

CUDA debugging requires a multi-tool approach. Always wrap CUDA API calls with error checking macros, and use cudaDeviceSynchronize() + cudaGetLastError() after kernel launches during development. compute-sanitizer --tool memcheck catches out-of-bounds access, use-after-free, and misaligned access with exact thread and instruction identification. compute-sanitizer --tool racecheck detects shared memory data races from missing __syncthreads() barriers. cuda-gdb provides interactive stepping through device code. printf debugging on the GPU is viable for small-scale inspection (gate it to specific threads). For numerical bugs, insert NaN/Inf checks after each kernel in the pipeline. The overhead of debugging tools (10-100x slowdown) means they should be used during development and testing, not production. Compile with -lineinfo (not -G) for production builds so Nsight Compute can map profiling results to source lines without disabling optimizations.