A CUDA kernel corrupts memory, but the error manifests three kernels later in an unrelated LayerNorm operation. The crash report blames LayerNorm. You spend four hours debugging LayerNorm before discovering the real culprit was an out-of-bounds write in an attention kernel. This delayed error reporting is CUDA’s default behavior: the GPU executes asynchronously, errors are reported at the next synchronization point, and the stack trace points to wherever that synchronization happened. compute-sanitizer solves this by instrumenting every memory access and reporting the exact line where the out-of-bounds write occurred. It adds 50-100x overhead, but it turns “random memory corruption” into “line 47 writes 1 byte past the end of array.”
All examples target NVIDIA Ampere (A100-80GB SXM, SM 8.0) with CUDA 12.x toolkit.
Error Checking: The Foundation
Every CUDA API call returns cudaError_t. Every kernel launch can fail. Check both:
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
// Macro for API call checking
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error at %s:%d: %s (%s)\n",            \
                    __FILE__, __LINE__,                                  \
                    cudaGetErrorString(err), cudaGetErrorName(err));     \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)
// Macro for kernel launch checking
#define CUDA_KERNEL_CHECK()                                              \
    do {                                                                 \
        cudaError_t err = cudaGetLastError();                            \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "Kernel launch error at %s:%d: %s (%s)\n",   \
                    __FILE__, __LINE__,                                  \
                    cudaGetErrorString(err), cudaGetErrorName(err));     \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
        err = cudaDeviceSynchronize();                                   \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "Kernel execution error at %s:%d: %s (%s)\n",\
                    __FILE__, __LINE__,                                  \
                    cudaGetErrorString(err), cudaGetErrorName(err));     \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)
// Usage:
__global__ void my_kernel(float* data, int n); // defined elsewhere

void example() {
    float* d_ptr;
    CUDA_CHECK(cudaMalloc(&d_ptr, 1024 * sizeof(float)));
    dim3 grid(4), block(256); // 4 * 256 = 1024 threads
    my_kernel<<<grid, block>>>(d_ptr, 1024);
    CUDA_KERNEL_CHECK();
    CUDA_CHECK(cudaFree(d_ptr));
}
Kernel errors are asynchronous. A kernel launch returns immediately; the error surfaces at the next synchronizing CUDA call (cudaDeviceSynchronize, cudaMemcpy, cudaStreamSynchronize, or any other call that synchronizes). Without explicit checks, a bug in kernel A may appear as an error from cudaMemcpy three lines later. Use CUDA_KERNEL_CHECK() after every kernel during debugging to pinpoint the failing kernel.
Common CUDA Errors and Root Causes
| Error Code | Name | Typical Cause |
|---|---|---|
| 700 | cudaErrorIllegalAddress | Out-of-bounds global/local memory access |
| 701 | cudaErrorLaunchOutOfResources | Too many registers or too much smem for launch config |
| 702 | cudaErrorLaunchTimeout | Kernel exceeded watchdog timer (display GPU) |
| 716 | cudaErrorMisalignedAddress | Misaligned memory access (e.g., unaligned float4) |
| 719 | cudaErrorLaunchFailure | Unspecified kernel failure (check sanitizer) |
| 2 | cudaErrorMemoryAllocation | cudaMalloc failed (out of GPU memory) |
| 9 | cudaErrorInvalidConfiguration | Invalid grid/block dimensions |
| 400 | cudaErrorInvalidResourceHandle | Using destroyed stream/event |
Error 700: Illegal Address — Debugging Workflow
// Bug: out-of-bounds access
__global__ void buggy_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // BUG: missing bounds check
    data[idx] = data[idx] * 2.0f; // Crashes if idx >= n
}

// Fix: add bounds check
__global__ void fixed_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f;
    }
}

// Another common cause: using freed memory
void use_after_free_bug() {
    float* d_ptr;
    cudaMalloc(&d_ptr, 1024 * sizeof(float));
    cudaFree(d_ptr);
    // BUG: kernel uses freed memory
    my_kernel<<<4, 256>>>(d_ptr, 1024); // Error 700
}
Error 701: Launch Out of Resources
// This launch configuration exceeds SM limits
__global__ void too_many_regs() {
    // Assume this kernel compiles to 128 registers per thread
    float a[32]; // These may be kept in registers
    // ...
}

void launch_error_701() {
    // 1024 threads * 128 registers = 131072 > 65536 registers per block (SM 8.0)
    // The block cannot be scheduled at all
    too_many_regs<<<1, 1024>>>(); // Error 701, reported by cudaGetLastError()
    cudaDeviceSynchronize();
    // Fix: reduce block size or register usage (e.g. via __launch_bounds__)
    too_many_regs<<<1, 256>>>(); // 256 * 128 = 32768 <= 65536
}
compute-sanitizer: Memory Error Detection
compute-sanitizer (replacing cuda-memcheck from older toolkits) instruments kernel execution to detect memory errors at the exact instruction:
# Basic memory check (detects out-of-bounds, misaligned access, use-after-free)
compute-sanitizer ./my_program
# With more detail (source line numbers)
compute-sanitizer --show-backtrace=yes ./my_program
# Save report to file
compute-sanitizer --log-file report.txt ./my_program
# Explicitly select a tool (memcheck is the default)
compute-sanitizer --tool memcheck ./my_program
Example: Detecting Out-of-Bounds Access
#include <cuda_runtime.h>
__global__ void oob_kernel(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Intentional bug: no bounds check
    data[idx] = idx * 1.0f;
}

int main() {
    float* d_data;
    cudaMalloc(&d_data, 100 * sizeof(float)); // 100 elements
    // Launch with 256 threads — 156 threads go out of bounds
    oob_kernel<<<1, 256>>>(d_data, 100);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
$ compute-sanitizer ./oob_test
========= Invalid __global__ write of size 4 bytes
========= at 0x0000000000000170 in oob_kernel(float*, int)
========= by thread (100,0,0) in block (0,0,0)
========= Address 0x7f1234567890 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x12345] in main at oob_test.cu:12
=========
========= ERROR SUMMARY: 156 errors
The output tells you exactly which thread, which instruction, and which address caused the error.
Detecting Misaligned Access
__global__ void misalign_kernel(float* data) {
    // BUG: cast unaligned address to float4*
    char* base = (char*)data;
    float4* bad_ptr = (float4*)(base + 3); // 3-byte offset, not 16-byte aligned
    *bad_ptr = make_float4(1.0f, 2.0f, 3.0f, 4.0f); // Misaligned write
}
$ compute-sanitizer ./misalign_test
========= Invalid __global__ write of size 16 bytes
========= at 0x0000000000000120 in misalign_kernel(float*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7f1234567893 is misaligned
========= Expected alignment: 16 bytes
compute-sanitizer --tool racecheck: Detecting Race Conditions
The racecheck tool detects data races in shared memory:
compute-sanitizer --tool racecheck ./my_program
compute-sanitizer --tool racecheck --racecheck-report all ./my_program
Example: Missing __syncthreads()
__global__ void race_kernel(float* output, int n) {
    __shared__ float smem[256];
    int tid = threadIdx.x;
    // Write to shared memory
    smem[tid] = (float)tid;
    // BUG: missing __syncthreads() here
    // Some threads may read before others have written
    // Read from shared memory (depends on other threads' writes)
    if (tid > 0) {
        output[tid] = smem[tid] + smem[tid - 1]; // Race condition!
    }
}
$ compute-sanitizer --tool racecheck ./race_test
========= RACECHECK SUMMARY: 1 hazard displayed (1 write, 0 reads)
=========
========= Race reported between Write access at 0x120 by thread (5,0,0)
========= and Read access at 0x140 by thread (6,0,0)
========= in race_kernel(float*, int)
========= Shared memory address: 0x14
========= (RAW hazard)
Example: WAW (Write-After-Write) Race
__global__ void waw_race(float* output) {
    __shared__ float smem[32];
    int tid = threadIdx.x;
    // Multiple threads write to the same location
    smem[tid % 8] = (float)tid; // Threads 0,8,16,24 all write to smem[0]
    __syncthreads();
    if (tid < 8) output[tid] = smem[tid]; // Which thread's value do we get?
}
The racecheck tool instruments every shared memory access, slowing execution by 10-100x. Use it for debugging, not profiling. Enable it for specific kernels using --kernel-name to reduce overhead.
compute-sanitizer --tool synccheck: Synchronization Errors
The synccheck tool detects invalid synchronization patterns:
compute-sanitizer --tool synccheck ./my_program
Example: Divergent __syncthreads()
__global__ void bad_sync(float* data, int n) {
    int tid = threadIdx.x;
    if (tid < 128) {
        // Only half the block reaches this syncthreads
        __syncthreads(); // BUG: not all threads in block participate
        data[tid] = 42.0f;
    }
}
$ compute-sanitizer --tool synccheck ./sync_test
========= Barrier error detected. Divergent __syncthreads()
========= at 0x100 in bad_sync(float*, int)
========= Divergent thread (128,0,0) in block (0,0,0) is not at barrier
========= while thread (0,0,0) in block (0,0,0) is at barrier
Example: Missing __syncwarp() After Warp-Level Operations
__global__ void missing_syncwarp(float* data) {
    __shared__ float smem[256];
    int tid = threadIdx.x;
    int lane = tid & 31;
    smem[tid] = (float)tid;
    __syncthreads();
    // Warp-divergent read-modify-write without proper sync
    if (lane < 16) {
        smem[tid] += smem[tid + 16];
        // BUG: on Volta+, need __syncwarp() before next access
        // Threads in the same warp may not be in lockstep
    }
    __syncwarp(); // Fix: add this
    if (lane < 8) {
        smem[tid] += smem[tid + 8];
    }
    __syncwarp();
}
cuda-gdb: Interactive Debugging
cuda-gdb extends GDB for GPU debugging. Compile with -g -G (host and device debug info); use -lineinfo instead of -G when you only need source correlation without disabling device optimizations:
# Compile with debug symbols
nvcc -g -G -arch=sm_80 -o debug_kernel kernel.cu
# Launch debugger
cuda-gdb ./debug_kernel
Basic cuda-gdb Commands
# Standard GDB commands work:
(cuda-gdb) break main
(cuda-gdb) run
(cuda-gdb) next
(cuda-gdb) print variable
# GPU-specific commands:
(cuda-gdb) info cuda threads # List all GPU threads
(cuda-gdb) info cuda blocks # List all GPU blocks
(cuda-gdb) info cuda kernels # List active kernels
(cuda-gdb) info cuda lanes # List lanes in current warp
# Switch to a specific GPU thread:
(cuda-gdb) cuda thread (0,0,0) # Switch to threadIdx=(0,0,0)
(cuda-gdb) cuda block (2,0,0) # Switch to blockIdx=(2,0,0)
(cuda-gdb) cuda kernel 0 # Switch to kernel 0
# Set breakpoints on device code:
(cuda-gdb) break my_kernel # Break at kernel entry
(cuda-gdb) break kernel.cu:42 # Break at line 42
# Conditional breakpoints:
(cuda-gdb) break my_kernel if threadIdx.x == 5 && blockIdx.x == 0
# Inspect GPU state:
(cuda-gdb) print threadIdx
(cuda-gdb) print blockIdx
(cuda-gdb) print blockDim
(cuda-gdb) print gridDim
# Inspect shared memory (address spaces via storage specifiers):
(cuda-gdb) print *(@shared float*)0x0     # First float in shared memory
(cuda-gdb) x/10f (@shared float*)0x0      # Examine 10 floats of shared memory
# Inspect registers:
(cuda-gdb) info registers # All registers for current thread
Debugging Workflow Example
$ cuda-gdb ./buggy_app
(cuda-gdb) set cuda break_on_launch application
(cuda-gdb) run
# Stops at first kernel launch
(cuda-gdb) info cuda kernels
# Kernel 0: my_kernel<<<(128,1,1),(256,1,1)>>>
(cuda-gdb) break my_kernel
(cuda-gdb) continue
# Stops at kernel entry
(cuda-gdb) cuda thread (100,0,0) block (0,0,0)
# Switch to thread 100 in block 0
(cuda-gdb) next
(cuda-gdb) print idx
$1 = 100
(cuda-gdb) next
# Step through the kernel instruction by instruction
(cuda-gdb) print data[idx]
$2 = 3.14
Compiling with -G disables all device code optimizations (no inlining, no register optimization, no instruction scheduling). The kernel will run 10-100x slower and may use more registers. Bugs that depend on optimization behavior (race conditions, register spilling) may not reproduce with -G. Use -lineinfo (without -G) for source-level profiling without disabling optimizations.
printf Debugging on the GPU
CUDA supports printf() from device code:
#include <cstdio>
__global__ void debug_with_printf(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        float val = data[idx];
        // Print only for specific threads to avoid output flood
        if (idx < 5 || idx == n - 1) {
            printf("Thread %d: val = %f\n", idx, val);
        }
        // Conditional debug print
        if (isnan(val) || isinf(val)) {
            printf("ERROR: Thread %d has NaN/Inf at data[%d]\n", idx, idx);
        }
        data[idx] = val * 2.0f;
    }
}
GPU printf output is buffered (default 1 MB). If the buffer fills, output is silently dropped. Increase the buffer with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size). Output appears only after the kernel completes (or after cudaDeviceSynchronize()). Printf from every thread of a large kernel will overwhelm the buffer and your terminal — always gate prints to specific threads.
// Increase printf buffer
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64 * 1024 * 1024); // 64 MB
// Gated printf: only print from thread 0 of block 0
__global__ void gated_printf(float* data, int n) {
    __shared__ float debug_val;
    if (threadIdx.x == 0) {
        debug_val = data[blockIdx.x * blockDim.x];
    }
    __syncthreads();
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Block 0 first element: %f\n", debug_val);
    }
}
Detecting NaN and Inf
NaN propagation is a common bug in numerical kernels. Detect it early:
// NaN/Inf detection kernel — run after suspicious operations
__global__ void check_nan_inf(const float* data, int n, int* error_flag,
                              int* error_index) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        float val = data[idx];
        if (isnan(val) || isinf(val)) {
            // Atomic so only first error is recorded
            int old = atomicCAS(error_flag, 0, 1);
            if (old == 0) {
                *error_index = idx;
            }
        }
    }
}

// Wrapper function
bool check_tensor_valid(const float* d_data, int n) {
    int *d_flag, *d_index;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_index, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));
    int block = 256;
    int grid = (n + block - 1) / block;
    check_nan_inf<<<grid, block>>>(d_data, n, d_flag, d_index);
    int h_flag, h_index;
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_index, d_index, sizeof(int), cudaMemcpyDeviceToHost);
    if (h_flag) {
        fprintf(stderr, "NaN/Inf detected at index %d\n", h_index);
    }
    cudaFree(d_flag);
    cudaFree(d_index);
    return h_flag == 0;
}
Common Sources of NaN in CUDA Kernels
// 1. Division by zero
float inv = 1.0f / sum; // NaN if sum == 0.0f
// Fix: add epsilon
float inv = 1.0f / (sum + 1e-8f);
// 2. sqrt of negative
float val = sqrtf(x); // NaN if x < 0
// Fix:
float val = sqrtf(fmaxf(x, 0.0f));
// 3. log of zero or negative
float val = logf(x); // -Inf if x == 0, NaN if x < 0
// Fix:
float val = logf(fmaxf(x, 1e-8f));
// 4. exp overflow
float val = expf(x); // Inf if x > ~88.7 (FP32)
// Fix: clamp input
float val = expf(fminf(x, 88.0f));
// 5. Softmax overflow (use log-sum-exp trick)
// BAD:
float sum = 0.0f;
for (int i = 0; i < n; i++) sum += expf(data[i]); // Overflow if max > 88
// GOOD:
float max_val = /* ... reduce to find max ... */;
float sum = 0.0f;
for (int i = 0; i < n; i++) sum += expf(data[i] - max_val); // Stable
Debugging Race Conditions Systematically
Race conditions are the hardest GPU bugs to debug because they are non-deterministic. A systematic approach:
Step 1: Reproduce Deterministically
// Force deterministic execution by restricting work to one block
// WARNING: This changes timing and may not reproduce all races
__global__ void debug_serialized(float* data, int n) {
    // Run only block 0 (debug only)
    if (blockIdx.x > 0) return;
    int tid = threadIdx.x;
    // ... normal kernel code ...
}

// Alternative: run with 1 block, 1 warp
debug_kernel<<<1, 32>>>(d_data, n);
Step 2: Use racecheck
compute-sanitizer --tool racecheck --racecheck-report all ./program 2>&1 | head -100
Step 3: Insert Assertions
#include <cassert>

__global__ void kernel_with_assertions(float* data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Bounds assertion
    assert(idx >= 0); // Always true, but documents intent
    if (idx >= n) return;
    // Value range assertion
    float val = data[idx];
    assert(!isnan(val));
    assert(val >= -1e6f && val <= 1e6f);
    // Index assertion (catch off-by-one)
    int neighbor = idx + 1;
    assert(neighbor < n); // Will fire if idx == n-1
    data[idx] = val + data[neighbor];
}
Step 4: Verify with Known Inputs
// Generate known-good test case
void test_kernel_correctness() {
    int n = 1024;
    float *h_data = new float[n];
    float *h_expected = new float[n];
    // Initialize with simple pattern
    for (int i = 0; i < n; i++) h_data[i] = (float)i;
    // Compute expected output on CPU
    for (int i = 0; i < n; i++) {
        h_expected[i] = h_data[i] * 2.0f; // Expected behavior
    }
    // Run GPU kernel
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    float *h_result = new float[n];
    cudaMemcpy(h_result, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    // Compare
    float max_diff = 0.0f;
    int errors = 0;
    for (int i = 0; i < n; i++) {
        float diff = fabsf(h_result[i] - h_expected[i]);
        if (diff > 1e-5f) {
            if (errors < 10) {
                printf("Mismatch at %d: expected %f, got %f (diff %f)\n",
                       i, h_expected[i], h_result[i], diff);
            }
            errors++;
        }
        if (diff > max_diff) max_diff = diff;
    }
    printf("Total errors: %d, max diff: %e\n", errors, max_diff);
    delete[] h_data;
    delete[] h_expected;
    delete[] h_result;
    cudaFree(d_data);
}
initcheck: Uninitialized Memory Detection
compute-sanitizer --tool initcheck ./my_program
The initcheck tool tracks device global memory only; uninitialized __shared__ reads are not reported. The canonical bug is reading memory from cudaMalloc that was never written:
__global__ void uninit_bug(const float* input, float* output) {
    int tid = threadIdx.x;
    // Reads global memory that was allocated but never initialized
    output[tid] = input[tid] * 2.0f;
}

int main() {
    float *d_in, *d_out;
    cudaMalloc(&d_in, 256 * sizeof(float));
    cudaMalloc(&d_out, 256 * sizeof(float));
    // BUG: d_in is never written (no cudaMemcpy/cudaMemset) before use
    uninit_bug<<<1, 256>>>(d_in, d_out);
    cudaDeviceSynchronize();
    return 0;
}
$ compute-sanitizer --tool initcheck ./uninit_test
========= Uninitialized __global__ memory read of size 4 bytes
========= at 0x70 in uninit_bug(const float*, float*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7f0000000000
compute-sanitizer memcheck: Leak Checking and Exit Codes
# Full memcheck instrumentation with leak checking
# (--leak-check full also reports device-side heap leaks from device malloc)
compute-sanitizer --tool memcheck --leak-check full ./program
# Return a nonzero exit code when errors are found (useful in CI)
compute-sanitizer --tool memcheck --error-exitcode 1 ./program
# Run only specific kernels
compute-sanitizer --tool memcheck --kernel-name "my_kernel" ./program
Memory Leak Detection
__global__ void leaky_kernel() {
    // Device-side malloc without free
    float* ptr = (float*)malloc(1024 * sizeof(float));
    ptr[0] = 42.0f;
    // BUG: never freed
}
$ compute-sanitizer --tool memcheck --leak-check full ./leak_test
========= Leaked 4096 bytes at 0x7f0000001000
========= Saved host backtrace up to driver entry point at kernel launch time
========= Leaked in leaky_kernel() at leak_test.cu:3
Debugging Workflow: Complete Example
A buggy reduction kernel and the process of finding the bugs:
#include <cuda_runtime.h>
#include <cstdio>
// This kernel has multiple bugs — find them all
__global__ void buggy_reduce(const float* input, float* output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Bug 1: no bounds check on load
    sdata[tid] = input[idx];
    __syncthreads();
    // Bug 2: reduction loop starts at blockDim.x instead of blockDim.x / 2
    for (int s = blockDim.x; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s]; // Bug 3: out of bounds when s == blockDim.x
        }
        // Not a bug: this __syncthreads() is correctly outside the divergent if
        __syncthreads();
    }
    // Bug 5: every thread writes output[blockIdx.x]; only thread 0 should
    output[blockIdx.x] = sdata[0];
}
Debugging Steps
# Step 1: Run with compute-sanitizer
$ compute-sanitizer --show-backtrace yes ./buggy_reduce
========= Invalid __shared__ read of size 4 bytes
========= at 0x180 in buggy_reduce
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x400 is out of bounds
# This catches Bug 3: sdata[tid + s] with s = 256, tid = 0 -> sdata[256] out of bounds
# Step 2: Run racecheck for shared memory hazards
$ compute-sanitizer --tool racecheck ./buggy_reduce
# Racecheck monitors shared memory only; the __syncthreads() placement is
# correct, so Bug 5 (every thread writing the same value to
# output[blockIdx.x]) is not flagged and must be found by code review.
# Step 3: Fix and re-run the sanitizer until clean
# Fix Bug 1: add bounds check on the load
# Fix Bug 2: start the loop at blockDim.x / 2
# Fix Bug 3: fixed by the Bug 2 fix
# Fix Bug 5: gate the output write with if (tid == 0)
The Fixed Kernel
__global__ void fixed_reduce(const float* input, float* output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Fix 1: bounds check
    sdata[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();
    // Fix 2: start at blockDim.x / 2
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // Fix 5: only thread 0 writes
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
Debugging Tool Overhead
(slowdown factor (x))Production Error Handling Pattern
For production code, use a lightweight error handler that logs errors without crashing:
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <string>
class CudaErrorHandler {
public:
    static void check(cudaError_t err, const char* file, int line) {
        if (err != cudaSuccess) {
            fprintf(stderr, "[CUDA ERROR] %s (%d) at %s:%d\n",
                    cudaGetErrorString(err), (int)err, file, line);
            // In production: log, maybe retry, maybe gracefully degrade
            // For debugging: abort
#ifdef DEBUG
            abort();
#endif
        }
    }

    static void checkKernel(const char* kernel_name,
                            const char* file, int line) {
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "[CUDA LAUNCH ERROR] %s: %s at %s:%d\n",
                    kernel_name, cudaGetErrorString(err), file, line);
        }
#ifdef DEBUG
        // Synchronize after every kernel in debug builds
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "[CUDA EXEC ERROR] %s: %s at %s:%d\n",
                    kernel_name, cudaGetErrorString(err), file, line);
            abort();
        }
#endif
    }

    // Check for a pending error without clearing it. After a sticky error
    // (e.g. cudaErrorIllegalAddress), this returns true on every call.
    static bool hasStickyError() {
        cudaError_t err = cudaPeekAtLastError();
        return err != cudaSuccess;
    }

    // Clears non-sticky errors; sticky errors cannot be cleared.
    static void clearError() {
        cudaGetLastError();
    }
};

#define CUDA_CHECK(call) CudaErrorHandler::check((call), __FILE__, __LINE__)
#define CUDA_KERNEL_CHECK(name) CudaErrorHandler::checkKernel(name, __FILE__, __LINE__)
Some CUDA errors are "sticky." Non-sticky errors (such as a failed cudaMalloc) are cleared by the next cudaGetLastError() call, and the application can continue. Sticky errors (such as cudaErrorIllegalAddress from a faulting kernel) mean the CUDA context is corrupted: every subsequent CUDA call in the process returns the same error, and cudaGetLastError() cannot clear it. Recovering from a sticky error requires restarting the process (or isolating CUDA work in a child process that can be relaunched), so distinguish the two cases before attempting recovery.
Summary
CUDA debugging requires a multi-tool approach. Always wrap CUDA API calls with error checking macros, and use cudaDeviceSynchronize() + cudaGetLastError() after kernel launches during development. compute-sanitizer --tool memcheck catches out-of-bounds access, use-after-free, and misaligned access with exact thread and instruction identification. compute-sanitizer --tool racecheck detects shared memory data races from missing __syncthreads() barriers. cuda-gdb provides interactive stepping through device code. printf debugging on the GPU is viable for small-scale inspection (gate it to specific threads). For numerical bugs, insert NaN/Inf checks after each kernel in the pipeline. The overhead of debugging tools (10-100x slowdown) means they should be used during development and testing, not production. Compile with -lineinfo (not -G) for production builds so Nsight Compute can map profiling results to source lines without disabling optimizations.