nvidia-smi tells you GPU utilization. It doesn’t tell you why utilization dropped from 95% to 60%, or where those mysterious latency spikes come from. eBPF gives us the tools to trace the entire inference path from Python down to kernel driver calls.

The eBPF Advantage for GPU Profiling

Traditional GPU profiling tools (Nsight, nvprof) instrument the GPU side. But inference latency often stems from:

  • CUDA runtime overhead: Driver calls, context switches
  • CPU-GPU synchronization: cudaStreamSynchronize blocking
  • Memory allocation: cudaMalloc during inference
  • Python GIL contention: Threading issues in the inference server
  • System scheduling: CPU scheduler decisions affecting GPU kernel launch

eBPF traces all of these on the CPU side with minimal overhead (under 5%).

Tracing CUDA Runtime Calls

The CUDA runtime (libcudart.so) and driver (libcuda.so) libraries expose user-space functions we can attach uprobes to. The paths below assume a default /usr/local/cuda install; adjust for your system:

// bpftrace script: cuda_latency.bt
#!/usr/bin/env bpftrace

// Trace cudaLaunchKernel latency
uprobe:/usr/local/cuda/lib64/libcudart.so:cudaLaunchKernel
{
    @launch_start[tid] = nsecs;
}

uretprobe:/usr/local/cuda/lib64/libcudart.so:cudaLaunchKernel
/@launch_start[tid]/
{
    $latency = nsecs - @launch_start[tid];
    @cuda_launch_latency_us = hist($latency / 1000);
    delete(@launch_start[tid]);
}

// Trace cudaStreamSynchronize blocking time
uprobe:/usr/local/cuda/lib64/libcudart.so:cudaStreamSynchronize
{
    @sync_start[tid] = nsecs;
}

uretprobe:/usr/local/cuda/lib64/libcudart.so:cudaStreamSynchronize
/@sync_start[tid]/
{
    $latency = nsecs - @sync_start[tid];
    @stream_sync_latency_us = hist($latency / 1000);
    
    // Alert on unexpectedly long syncs
    if ($latency > 10000000) {  // > 10ms
        printf("WARN: cudaStreamSynchronize blocked %d ms (tid=%d)\n", 
               $latency / 1000000, tid);
    }
    delete(@sync_start[tid]);
}

// Trace cudaMalloc during inference (should be rare)
uprobe:/usr/local/cuda/lib64/libcudart.so:cudaMalloc
{
    @malloc_count++;
    @malloc_stack[ustack] = count();
}

END
{
    printf("\n=== CUDA Launch Latency (µs) ===\n");
    print(@cuda_launch_latency_us);
    
    printf("\n=== Stream Sync Latency (µs) ===\n");
    print(@stream_sync_latency_us);
    
    printf("\n=== cudaMalloc calls: %d ===\n", @malloc_count);
    printf("Allocation stacks:\n");
    print(@malloc_stack);
}

Typical Results

Healthy inference shows cudaLaunchKernel at 2-5µs and rare cudaStreamSynchronize calls. If you see cudaMalloc during inference, you have a memory pool misconfiguration.
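Why pooling keeps cudaMalloc off the hot path: the framework's caching allocator hands out cached buffers by size class and only falls back to the real allocator on a cold start. A minimal pure-Python sketch of the idea (all names here are hypothetical; real allocators such as PyTorch's caching allocator are far more sophisticated):

```python
# Sketch of a size-bucketed caching allocator: freed buffers are cached
# per size class, so steady-state "allocations" are list pops and never
# hit the expensive backing allocator (cudaMalloc in the CUDA case).
from collections import defaultdict

class BucketPool:
    def __init__(self):
        self.free = defaultdict(list)   # size class -> cached buffers
        self.backing_allocs = 0         # counts real (slow) allocations

    def _size_class(self, size):
        # Round up to the next power of two to limit fragmentation
        return 1 << (size - 1).bit_length()

    def alloc(self, size):
        cls = self._size_class(size)
        if self.free[cls]:
            return self.free[cls].pop()   # fast path: reuse cached buffer
        self.backing_allocs += 1          # slow path: "cudaMalloc"
        return bytearray(cls)

    def release(self, buf):
        self.free[len(buf)].append(buf)   # cache instead of freeing

pool = BucketPool()
for _ in range(1000):                     # steady-state inference loop
    buf = pool.alloc(3000)
    pool.release(buf)
print("backing allocations:", pool.backing_allocs)  # 1, not 1000
```

If the `cudaMalloc` trace fires during steady-state inference, the pool is effectively missing on every request, which is the misconfiguration referred to above.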

Tracing GPU Driver IOCTLs

The NVIDIA kernel driver communicates via ioctls. Tracing these reveals driver-level behavior:

// bpftrace script: nvidia_driver.bt
#!/usr/bin/env bpftrace

#include <linux/fs.h>

// Trace ioctls issued by the inference process. Filtering by process
// name is an approximation: matching only /dev/nvidia* files would
// require resolving each fd to a path
tracepoint:syscalls:sys_enter_ioctl
/comm == "python" || comm == "python3"/
{
    @ioctl_start[tid] = nsecs;
    @ioctl_cmd[tid] = args->cmd;
}

tracepoint:syscalls:sys_exit_ioctl
/@ioctl_start[tid]/
{
    $latency = nsecs - @ioctl_start[tid];
    $cmd = @ioctl_cmd[tid];
    
    // Categorize by ioctl command
    @ioctl_latency[$cmd] = hist($latency / 1000);
    
    // Track high-latency ioctls
    if ($latency > 1000000) {  // > 1ms
        printf("Slow IOCTL: cmd=0x%x latency=%d µs\n", 
               $cmd, $latency / 1000);
    }
    
    delete(@ioctl_start[tid]);
    delete(@ioctl_cmd[tid]);
}
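The raw cmd numbers in the slow-IOCTL output can be decoded offline. On Linux, an ioctl command packs direction, size, type, and number into 32 bits (the `_IOC` layout from asm-generic/ioctl.h); a small helper makes the output readable:

```python
# Decode a Linux ioctl command number into its _IOC fields
# (asm-generic/ioctl.h layout: nr = 8 bits, type = 8, size = 14, dir = 2).
def decode_ioctl(cmd):
    nr = cmd & 0xFF
    typ = (cmd >> 8) & 0xFF
    size = (cmd >> 16) & 0x3FFF
    dirs = {0: "none", 1: "write", 2: "read", 3: "read/write"}
    direction = dirs[(cmd >> 30) & 0x3]
    return {"nr": nr, "type": typ, "size": size, "dir": direction}

# Example: TCGETS (0x5401) -- type 0x54 ('T'), nr 1, no data transfer
print(decode_ioctl(0x5401))
```

NVIDIA's ioctl numbers are driver-private, so the `type`/`nr` values won't map to public names, but grouping by decoded fields still shows which command class dominates latency.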

Diagnosing GPU Memory Pressure

When GPU memory is fragmented, cudaMalloc can take milliseconds:

// bpftrace script: cuda_memory_pressure.bt
#!/usr/bin/env bpftrace

uprobe:/usr/local/cuda/lib64/libcudart.so:cudaMalloc
{
    @alloc_start[tid] = nsecs;
    @alloc_size[tid] = arg1;  // Size parameter
}

uretprobe:/usr/local/cuda/lib64/libcudart.so:cudaMalloc
/@alloc_start[tid]/
{
    $latency = nsecs - @alloc_start[tid];
    $size = @alloc_size[tid];
    
    @malloc_latency_by_size[$size / 1048576] = hist($latency / 1000);
    
    // Flag slow allocations
    if ($latency > 100000) {  // > 100µs
        printf("SLOW cudaMalloc: %d MB took %d µs\n",
               $size / 1048576, $latency / 1000);
        print(ustack());
    }
    
    delete(@alloc_start[tid]);
    delete(@alloc_size[tid]);
}

uprobe:/usr/local/cuda/lib64/libcudart.so:cudaFree
{
    @free_count++;
}

interval:s:10
{
    printf("cudaFree calls in last 10s: %d\n", @free_count);
    @free_count = 0;
}

cudaMalloc Latency by Allocation Size

Size          P50 Latency   P99 Latency   Note
< 1 MB        3 µs          15 µs         Pooled
1-10 MB       8 µs          45 µs         Pooled
10-100 MB     25 µs         180 µs        May fragment
100-500 MB    120 µs        2.5 ms        Likely compaction
> 500 MB      450 µs        15 ms         Compaction + defrag
Note: A100-80GB, measured during vLLM inference with dynamic batching

Tracing Python-CUDA Interaction

Python inference servers often have GIL contention issues. Trace the interaction:

// bpftrace script: python_cuda_interaction.bt
#!/usr/bin/env bpftrace

// Track when Python threads acquire/release the GIL
// (adjust the libpython path and version to match your system)
uprobe:/usr/lib/libpython3.10.so:PyGILState_Ensure
{
    @gil_acquire_start[tid] = nsecs;
}

uretprobe:/usr/lib/libpython3.10.so:PyGILState_Ensure
/@gil_acquire_start[tid]/
{
    $latency = nsecs - @gil_acquire_start[tid];
    @gil_acquire_latency = hist($latency / 1000);
    
    if ($latency > 1000000) {
        printf("GIL contention: tid=%d waited %d µs\n", 
               tid, $latency / 1000);
    }
    delete(@gil_acquire_start[tid]);
}

// Correlate GIL holding time with CUDA calls. Record the holder on
// return from PyGILState_Ensure (the GIL is only held at that point)
// and clear it on release
uretprobe:/usr/lib/libpython3.10.so:PyGILState_Ensure
{
    @gil_holder = tid;
    @gil_acquired = nsecs;
}

uprobe:/usr/lib/libpython3.10.so:PyGILState_Release
{
    @gil_holder = 0;
}

uprobe:/usr/local/cuda/lib64/libcudart.so:cudaLaunchKernel
/@gil_holder == tid/
{
    // CUDA call while holding GIL - measure
    @cuda_with_gil++;
}

uprobe:/usr/local/cuda/lib64/libcudart.so:cudaStreamSynchronize
/@gil_holder == tid/
{
    // Blocking CUDA call while holding GIL - BAD
    @sync_with_gil++;
    printf("WARNING: cudaStreamSynchronize with GIL held (tid=%d)\n", tid);
}

END
{
    printf("CUDA calls with GIL: %d\n", @cuda_with_gil);
    printf("Sync calls with GIL (BAD): %d\n", @sync_with_gil);
}
🚨 GIL + cudaStreamSynchronize

Calling cudaStreamSynchronize while holding the Python GIL blocks all Python threads during GPU computation. This is a common source of inference latency spikes.
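The effect is easy to reproduce in pure Python. The sketch below is illustrative only (pure Python can only approximate "holding" the GIL, since CPython preempts threads every switch interval, so we lengthen that interval to exaggerate the effect): it compares how much work a background thread gets done while the foreground blocks in a GIL-holding busy loop versus a GIL-releasing sleep.

```python
import sys
import threading
import time

def progress_while(blocker, duration=0.2):
    """Count background-thread iterations while `blocker` runs in the
    foreground; more iterations => the foreground released the GIL."""
    count = 0
    stop = threading.Event()

    def worker():
        nonlocal count
        while not stop.is_set():
            count += 1

    t = threading.Thread(target=worker)
    t.start()
    blocker(duration)
    stop.set()
    t.join()
    return count

# Exaggerate GIL hold times so the contrast is visible even though
# CPython normally preempts every 5 ms
sys.setswitchinterval(0.1)

def busy(d):      # models a sync path that spins without releasing the GIL
    end = time.monotonic() + d
    while time.monotonic() < end:
        pass

def sleeping(d):  # models a blocking call that releases the GIL
    time.sleep(d)

held = progress_while(busy)
released = progress_while(sleeping)
print(f"background iterations: GIL held={held}, GIL released={released}")
sys.setswitchinterval(0.005)  # restore the default
```

A well-behaved C extension releases the GIL around blocking CUDA calls for exactly this reason; the bpftrace counters above tell you whether yours does.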

Tracing CPU Scheduling Effects

CPU scheduler decisions affect kernel launch latency:

// bpftrace script: scheduler_impact.bt
#!/usr/bin/env bpftrace

#include <linux/sched.h>

// Track when inference threads get scheduled off-CPU
tracepoint:sched:sched_switch
/comm == "python" || comm == "cuda-EvtHandlr"/
{
    @offcpu_start[tid] = nsecs;
    @offcpu_reason[tid] = args->prev_state;
}

tracepoint:sched:sched_switch
/@offcpu_start[args->next_pid]/
{
    $offcpu_time = nsecs - @offcpu_start[args->next_pid];
    
    @offcpu_latency = hist($offcpu_time / 1000);
    
    if ($offcpu_time > 1000000) {  // > 1ms
        printf("Thread %d off-CPU for %d µs (reason=%d)\n",
               args->next_pid, $offcpu_time / 1000,
               @offcpu_reason[args->next_pid]);
    }
    
    delete(@offcpu_start[args->next_pid]);
    delete(@offcpu_reason[args->next_pid]);
}

// Track CPU migrations (bad for cache)
tracepoint:sched:sched_migrate_task
/comm == "python"/
{
    @migrations++;
    printf("Migration: pid=%d from CPU%d to CPU%d\n",
           args->pid, args->orig_cpu, args->dest_cpu);
}
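A common mitigation for the migrations flagged above is pinning inference threads to a fixed CPU set. A minimal stdlib sketch (Linux-only; `taskset(1)` or your server's own pinning options achieve the same thing):

```python
import os

# Save the current affinity so it can be restored (Linux-only API)
original = os.sched_getaffinity(0)

# Pin this process to a single CPU from its allowed set; the scheduler
# can then no longer migrate it, keeping caches warm for kernel launches
os.sched_setaffinity(0, {min(original)})
print("pinned to CPUs:", os.sched_getaffinity(0))

os.sched_setaffinity(0, original)  # restore for the rest of the process
```

In production you would pin each worker to its own core (ideally on the NUMA node closest to the GPU) at startup rather than toggling affinity like this.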

Complete Inference Request Tracing

Combining all traces to understand end-to-end latency:

// bpftrace script: inference_request.bt
#!/usr/bin/env bpftrace

// Requires application instrumentation via USDT probes. From Python,
// probes like these can be registered at runtime with libstapsdt
// bindings; the provider/probe names must match what the server exposes

usdt:inference_server:inference_start
{
    @req_start[arg0] = nsecs;  // arg0 = request_id
    @tid_req[tid] = arg0;      // map worker thread -> in-flight request
}

// Count CUDA kernel launches attributable to an in-flight request
uprobe:/usr/local/cuda/lib64/libcudart.so:cudaLaunchKernel
/@tid_req[tid]/
{
    @kernel_launches[@tid_req[tid]]++;
}

usdt:inference_server:inference_end
/@req_start[arg0]/
{
    $req_id = arg0;
    $total_latency = nsecs - @req_start[$req_id];
    
    @inference_latency_ms = hist($total_latency / 1000000);
    
    printf("Request %d: %d ms, %d kernel launches\n",
           $req_id, $total_latency / 1000000,
           @kernel_launches[$req_id]);
    
    delete(@req_start[$req_id]);
    delete(@kernel_launches[$req_id]);
    delete(@tid_req[tid]);
}

Practical Debugging Session

Here’s a real debugging session for a latency spike issue:

# Step 1: Identify the spike pattern
$ sudo bpftrace -e '
  uprobe:/usr/local/cuda/lib64/libcudart.so:cudaStreamSynchronize {
    @start[tid] = nsecs;
  }
  uretprobe:/usr/local/cuda/lib64/libcudart.so:cudaStreamSynchronize
  /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000000;
    if ($lat > 50) {
      printf("%s: %d ms sync\n", strftime("%H:%M:%S", nsecs), $lat);
    }
    delete(@start[tid]);
  }
' -p $(pgrep -f vllm)

# Output shows periodic 200ms spikes every ~30 seconds
# 14:32:15: 203 ms sync
# 14:32:45: 198 ms sync

# Step 2: Correlate with memory operations
$ sudo bpftrace -e '
  uprobe:/usr/local/cuda/lib64/libcudart.so:cudaMalloc {
    @alloc_time = nsecs;
  }
  uretprobe:/usr/local/cuda/lib64/libcudart.so:cudaMalloc
  /@alloc_time/ {
    printf("cudaMalloc: %d µs\n", (nsecs - @alloc_time) / 1000);
  }
' -p $(pgrep -f vllm)

# Shows cudaMalloc calls right before spikes
# cudaMalloc: 156432 µs  <- 156ms allocation!

# Step 3: Get allocation stack trace
$ sudo bpftrace -e '
  uprobe:/usr/local/cuda/lib64/libcudart.so:cudaMalloc {
    @start = nsecs;
  }
  uretprobe:/usr/local/cuda/lib64/libcudart.so:cudaMalloc
  /@start/ {
    if ((nsecs - @start) > 100000000) {  // > 100 ms
      print(ustack(perf));
    }
  }
'
# Stack trace reveals: KV cache expansion during batch size increase

Latency Before/After Fix

Metric   Before    After
P50      45 ms     42 ms
P99      203 ms    58 ms

Root cause: Dynamic batch expansion was triggering KV cache reallocation. Fix: Pre-allocate KV cache for maximum batch size.
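Pre-allocating is straightforward once you can size the cache. The worked example below shows the arithmetic with hypothetical 7B-class model parameters (the layer/head counts are illustrative assumptions, not measurements from this incident):

```python
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim
#                 x max_seq_len x max_batch x bytes_per_element
def kv_cache_bytes(layers, kv_heads, head_dim, max_seq, max_batch,
                   dtype_bytes=2):  # fp16
    return (2 * layers * kv_heads * head_dim
            * max_seq * max_batch * dtype_bytes)

# Hypothetical 7B-class configuration
total = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                       max_seq=4096, max_batch=16)
print(f"KV cache for max batch: {total / 2**30:.0f} GiB")  # 32 GiB
```

Reserving this once at startup trades memory headroom for the 150+ ms reallocation stalls seen in the trace.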

eBPF Overhead Measurement

# Measure tracing overhead
$ sudo bpftrace -e 'interval:s:10 {
                      printf("Probe rate: %d probes/s\n", @count / 10);
                      @count = 0;
                    }
                    uprobe:/usr/local/cuda/lib64/libcudart.so:* {
                      @count++;
                    }'

# Typical output: 50,000-200,000 probes/second
# CPU overhead: 2-5% of one core
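To turn a probe rate into a CPU cost, multiply by an assumed per-fire cost. A uprobe fire is on the order of a microsecond, but the constant varies by kernel and hardware, so treat it as an assumption to calibrate, not a fact:

```python
# Estimated tracing overhead = probe rate x per-probe cost
def overhead_pct(probes_per_sec, cost_us=1.0):
    return probes_per_sec * cost_us * 1e-6 * 100  # % of one core

print(f"{overhead_pct(50_000):.0f}% of one core at 50k probes/s")
```

Note that staying within a few percent at the higher rates quoted above implies a sub-microsecond per-fire cost; measure on your own hardware before trusting either constant.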

Conclusion

eBPF transforms GPU debugging from “GPU utilization is low” to “cudaStreamSynchronize blocked for 200ms because cudaMalloc was defragmenting after KV cache expansion.” This precision is essential for optimizing production inference systems.

Key traces to run regularly:

  1. cudaMalloc during inference - Should be zero
  2. cudaStreamSynchronize latency - Should be under 50ms P99
  3. GIL contention - Should be under 1ms P99
  4. Off-CPU time for inference threads - Should be under 100µs P99