⚡ Performance Engineering

Deep Technical Explorations in Systems Optimization

From bare-metal microcontroller registers to large-scale LLM inference pipelines. We obsess over every microsecond, every cache miss, every memory access pattern.

perf.sh
$ perf stat -e cycles,instructions,cache-misses ./inference
# Performance counter stats:
42,891,234,567 cycles
98,234,567,890 instructions # 2.29 IPC
1,234,567 cache-misses # 0.001%

✓ 3.2x throughput improvement

Recent Posts

gpu programming expert

Atomics and Advanced Reductions: Global Atomics, Warp Reductions, and Multi-Block Coordination

Complete treatment of atomic operations and reduction patterns in CUDA. Covers atomicAdd contention and throughput, warp-level reductions using __shfl_xor_sync, block-level reductions with shared memory, multi-block reductions with global memory coordination, and full implementations of a histogram kernel and parallel prefix sum.

38 min read
#cuda #atomics #reduction +5
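As a taste of the warp-reduction material, here is a minimal CPU model (plain C++, not device code) of the XOR-butterfly pattern that `__shfl_xor_sync`-based reductions implement; the function name `warp_reduce_model` is illustrative, not from the post.

```cpp
#include <vector>

// CPU model of an XOR-butterfly all-reduce across a 32-lane "warp":
// at each step, lane i adds the value held by lane (i ^ offset).
// After log2(width) steps, every lane holds the full sum -- the same
// exchange pattern a __shfl_xor_sync reduction performs in hardware.
int warp_reduce_model(std::vector<int> lanes) {
    const int width = static_cast<int>(lanes.size());  // 32 for a full warp
    for (int offset = width / 2; offset > 0; offset /= 2) {
        std::vector<int> next(width);
        for (int i = 0; i < width; ++i)
            next[i] = lanes[i] + lanes[i ^ offset];  // read "partner" lane's value
        lanes = next;
    }
    return lanes[0];  // all lanes now agree on the total
}
```

Feeding it lanes holding 0..31 returns 496, the full sum, with no shared memory and no atomics involved at the warp level.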

Cooperative Groups: Sub-Warp Tiles, Block Synchronization, and Grid-Level Cooperation

Complete technical guide to CUDA Cooperative Groups — thread_block, thread_block_tile, coalesced_group, grid_group, and multi_grid_group. Covers tiled partitions for warp-level programming, cooperative kernel launches, grid-wide synchronization, and benchmarked implementations of reductions and scans using cooperative groups.

35 min read
#cuda #cooperative-groups #synchronization +4
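The tiled-partition idea can be modeled on the CPU: a sketch (plain C++, `tile_reduce_model` is an illustrative name) of what `cg::tiled_partition<4>` does to a 32-lane group — each statically-sized tile reduces its own lanes independently.

```cpp
#include <numeric>
#include <vector>

// CPU model of tiled_partition<4>: a 32-lane group splits into eight
// tiles of 4 lanes, and each tile reduces (here, sums) only its own
// lanes -- mirroring per-tile shuffle reductions in cooperative groups.
std::vector<int> tile_reduce_model(const std::vector<int>& lanes, int tile_size) {
    std::vector<int> sums;
    for (std::size_t base = 0; base < lanes.size(); base += tile_size)
        sums.push_back(std::accumulate(lanes.begin() + base,
                                       lanes.begin() + base + tile_size, 0));
    return sums;  // one partial sum per tile
}
```

With lanes 0..31 and a tile size of 4, the eight tile sums are 6, 22, 38, ..., 118 — each tile sees only its own quarter-warp, which is exactly the isolation a tile partition guarantees.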

CUDA Graphs: Capture, Replay, Memory Management, and Dynamic Shape Handling

Complete treatment of CUDA Graphs for LLM inference optimization. Covers graph capture mechanics, replay for single-launch kernel sequences, memory pre-allocation constraints, handling dynamic shapes via multi-graph caching, vLLM's graph capture strategy for different batch sizes, and implementation of a forward pass captured as a CUDA graph.

36 min read
#cuda-graphs #cuda #performance +5
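The multi-graph caching strategy can be sketched without a GPU: a toy C++ model in which `CapturedGraph` stands in for a `cudaGraphExec_t` and a cache keyed by padded batch size decides whether to "capture" a new graph or replay an existing one. All names here are illustrative, not from the post or the CUDA API.

```cpp
#include <map>

// Stand-in for a captured, instantiated CUDA graph (cudaGraphExec_t).
struct CapturedGraph { int batch; };

// Toy model of vLLM-style multi-graph caching: one graph per padded
// batch size, so dynamic batch sizes map onto a small, fixed set of
// captured graphs instead of triggering a capture per request.
struct GraphCache {
    std::map<int, CapturedGraph> graphs;
    int capture_count = 0;  // how often we paid for a fresh capture

    static int pad(int batch) {  // round up to the next power of two
        int p = 1;
        while (p < batch) p <<= 1;
        return p;
    }

    CapturedGraph& get(int batch) {
        int key = pad(batch);
        auto it = graphs.find(key);
        if (it == graphs.end()) {  // cache miss: "capture" once for this bucket
            ++capture_count;
            it = graphs.emplace(key, CapturedGraph{key}).first;
        }
        return it->second;  // cache hit: replay the already-captured graph
    }
};
```

Requesting batch sizes 3 and 4 both land in the padded-size-4 bucket, so only one capture is paid; a batch of 5 spills into a new size-8 bucket. The memory cost is the pre-allocated buffers each bucket's graph pins.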

CUTLASS GEMM Templates: Writing High-Performance Matrix Multiply with NVIDIA's Template Library

Comprehensive guide to implementing high-performance GEMM kernels using NVIDIA CUTLASS. Covers the CUTLASS template hierarchy (Gemm, threadblock, warp, instruction tiles), tile scheduling strategies, epilogue functors, mixed-precision GEMM configuration, the relationship between CUTLASS templates and hardware tensor core instructions, and performance tuning methodology with Nsight Compute profiling.

38 min read
#cuda #cutlass #gemm +5
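The tile hierarchy at the heart of CUTLASS can be previewed with a plain C++ loop nest: the output is carved into threadblock-sized tiles and the K dimension is walked in slices, accumulating partial products per tile. Tile sizes here (BM=BN=BK=2) are illustrative, nothing like real CUTLASS configurations, and `tiled_gemm` is not a CUTLASS API.

```cpp
#include <vector>

// CPU sketch of the CUTLASS tiling idea: C is carved into BM x BN
// "threadblock" tiles, and K is walked in BK-wide slices (the mainloop),
// accumulating each tile's partial products. On a GPU, each tile level
// maps to a threadblock / warp / tensor-core instruction.
constexpr int BM = 2, BN = 2, BK = 2;

// Row-major A (M x K) * B (K x N) accumulated into C (M x N); C must be
// zero-initialized by the caller.
void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int M, int N, int K) {
    for (int bm = 0; bm < M; bm += BM)           // threadblock tile rows
        for (int bn = 0; bn < N; bn += BN)       // threadblock tile cols
            for (int bk = 0; bk < K; bk += BK)   // K-slice mainloop
                for (int i = bm; i < bm + BM && i < M; ++i)
                    for (int j = bn; j < bn + BN && j < N; ++j)
                        for (int k = bk; k < bk + BK && k < K; ++k)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

The reordering buys nothing on a CPU at this size; the point is the loop structure, which is exactly what CUTLASS's template parameters (threadblock, warp, and instruction tile shapes) parameterize.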

Kernel Fusion Patterns: Elementwise, Reduction, GEMM Epilogue, and Attention Fusion

A systematic treatment of CUDA kernel fusion patterns for LLM inference. Covers why fusion eliminates HBM round-trips, elementwise fusion (bias+activation+dropout in one kernel), reduction fusion (LayerNorm as a single kernel), GEMM epilogue fusion (bias+activation after matmul), attention fusion (FlashAttention), and a complete implementation of a fused bias+GELU kernel.

37 min read
#cuda #kernel-fusion #elementwise +5
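The fused bias+GELU math is simple enough to show here as a scalar C++ sketch (the tanh approximation of GELU; `bias_gelu` is an illustrative name, and the fused CUDA kernel in the post applies this per element in a single pass).

```cpp
#include <cmath>

// Fused bias + GELU (tanh approximation) as one elementwise operation:
// fusing means each element is read from HBM once and written once,
// instead of a bias kernel and a GELU kernel each doing a round-trip.
float bias_gelu(float x, float b) {
    float v = x + b;
    // sqrt(2/pi) = 0.7978845608...
    float inner = 0.7978845608f * (v + 0.044715f * v * v * v);
    return 0.5f * v * (1.0f + std::tanh(inner));
}
```

`bias_gelu(0, 0)` is 0, and for large positive inputs the result approaches the input itself (e.g. `bias_gelu(3, 2)` is within 1e-2 of 5), matching GELU's asymptotic behavior.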

Stay Updated

Get notified when new deep-dive technical articles are published.

Subscribe via RSS