⚡ Performance Engineering

Deep Technical Explorations in Systems Optimization

From bare-metal microcontroller registers to large-scale LLM inference pipelines. We obsess over every microsecond, every cache miss, every memory access pattern.

perf.sh
$ perf stat -e cycles,instructions,cache-misses ./inference
# Performance counter stats:
42,891,234,567 cycles
98,234,567,890 instructions # 2.29 IPC
1,234,567 cache-misses # 0.001%

✓ 3.2x throughput improvement

Recent Posts

gpu programming expert

Atomics and Advanced Reductions: Global Atomics, Warp Reductions, and Multi-Block Coordination

Complete treatment of atomic operations and reduction patterns in CUDA. Covers atomicAdd contention and throughput, warp-level reductions using __shfl_xor_sync, block-level reductions with shared memory, multi-block reductions with global memory coordination, and full implementations of a histogram kernel and parallel prefix sum.

38 min read
#cuda #atomics #reduction +5
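As a taste of the warp-reduction material, here is a minimal CPU model (plain C++, not device code) of the XOR-butterfly pattern that `__shfl_xor_sync`-based reductions implement; the function name `warp_reduce_model` is illustrative, not from the post.

```cpp
#include <vector>

// CPU model of an XOR-butterfly all-reduce across a 32-lane "warp":
// at each step, lane i adds the value held by lane (i ^ offset).
// After log2(width) steps, every lane holds the full sum -- the same
// exchange pattern a __shfl_xor_sync reduction performs in hardware.
int warp_reduce_model(std::vector<int> lanes) {
    const int width = static_cast<int>(lanes.size());  // 32 for a full warp
    for (int offset = width / 2; offset > 0; offset /= 2) {
        std::vector<int> next(width);
        for (int i = 0; i < width; ++i)
            next[i] = lanes[i] + lanes[i ^ offset];  // read "partner" lane's value
        lanes = next;
    }
    return lanes[0];  // all lanes now agree on the total
}
```

Feeding it lanes holding 0..31 returns 496, the full sum, with no shared memory and no atomics involved at the warp level.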

Cooperative Groups: Sub-Warp Tiles, Block Synchronization, and Grid-Level Cooperation

Complete technical guide to CUDA Cooperative Groups — thread_block, thread_block_tile, coalesced_group, grid_group, and multi_grid_group. Covers tiled partitions for warp-level programming, cooperative kernel launches, grid-wide synchronization, and benchmarked implementations of reductions and scans using cooperative groups.

35 min read
#cuda #cooperative-groups #synchronization +4
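The tiled-partition idea can be modeled on the CPU: a sketch (plain C++, `tile_reduce_model` is an illustrative name) of what `cg::tiled_partition<4>` does to a 32-lane group — each statically-sized tile reduces its own lanes independently.

```cpp
#include <numeric>
#include <vector>

// CPU model of tiled_partition<4>: a 32-lane group splits into eight
// tiles of 4 lanes, and each tile reduces (here, sums) only its own
// lanes -- mirroring per-tile shuffle reductions in cooperative groups.
std::vector<int> tile_reduce_model(const std::vector<int>& lanes, int tile_size) {
    std::vector<int> sums;
    for (std::size_t base = 0; base < lanes.size(); base += tile_size)
        sums.push_back(std::accumulate(lanes.begin() + base,
                                       lanes.begin() + base + tile_size, 0));
    return sums;  // one partial sum per tile
}
```

With lanes 0..31 and a tile size of 4, the eight tile sums are 6, 22, 38, ..., 118 — each tile sees only its own quarter-warp, which is exactly the isolation a tile partition guarantees.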

CUDA Graphs: Capture, Replay, Memory Management, and Dynamic Shape Handling

Complete treatment of CUDA Graphs for LLM inference optimization. Covers graph capture mechanics, replay for single-launch kernel sequences, memory pre-allocation constraints, handling dynamic shapes via multi-graph caching, vLLM's graph capture strategy for different batch sizes, and implementation of a forward pass captured as a CUDA graph.

36 min read
#cuda-graphs #cuda #performance +5
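The multi-graph caching strategy can be sketched without a GPU: a toy C++ model in which `CapturedGraph` stands in for a `cudaGraphExec_t` and a cache keyed by padded batch size decides whether to "capture" a new graph or replay an existing one. All names here are illustrative, not from the post or the CUDA API.

```cpp
#include <map>

// Stand-in for a captured, instantiated CUDA graph (cudaGraphExec_t).
struct CapturedGraph { int batch; };

// Toy model of vLLM-style multi-graph caching: one graph per padded
// batch size, so dynamic batch sizes map onto a small, fixed set of
// captured graphs instead of triggering a capture per request.
struct GraphCache {
    std::map<int, CapturedGraph> graphs;
    int capture_count = 0;  // how often we paid for a fresh capture

    static int pad(int batch) {  // round up to the next power of two
        int p = 1;
        while (p < batch) p <<= 1;
        return p;
    }

    CapturedGraph& get(int batch) {
        int key = pad(batch);
        auto it = graphs.find(key);
        if (it == graphs.end()) {  // cache miss: "capture" once for this bucket
            ++capture_count;
            it = graphs.emplace(key, CapturedGraph{key}).first;
        }
        return it->second;  // cache hit: replay the already-captured graph
    }
};
```

Requesting batch sizes 3 and 4 both land in the padded-size-4 bucket, so only one capture is paid; a batch of 5 spills into a new size-8 bucket. The memory cost is the pre-allocated buffers each bucket's graph pins.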

CUTLASS GEMM Templates: Writing High-Performance Matrix Multiply with NVIDIA's Template Library

Comprehensive guide to implementing high-performance GEMM kernels using NVIDIA CUTLASS. Covers the CUTLASS template hierarchy (Gemm, threadblock, warp, instruction tiles), tile scheduling strategies, epilogue functors, mixed-precision GEMM configuration, the relationship between CUTLASS templates and hardware tensor core instructions, and performance tuning methodology with Nsight Compute profiling.

38 min read
#cuda #cutlass #gemm +5
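The tile hierarchy at the heart of CUTLASS can be previewed with a plain C++ loop nest: the output is carved into threadblock-sized tiles and the K dimension is walked in slices, accumulating partial products per tile. Tile sizes here (BM=BN=BK=2) are illustrative, nothing like real CUTLASS configurations, and `tiled_gemm` is not a CUTLASS API.

```cpp
#include <vector>

// CPU sketch of the CUTLASS tiling idea: C is carved into BM x BN
// "threadblock" tiles, and K is walked in BK-wide slices (the mainloop),
// accumulating each tile's partial products. On a GPU, each tile level
// maps to a threadblock / warp / tensor-core instruction.
constexpr int BM = 2, BN = 2, BK = 2;

// Row-major A (M x K) * B (K x N) accumulated into C (M x N); C must be
// zero-initialized by the caller.
void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int M, int N, int K) {
    for (int bm = 0; bm < M; bm += BM)           // threadblock tile rows
        for (int bn = 0; bn < N; bn += BN)       // threadblock tile cols
            for (int bk = 0; bk < K; bk += BK)   // K-slice mainloop
                for (int i = bm; i < bm + BM && i < M; ++i)
                    for (int j = bn; j < bn + BN && j < N; ++j)
                        for (int k = bk; k < bk + BK && k < K; ++k)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```

The reordering buys nothing on a CPU at this size; the point is the loop structure, which is exactly what CUTLASS's template parameters (threadblock, warp, and instruction tile shapes) parameterize.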

Kernel Fusion Patterns: Elementwise, Reduction, GEMM Epilogue, and Attention Fusion

A systematic treatment of CUDA kernel fusion patterns for LLM inference. Covers why fusion eliminates HBM round-trips, elementwise fusion (bias+activation+dropout in one kernel), reduction fusion (LayerNorm as a single kernel), GEMM epilogue fusion (bias+activation after matmul), attention fusion (FlashAttention), and a complete implementation of a fused bias+GELU kernel.

37 min read
#cuda #kernel-fusion #elementwise +5
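The fused bias+GELU math is simple enough to show here as a scalar C++ sketch (the tanh approximation of GELU; `bias_gelu` is an illustrative name, and the fused CUDA kernel in the post applies this per element in a single pass).

```cpp
#include <cmath>

// Fused bias + GELU (tanh approximation) as one elementwise operation:
// fusing means each element is read from HBM once and written once,
// instead of a bias kernel and a GELU kernel each doing a round-trip.
float bias_gelu(float x, float b) {
    float v = x + b;
    // sqrt(2/pi) = 0.7978845608...
    float inner = 0.7978845608f * (v + 0.044715f * v * v * v);
    return 0.5f * v * (1.0f + std::tanh(inner));
}
```

`bias_gelu(0, 0)` is 0, and for large positive inputs the result approaches the input itself (e.g. `bias_gelu(3, 2)` is within 1e-2 of 5), matching GELU's asymptotic behavior.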

Stay Updated

Get notified when new deep-dive technical articles are published.

Subscribe via RSS