Showcase: Interactive Deep-Dives on Fridays with Faraday
A demonstration of the new high-performance technical blog features, including interactive GPU analysis and rigorous mathematical proofs.
From bare-metal microcontroller registers to large-scale LLM inference pipelines, we obsess over every microsecond, every cache miss, and every memory access pattern.
$ perf stat -e cycles,instructions,cache-misses ./inference
# Performance counter stats:
42,891,234,567 cycles
98,234,567,890 instructions # 2.29 IPC
1,234,567 cache-misses # 0.001%
✓ 3.2x throughput improvement
A deep technical analysis of PagedAttention's memory management, including KV cache fragmentation analysis, page table implementation details, and performance implications at the GPU memory hierarchy level.
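To make the page-table mechanics concrete before the full article, here is a minimal Python sketch of block-table address translation in the PagedAttention style. The 16-token block size mirrors vLLM's default; the class and function names are hypothetical, and a real implementation also tracks reference counts for copy-on-write block sharing.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical KV block (vLLM's default; an assumption here)

class BlockAllocator:
    """Free-list allocator over fixed-size physical KV cache blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

@dataclass
class Sequence:
    block_table: list = field(default_factory=list)  # logical block -> physical block
    length: int = 0

def append_token(seq: Sequence, allocator: BlockAllocator) -> None:
    # A new physical block is allocated only when the last one fills up, so
    # internal fragmentation is bounded by BLOCK_SIZE - 1 tokens per sequence.
    if seq.length % BLOCK_SIZE == 0:
        seq.block_table.append(allocator.alloc())
    seq.length += 1

def kv_location(seq: Sequence, token_idx: int) -> tuple[int, int]:
    """Translate a logical token position into (physical block, offset)."""
    return seq.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```

Because blocks are small and non-contiguous, sequences of wildly different lengths can share one physical pool, which is exactly the fragmentation story the article analyzes.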
A systematic investigation into ESP32 power domains, RTC memory retention, and peripheral leakage. Includes register-level configurations and oscilloscope measurements proving sub-10 µA deep-sleep current.
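As a high-level companion to the register-level walkthrough, here is a minimal MicroPython sketch of the same deep-sleep flow: persist state across sleep in RTC memory, arm an EXT0 wake pin, and power down. GPIO0 and the 60-second interval are arbitrary assumptions, and the actual sleep current depends on board-level leakage, which the article measures on the scope.

```python
import machine
import esp32

rtc = machine.RTC()

# RTC slow memory survives deep sleep; use it to carry a wake-up counter.
if machine.reset_cause() == machine.DEEPSLEEP_RESET:
    count = int(rtc.memory().decode() or "0") + 1
else:
    count = 0
rtc.memory(str(count).encode())

# Arm EXT0 wake on one RTC-capable GPIO held low (GPIO0 is an assumption;
# an external pull-up is advisable since normal pulls power down in sleep).
wake_pin = machine.Pin(0, machine.Pin.IN, machine.Pin.PULL_UP)
esp32.wake_on_ext0(pin=wake_pin, level=esp32.WAKEUP_ALL_LOW)

# Everything but the RTC domain powers down for 60 s (or until the pin fires).
machine.deepsleep(60000)
```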
Analyzing FlashAttention's tiling strategy from an HBM bandwidth perspective. Includes roofline analysis, SRAM utilization measurements, and comparison with standard attention implementations.
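The gap the roofline analysis quantifies can be previewed with the HBM-access complexity bounds from the FlashAttention paper: standard attention touches Θ(Nd + N²) elements of HBM, while the tiled kernel needs Θ(N²d²/M) for on-chip SRAM of size M. A quick Python estimate, with sequence length, head dimension, and SRAM size as assumed example values:

```python
N, d = 4096, 128       # sequence length, head dimension (assumptions)
M = 128 * 1024         # usable SRAM elements per SM (assumption; ~256 KB of fp16)

standard = N * d + N * N        # standard attention: score matrix round-trips HBM
flash = N * N * d * d // M      # FlashAttention bound for SRAM of size M

print(f"standard/flash HBM accesses: {standard / flash:.1f}x")  # ~8x for these sizes
```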
Using BPFtrace and custom eBPF programs to trace CUDA runtime behavior, understand GPU scheduling latencies, and diagnose inference performance issues that nvidia-smi can't reveal.
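In the same spirit, here is a minimal BCC (the Python eBPF frontend) sketch that histograms the latency of cudaLaunchKernel via a uprobe/uretprobe pair. The libcudart path is an assumption for your install, and note this measures the CPU-side launch call, not GPU execution time.

```python
from bcc import BPF
import time

PROG = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(launch_ns);

int on_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int on_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    launch_ns.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
    start.delete(&tid);
    return 0;
}
"""

CUDART = "/usr/local/cuda/lib64/libcudart.so"  # assumption: adjust for your system

b = BPF(text=PROG)
b.attach_uprobe(name=CUDART, sym="cudaLaunchKernel", fn_name="on_entry")
b.attach_uretprobe(name=CUDART, sym="cudaLaunchKernel", fn_name="on_return")

print("Tracing cudaLaunchKernel... Ctrl-C to print the latency histogram.")
try:
    time.sleep(99999)
except KeyboardInterrupt:
    b["launch_ns"].print_log2_hist("ns")
```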
Deep dive into Gaudi2's HBM architecture, SRAM hierarchy, and TPC memory access patterns. Practical optimization techniques for maximizing memory bandwidth utilization in transformer workloads.
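One number those optimization techniques keep coming back to is achieved versus peak HBM bandwidth. A back-of-the-envelope Python check for a single GEMM, using Gaudi2's publicly quoted ~2.45 TB/s HBM2E peak (treat that figure, the shape, and the kernel time as assumptions for illustration):

```python
M, K, N, bytes_el = 4096, 4096, 4096, 2           # GEMM shape, bf16 (assumptions)
kernel_time_s = 75e-6                             # measured kernel time (assumption)
peak_bw = 2.45e12                                 # Gaudi2 HBM2E peak, bytes/s

bytes_moved = (M * K + K * N + M * N) * bytes_el  # minimum traffic: read A, B; write C
achieved = bytes_moved / kernel_time_s
print(f"utilization: {100 * achieved / peak_bw:.1f}% of peak")
```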
A detailed analysis of continuous batching algorithms, including iteration-level scheduling, preemption strategies, and the interaction between scheduler and memory manager in production inference systems.
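The core loop is easier to see in a stripped-down sketch than in prose. This hypothetical Python scheduler admits waiting sequences at iteration granularity whenever the token budget allows, and retires finished ones between forward passes; real systems add preemption and coordinate every admission with the KV block allocator.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Seq:
    length: int          # tokens currently held in the KV cache
    done: bool = False

class IterationScheduler:
    """Iteration-level scheduling: batch membership can change every step."""
    def __init__(self, max_batch_tokens: int):
        self.waiting: deque = deque()
        self.running: list = []
        self.max_batch_tokens = max_batch_tokens

    def step(self) -> list:
        # Retire finished sequences, freeing their token budget immediately.
        self.running = [s for s in self.running if not s.done]
        budget = self.max_batch_tokens - sum(s.length for s in self.running)
        # Admit waiting sequences as long as their prompts fit the budget.
        while self.waiting and self.waiting[0].length <= budget:
            seq = self.waiting.popleft()
            budget -= seq.length
            self.running.append(seq)
        return self.running  # run one forward pass over exactly this batch
```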
Comprehensive analysis of FP8, INT8, and INT4 KV cache quantization techniques. Includes calibration strategies, accuracy measurements, and practical implementation guidance for production inference.
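As a reference point for what the calibration section is up against, here is a minimal per-head symmetric INT8 quantizer for a KV tensor in NumPy. The shapes and the absmax calibration are illustrative assumptions; the article compares this baseline against FP8 and INT4 with proper calibration sets and accuracy measurements.

```python
import numpy as np

def quantize_kv_int8(kv: np.ndarray):
    """Symmetric per-head INT8; kv has shape (tokens, heads, head_dim)."""
    scale = np.abs(kv).max(axis=(0, 2), keepdims=True) / 127.0  # absmax calibration
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(512, 8, 64).astype(np.float32)   # toy KV cache slice
q, scale = quantize_kv_int8(kv)
print(f"max abs error: {np.abs(dequantize(q, scale) - kv).max():.4f}")
print(f"memory: {kv.nbytes} B fp32 -> {q.nbytes} B int8")  # 4x smaller (2x vs fp16)
```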
Deep dive into CUDA graph capture, replay, and the specific challenges of applying graphs to dynamic LLM inference workloads. Includes capture strategies and performance measurements.
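PyTorch's torch.cuda.CUDAGraph exposes the capture/replay cycle directly, so a minimal sketch fits in a few lines. The Linear layer standing in for a decode step is an assumption; the static input/output buffers it requires are exactly the constraint that makes dynamic shapes in LLM inference awkward for graphs.

```python
import torch

# Stand-in for one decode step; requires a CUDA device.
model = torch.nn.Linear(4096, 4096).cuda()
static_input = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one step into a graph; allocations made here become fixed buffers.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static buffer, then launch the whole graph
# with a single call instead of per-kernel launches.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
g.replay()  # static_output now holds results for the new input
```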