Showcase: Interactive Deep-Dives on Fridays with Faraday
A demonstration of the new high-performance technical blog features, including interactive GPU analysis and rigorous mathematical proofs.
From bare-metal microcontroller registers to large-scale LLM inference pipelines, we obsess over every microsecond, every cache miss, and every memory access pattern.
$ perf stat -e cycles,instructions,cache-misses ./inference
# Performance counter stats:
42,891,234,567 cycles
98,234,567,890 instructions # 2.29 IPC
1,234,567 cache-misses # 0.001%
✓ 3.2x throughput improvement
A deep technical analysis of PagedAttention's memory management, including KV cache fragmentation analysis, page table implementation details, and performance implications at the GPU memory hierarchy level.
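To make the page-table mechanics concrete before the full article, here is a minimal Python sketch of block-table address translation in the PagedAttention style. The 16-token block size mirrors vLLM's default; the class and function names are hypothetical, and a real implementation also tracks reference counts for copy-on-write block sharing.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical KV block (vLLM's default; an assumption here)

class BlockAllocator:
    """Free-list allocator over fixed-size physical KV cache blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

@dataclass
class Sequence:
    block_table: list = field(default_factory=list)  # logical block -> physical block
    length: int = 0

def append_token(seq: Sequence, allocator: BlockAllocator) -> None:
    # A new physical block is allocated only when the last one fills up, so
    # internal fragmentation is bounded by BLOCK_SIZE - 1 tokens per sequence.
    if seq.length % BLOCK_SIZE == 0:
        seq.block_table.append(allocator.alloc())
    seq.length += 1

def kv_location(seq: Sequence, token_idx: int) -> tuple[int, int]:
    """Translate a logical token position into (physical block, offset)."""
    return seq.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```

Because blocks are small and non-contiguous, sequences of wildly different lengths can share one physical pool, which is exactly the fragmentation story the article analyzes.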
A systematic investigation into ESP32 power domains, RTC memory retention, and peripheral leakage. Includes register-level configurations and oscilloscope measurements proving sub-10 µA deep-sleep current.
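As a high-level companion to the register-level walkthrough, here is a minimal MicroPython sketch of the same deep-sleep flow: persist state across sleep in RTC memory, arm an EXT0 wake pin, and power down. GPIO0 and the 60-second interval are arbitrary assumptions, and the actual sleep current depends on board-level leakage, which the article measures on the scope.

```python
import machine
import esp32

rtc = machine.RTC()

# RTC slow memory survives deep sleep; use it to carry a wake-up counter.
if machine.reset_cause() == machine.DEEPSLEEP_RESET:
    count = int(rtc.memory().decode() or "0") + 1
else:
    count = 0
rtc.memory(str(count).encode())

# Arm EXT0 wake on one RTC-capable GPIO held low (GPIO0 is an assumption;
# an external pull-up is advisable since normal pulls power down in sleep).
wake_pin = machine.Pin(0, machine.Pin.IN, machine.Pin.PULL_UP)
esp32.wake_on_ext0(pin=wake_pin, level=esp32.WAKEUP_ALL_LOW)

# Everything but the RTC domain powers down for 60 s (or until the pin fires).
machine.deepsleep(60000)
```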
Analyzing FlashAttention's tiling strategy from an HBM bandwidth perspective. Includes roofline analysis, SRAM utilization measurements, and comparison with standard attention implementations.
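The gap the roofline analysis quantifies can be previewed with the HBM-access complexity bounds from the FlashAttention paper: standard attention touches Θ(Nd + N²) elements of HBM, while the tiled kernel needs Θ(N²d²/M) for on-chip SRAM of size M. A quick Python estimate, with sequence length, head dimension, and SRAM size as assumed example values:

```python
N, d = 4096, 128       # sequence length, head dimension (assumptions)
M = 128 * 1024         # usable SRAM elements per SM (assumption; ~256 KB of fp16)

standard = N * d + N * N        # standard attention: score matrix round-trips HBM
flash = N * N * d * d // M      # FlashAttention bound for SRAM of size M

print(f"standard/flash HBM accesses: {standard / flash:.1f}x")  # ~8x for these sizes
```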
Using BPFtrace and custom eBPF programs to trace CUDA runtime behavior, understand GPU scheduling latencies, and diagnose inference performance issues that nvidia-smi can't reveal.
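In the same spirit, here is a minimal BCC (the Python eBPF frontend) sketch that histograms the latency of cudaLaunchKernel via a uprobe/uretprobe pair. The libcudart path is an assumption for your install, and note this measures the CPU-side launch call, not GPU execution time.

```python
from bcc import BPF
import time

PROG = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(launch_ns);

int on_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int on_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;
    launch_ns.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
    start.delete(&tid);
    return 0;
}
"""

CUDART = "/usr/local/cuda/lib64/libcudart.so"  # assumption: adjust for your system

b = BPF(text=PROG)
b.attach_uprobe(name=CUDART, sym="cudaLaunchKernel", fn_name="on_entry")
b.attach_uretprobe(name=CUDART, sym="cudaLaunchKernel", fn_name="on_return")

print("Tracing cudaLaunchKernel... Ctrl-C to print the latency histogram.")
try:
    time.sleep(99999)
except KeyboardInterrupt:
    b["launch_ns"].print_log2_hist("ns")
```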
Deep dive into Gaudi2's HBM architecture, SRAM hierarchy, and TPC memory access patterns. Practical optimization techniques for maximizing memory bandwidth utilization in transformer workloads.
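One number those optimization techniques keep coming back to is achieved versus peak HBM bandwidth. A back-of-the-envelope Python check for a single GEMM, using Gaudi2's publicly quoted ~2.45 TB/s HBM2E peak (treat that figure, the shape, and the kernel time as assumptions for illustration):

```python
M, K, N, bytes_el = 4096, 4096, 4096, 2           # GEMM shape, bf16 (assumptions)
kernel_time_s = 75e-6                             # measured kernel time (assumption)
peak_bw = 2.45e12                                 # Gaudi2 HBM2E peak, bytes/s

bytes_moved = (M * K + K * N + M * N) * bytes_el  # minimum traffic: read A, B; write C
achieved = bytes_moved / kernel_time_s
print(f"utilization: {100 * achieved / peak_bw:.1f}% of peak")
```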
A detailed analysis of continuous batching algorithms, including iteration-level scheduling, preemption strategies, and the interaction between scheduler and memory manager in production inference systems.
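The core loop is easier to see in a stripped-down sketch than in prose. This hypothetical Python scheduler admits waiting sequences at iteration granularity whenever the token budget allows, and retires finished ones between forward passes; real systems add preemption and coordinate every admission with the KV block allocator.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Seq:
    length: int          # tokens currently held in the KV cache
    done: bool = False

class IterationScheduler:
    """Iteration-level scheduling: batch membership can change every step."""
    def __init__(self, max_batch_tokens: int):
        self.waiting: deque = deque()
        self.running: list = []
        self.max_batch_tokens = max_batch_tokens

    def step(self) -> list:
        # Retire finished sequences, freeing their token budget immediately.
        self.running = [s for s in self.running if not s.done]
        budget = self.max_batch_tokens - sum(s.length for s in self.running)
        # Admit waiting sequences as long as their prompts fit the budget.
        while self.waiting and self.waiting[0].length <= budget:
            seq = self.waiting.popleft()
            budget -= seq.length
            self.running.append(seq)
        return self.running  # run one forward pass over exactly this batch
```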
Comprehensive analysis of FP8, INT8, and INT4 KV cache quantization techniques. Includes calibration strategies, accuracy measurements, and practical implementation guidance for production inference.
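As a reference point for what the calibration section is up against, here is a minimal per-head symmetric INT8 quantizer for a KV tensor in NumPy. The shapes and the absmax calibration are illustrative assumptions; the article compares this baseline against FP8 and INT4 with proper calibration sets and accuracy measurements.

```python
import numpy as np

def quantize_kv_int8(kv: np.ndarray):
    """Symmetric per-head INT8; kv has shape (tokens, heads, head_dim)."""
    scale = np.abs(kv).max(axis=(0, 2), keepdims=True) / 127.0  # absmax calibration
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(512, 8, 64).astype(np.float32)   # toy KV cache slice
q, scale = quantize_kv_int8(kv)
print(f"max abs error: {np.abs(dequantize(q, scale) - kv).max():.4f}")
print(f"memory: {kv.nbytes} B fp32 -> {q.nbytes} B int8")  # 4x smaller (2x vs fp16)
```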
Deep dive into CUDA graph capture, replay, and the specific challenges of applying graphs to dynamic LLM inference workloads. Includes capture strategies and performance measurements.
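PyTorch's torch.cuda.CUDAGraph exposes the capture/replay cycle directly, so a minimal sketch fits in a few lines. The Linear layer standing in for a decode step is an assumption; the static input/output buffers it requires are exactly the constraint that makes dynamic shapes in LLM inference awkward for graphs.

```python
import torch

# Stand-in for one decode step; requires a CUDA device.
model = torch.nn.Linear(4096, 4096).cuda()
static_input = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one step into a graph; allocations made here become fixed buffers.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static buffer, then launch the whole graph
# with a single call instead of per-kernel launches.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
g.replay()  # static_output now holds results for the new input
```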