⚡ Fridays with Faraday
  • Posts
  • Categories
  • Series
  • About

All Posts

18 articles on systems performance and optimization

vllm · expert

Dissecting vLLM's PagedAttention: A Memory-Level Analysis

A deep technical analysis of PagedAttention's memory management, including KV cache fragmentation analysis, page table implementation details, and performance implications at the GPU memory hierarchy level.

Nov 15, 2024 · 25 min read
#vllm #pagedattention #kv-cache +2
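As a toy illustration of the page-table idea the post dissects (a hypothetical helper, not vLLM code; the 16-token block size is vLLM's default, everything else is assumed for the sketch):

```python
def kv_block_lookup(block_table, token_idx, block_size=16):
    # PagedAttention-style address translation: a sequence's KV cache lives in
    # fixed-size logical blocks, and a per-sequence block table maps each
    # logical block to a non-contiguous physical block in GPU memory.
    logical_block = token_idx // block_size
    offset = token_idx % block_size
    return block_table[logical_block], offset

# A 40-token sequence spread over physical blocks 7, 2, and 9:
physical_block, offset = kv_block_lookup([7, 2, 9], token_idx=20)
# token 20 falls in logical block 1 -> physical block 2, slot 4
```

Because only the last block can be partially filled, internal fragmentation is bounded by one block per sequence, which is the fragmentation argument the post analyzes.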
microcontrollers · expert

Achieving Sub-10µA Sleep Current on ESP32: Register-Level Analysis

A systematic investigation into ESP32 power domains, RTC memory retention, and peripheral leakage. Includes register-level configurations and oscilloscope measurements proving sub-10µA deep sleep.

Nov 14, 2024 · 22 min read
#esp32 #power-management #low-power +2
llm inference · expert

FlashAttention Through the Memory Hierarchy Lens

Analyzing FlashAttention's tiling strategy from an HBM bandwidth perspective. Includes roofline analysis, SRAM utilization measurements, and comparison with standard attention implementations.

Nov 13, 2024 · 28 min read
#flashattention #gpu-memory #attention +2
profiling · expert

Production LLM Profiling with eBPF: Beyond nvidia-smi

Using BPFtrace and custom eBPF programs to trace CUDA runtime behavior, understand GPU scheduling latencies, and diagnose inference performance issues that nvidia-smi can't reveal.

Nov 12, 2024 · 24 min read
#ebpf #bpftrace #cuda +3
gpu programming · expert

Habana Gaudi2 Memory Subsystem: Optimization Strategies for LLM Inference

Deep dive into Gaudi2's HBM architecture, SRAM hierarchy, and TPC memory access patterns. Practical optimization techniques for maximizing memory bandwidth utilization in transformer workloads.

Nov 11, 2024 · 26 min read
#gaudi #habana #hpu +3
vllm · expert

Implementing Continuous Batching: From Scheduling Theory to vLLM Practice

A detailed analysis of continuous batching algorithms, including iteration-level scheduling, preemption strategies, and the interaction between scheduler and memory manager in production inference systems.

Nov 10, 2024 · 23 min read
#vllm #continuous-batching #scheduling +2
llm inference · advanced

KV Cache Quantization: Trading Precision for Throughput

Comprehensive analysis of FP8, INT8, and INT4 KV cache quantization techniques. Includes calibration strategies, accuracy measurements, and practical implementation guidance for production inference.

Nov 9, 2024 · 20 min read
#quantization #kv-cache #fp8 +2
gpu programming · advanced

CUDA Graphs for Inference: Eliminating CPU Launch Overhead

Deep dive into CUDA graph capture, replay, and the specific challenges of applying graphs to dynamic LLM inference workloads. Includes capture strategies and performance measurements.

Nov 8, 2024 · 18 min read
#cuda #cuda-graphs #inference +2
llm inference · advanced

Speculative Decoding: Trading Compute for Latency

Implementation details of speculative decoding, including draft model selection, acceptance rate optimization, tree-structured speculation, and when speculative decoding helps vs hurts.

Nov 8, 2024 · 18 min read
#speculative-decoding #latency #inference +1
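A quick way to see the helps-versus-hurts trade-off: under the standard i.i.d. acceptance model from the original speculative decoding analysis, drafting k tokens with per-token acceptance rate α yields an expected (1 − α^(k+1)) / (1 − α) tokens per target-model step. A minimal sketch of that formula:

```python
def expected_tokens_per_step(alpha, k):
    # Expected tokens emitted per target-model step when k tokens are drafted
    # and each is accepted independently with probability alpha.
    # Even with alpha = 0 the target model still emits one token per step.
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

speedup_bound = expected_tokens_per_step(0.8, 4)  # about 3.36 tokens/step
```

When α is low (a weak draft model) the expectation approaches 1, so the draft compute is pure overhead, which is the "hurts" regime.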
transformers · advanced

Attention Variants Compared: MHA, MQA, GQA, and MLA

Technical comparison of Multi-Head Attention, Multi-Query Attention, Grouped-Query Attention, and Multi-head Latent Attention. Analysis of memory-compute trade-offs and implementation considerations.

Nov 7, 2024 · 22 min read
#attention #mha #mqa +4
microcontrollers · advanced

ARM Cortex-M4 DSP Instructions: Practical Audio Processing

Leveraging SIMD instructions on Cortex-M4 for real-time audio processing. Includes cycle-accurate analysis, CMSIS-DSP usage, and hand-optimized assembly for FIR filters.

Nov 7, 2024 · 19 min read
#arm #cortex-m4 #dsp +3
transformers · advanced

Grouped Query Attention: Memory-Throughput Trade-offs

Analysis of GQA's KV cache reduction mechanism, optimal group sizes, and performance implications for inference at scale. Includes benchmarks across different model sizes.

Nov 6, 2024 · 16 min read
#attention #gqa #mha +3
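A back-of-the-envelope version of the KV-cache reduction the post benchmarks (a sketch with assumed Llama-2-7B-class shapes, FP16 cache, batch size 1):

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, dtype_bytes=2, batch=1):
    # K and V each store batch * seq_len * kv_heads * head_dim values per layer,
    # hence the leading factor of 2.
    return 2 * layers * batch * seq_len * kv_heads * head_dim * dtype_bytes

# 32 layers, 32 query heads, head_dim 128, 4k context:
mha = kv_cache_bytes(layers=32, seq_len=4096, kv_heads=32, head_dim=128)  # 2 GiB
gqa = kv_cache_bytes(layers=32, seq_len=4096, kv_heads=8, head_dim=128)   # 0.5 GiB
```

The cache shrinks exactly by the grouping factor (query_heads / kv_heads, 4x here), which is why GQA trades so little compute for so much memory headroom at large batch sizes.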
distributed systems · expert

Tensor Parallelism Implementation: AllReduce Patterns and Efficiency

Detailed analysis of tensor parallelism for multi-GPU inference, including column/row splitting strategies, AllReduce optimization, and practical implementation considerations.

Nov 5, 2024 · 21 min read
#tensor-parallelism #distributed #multi-gpu +2
profiling · intermediate

GPU Memory Profiling: Finding Leaks and Fragmentation

Practical techniques for diagnosing GPU memory issues using PyTorch memory profiling APIs, including allocation tracking, fragmentation analysis, and memory snapshot debugging.

Nov 4, 2024 · 15 min read
#memory #profiling #pytorch +2
microcontrollers · intermediate

I2C Bus Optimization: Achieving 1MHz on Noisy Lines

Practical techniques for reliable high-speed I2C communication, including rise time analysis, pull-up resistor calculations, and noise immunity improvements.

Nov 3, 2024 · 14 min read
#i2c #embedded #signal-integrity +2
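The pull-up calculation mentioned above boils down to bracketing the resistor between two limits: the RC rise-time budget (the I2C spec measures rise from 30% to 70% of VDD, giving a factor of ln(7/3)) and the driver's sink-current spec at V_OL. A sketch with assumed Fast-mode Plus numbers:

```python
import math

def i2c_pullup_range(vdd, bus_cap_f, t_rise_max_s, v_ol=0.4, i_ol=0.020):
    # Max pull-up: rise time t = R * C * ln(0.7 / 0.3) must fit the spec budget.
    r_max = t_rise_max_s / (bus_cap_f * math.log(7 / 3))
    # Min pull-up: the open-drain driver must hold V_OL while sinking I_OL.
    r_min = (vdd - v_ol) / i_ol
    return r_min, r_max

# Fast-mode Plus (1 MHz): 120 ns max rise time; assume a 100 pF bus at 3.3 V.
r_min, r_max = i2c_pullup_range(3.3, 100e-12, 120e-9)
# roughly 145 ohms to 1.4 kohms
```

The narrowness of that window at 1 MHz on a capacitive bus is exactly why noisy lines often force active pull-ups or bus capacitance reduction rather than a bigger resistor.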
gpu programming · advanced

Writing Efficient CUDA Kernels: From Naive to Optimized

Step-by-step optimization of a CUDA kernel using memory coalescing, shared memory, occupancy tuning, and instruction-level parallelism.

Nov 2, 2024 · 20 min read
#cuda #kernel #optimization +2
distributed systems · intermediate

Request Routing for LLM Inference: Load Balancing Strategies

Analysis of load balancing algorithms for multi-replica LLM serving, including least-connections, weighted routing, and queue-depth-aware strategies.

Nov 1, 2024 · 14 min read
#load-balancing #routing #inference +2
transformers · advanced

RoPE Embeddings: Implementation and Long Context Scaling

Understanding Rotary Position Embeddings, their efficient implementation, and techniques for extending context length including YaRN and Dynamic NTK scaling.

Oct 31, 2024 · 17 min read
#rope #positional-encoding #transformers +2
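The core RoPE operation is small enough to show inline: each (even, odd) feature pair is rotated by an angle proportional to the token position, with per-pair frequencies base^(−2i/d). A minimal pure-Python sketch (illustration only, not the efficient implementation the post covers):

```python
import math

def rope(pairs, pos, base=10000.0):
    # Rotate each (even, odd) feature pair by pos * theta_i,
    # where theta_i = base**(-2i / d) are the standard RoPE frequencies.
    d = 2 * len(pairs)
    out = []
    for i, (a, b) in enumerate(pairs):
        theta = pos * base ** (-2 * i / d)
        out.append((a * math.cos(theta) - b * math.sin(theta),
                    a * math.sin(theta) + b * math.cos(theta)))
    return out

# Position 0 leaves the vector unchanged; any position preserves pair norms,
# which is why RoPE encodes relative position in the query-key dot product.
assert rope([(1.0, 0.0)], pos=0) == [(1.0, 0.0)]
```

Context-extension tricks like YaRN and Dynamic NTK work by rescaling the theta_i frequencies rather than changing this rotation structure.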

Categories

  • vllm 2
  • microcontrollers 3
  • llm inference 3
  • profiling 2
  • gpu programming 3
  • transformers 3
  • distributed systems 2

Popular Tags

#inference #optimization #cuda #memory #kv-cache #attention #transformers #vllm #gpu-memory #embedded #profiling #performance #latency #mha #mqa
⚡ Fridays with Faraday

Deep technical explorations in systems performance optimization, from bare-metal microcontrollers to large-scale LLM inference systems.

Categories

  • Microcontrollers
  • vLLM
  • LLM Inference
  • Hardware
  • Profiling
  • GPU Programming

Resources

  • About
  • RSS Feed
  • Sitemap
  • GitHub

© 2025 Fridays with Faraday. Built with Astro.

"Measure. Optimize. Repeat."