⚡ Fridays with Faraday
  • Posts
  • Categories
  • Series
  • About

🧠

LLM Inference

Large language model inference optimization and transformer performance.

3 articles

LLM Inference · Expert

FlashAttention Through the Memory Hierarchy Lens

Analyzing FlashAttention's tiling strategy from an HBM bandwidth perspective. Includes roofline analysis, SRAM utilization measurements, and comparison with standard attention implementations.

Nov 13, 2024 · 28 min read
#flashattention #gpu-memory #attention +2
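As a taste of the roofline-style arithmetic the article covers, here is a back-of-envelope HBM traffic comparison. This is a sketch under my own assumptions (fp16 storage, the N×N score matrix materialized in HBM by the standard implementation, illustrative SRAM size), not measurements from the post:

```python
# Back-of-envelope HBM traffic: standard vs. tiled (FlashAttention-style)
# attention. Constants are illustrative assumptions, not measured data.
def standard_attention_hbm_bytes(N, d, dtype_bytes=2):
    # Read Q, K, V and write O (4*N*d elements), plus write and re-read
    # the N*N score matrix around the softmax (4*N*N accesses).
    return dtype_bytes * (4 * N * d + 4 * N * N)

def flash_attention_hbm_bytes(N, d, sram_bytes, dtype_bytes=2):
    # Tiling: K and V are streamed once per block of Q rows, so they are
    # re-read roughly ceil(4*N*d*dtype_bytes / sram_bytes) times,
    # consistent with the O(N^2 d^2 / M) I/O bound for tiled attention.
    M = sram_bytes // dtype_bytes      # on-chip SRAM capacity in elements
    kv_passes = -(-4 * N * d // M)     # ceiling division
    return dtype_bytes * (2 * N * d + 2 * N * d * kv_passes)

std = standard_attention_hbm_bytes(4096, 64)
flash = flash_attention_hbm_bytes(4096, 64, sram_bytes=100 * 1024)
```

With these toy numbers the tiled version moves several times fewer bytes through HBM, which is the whole point of the memory-hierarchy framing.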
LLM Inference · Advanced

KV Cache Quantization: Trading Precision for Throughput

Comprehensive analysis of FP8, INT8, and INT4 KV cache quantization techniques. Includes calibration strategies, accuracy measurements, and practical implementation guidance for production inference.

Nov 9, 2024 · 20 min read
#quantization #kv-cache #fp8 +2
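To make the precision-for-throughput trade concrete, here is a minimal sketch of symmetric per-token INT8 quantization of a KV tensor. The granularity choice (per-token scales) and the numbers are my assumptions for illustration; production schemes vary:

```python
import numpy as np

def quantize_int8(x):
    # Per-token (per-row) symmetric quantization: map max |value| to 127.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

kv = np.array([[0.5, -1.0, 0.25]], dtype=np.float32)
q, scale = quantize_int8(kv)
kv_hat = dequantize_int8(q, scale)
```

Each cached key/value element shrinks from 2 bytes (fp16) to 1 byte plus a small per-token scale, roughly doubling the token capacity of the cache at the cost of quantization error.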
LLM Inference · Advanced

Speculative Decoding: Trading Compute for Latency

Implementation details of speculative decoding, including draft model selection, acceptance-rate optimization, tree-structured speculation, and when speculation helps versus when it hurts.

Nov 8, 2024 · 18 min read
#speculative-decoding #latency #inference +1
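The acceptance-rate trade-off at the heart of speculative decoding can be sketched with the standard expected-tokens formula from the speculative decoding literature. Assuming i.i.d. per-token acceptance rate `a` and draft length `k` (a simplification; real acceptance rates vary by position):

```python
def expected_tokens_per_step(a, k):
    # Expected tokens accepted per target-model forward pass with draft
    # length k and per-token acceptance probability a:
    #   E = (1 - a^(k+1)) / (1 - a)
    # Assumes i.i.d. acceptance; real traces deviate from this.
    if a == 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)

# a = 0 degenerates to ordinary decoding (1 token per step); a high
# acceptance rate amortizes the target model over ~3-4 tokens per step.
baseline = expected_tokens_per_step(0.0, 4)
good_draft = expected_tokens_per_step(0.8, 4)
```

This is why draft model selection matters: if the draft model's distribution diverges from the target's, `a` drops and the extra draft compute buys little latency, which is the "helps versus hurts" boundary the article maps out.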
← All Categories
⚡ Fridays with Faraday

Deep technical explorations in systems performance optimization, from bare-metal microcontrollers to large-scale LLM inference systems.

Categories

  • Microcontrollers
  • vLLM
  • LLM Inference
  • Hardware
  • Profiling
  • GPU Programming

Resources

  • About
  • RSS Feed
  • Sitemap
  • GitHub

© 2025 Fridays with Faraday. Built with Astro.

"Measure. Optimize. Repeat."