LLM inference · expert
FlashAttention Through the Memory Hierarchy Lens
Analyzing FlashAttention's tiling strategy from an HBM bandwidth perspective. Includes roofline analysis, SRAM utilization measurements, and comparison with standard attention implementations.
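The bandwidth argument can be made concrete with a back-of-the-envelope roofline sketch. The snippet below is a simplified model, not a measurement: the sequence length, head dimension, and FP16 element size are illustrative assumptions, and lower-order terms (softmax statistics, the running max/sum that FlashAttention keeps in SRAM) are ignored.

```python
def attention_flops(n, d):
    # Q @ K^T and P @ V each cost ~2*n*n*d FLOPs; softmax is lower order.
    return 4 * n * n * d

def standard_hbm_bytes(n, d, elt=2):
    # Standard attention materializes the n x n score matrix in HBM:
    # read Q, K, V; write then re-read S and P; write O. (elt=2 assumes FP16.)
    return elt * (3 * n * d + 4 * n * n + n * d)

def flash_hbm_bytes(n, d, elt=2):
    # FlashAttention streams K/V tiles through SRAM, so only Q, K, V, O
    # touch HBM (running softmax statistics are O(n) and ignored here).
    return elt * (4 * n * d)

n, d = 4096, 128  # illustrative sequence length and head dimension
ai_std = attention_flops(n, d) / standard_hbm_bytes(n, d)
ai_flash = attention_flops(n, d) / flash_hbm_bytes(n, d)
print(f"arithmetic intensity: standard ~{ai_std:.0f}, flash ~{ai_flash:.0f} FLOPs/byte")
```

At these sizes the standard kernel sits far below a modern GPU's balance point and is HBM-bound, while tiling raises arithmetic intensity by more than an order of magnitude.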
LLM inference · advanced
KV Cache Quantization: Trading Precision for Throughput
Comprehensive analysis of FP8, INT8, and INT4 KV cache quantization techniques. Includes calibration strategies, accuracy measurements, and practical implementation guidance for production inference.
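As a minimal sketch of the INT8 case, the snippet below quantizes a cached K/V tensor with per-token symmetric scales (one scale per cached vector). The tensor shape is illustrative, and production schemes layer calibration, outlier handling, and often per-channel scales for keys on top of this.

```python
import numpy as np

def quantize_kv_int8(kv, axis=-1):
    # Per-token symmetric INT8 quantization: one scale per cached vector,
    # chosen so the largest-magnitude element maps to +/-127.
    scale = np.abs(kv).max(axis=axis, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero vectors
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv_int8(q, scale):
    # Dequantize back to FP32 before the attention matmul.
    return q.astype(np.float32) * scale

# Illustrative cache slice: (num_heads, seq_len, head_dim)
kv = np.random.default_rng(0).standard_normal((2, 16, 64)).astype(np.float32)
q, s = quantize_kv_int8(kv)
err = np.abs(dequantize_kv_int8(q, s) - kv).max()
```

The worst-case rounding error per element is half a quantization step, i.e. `scale / 2`, which is why per-token or per-channel scales beat a single per-tensor scale when magnitudes vary across the cache.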
LLM inference · advanced
Speculative Decoding: Trading Compute for Latency
Implementation details of speculative decoding, including draft model selection, acceptance rate optimization, tree-structured speculation, and guidance on when speculative decoding helps versus when it hurts.
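The core of speculative decoding is the acceptance test: the draft model proposes a token from its distribution q, and the target model accepts it with probability min(1, p/q), resampling from the normalized residual max(p - q, 0) on rejection. The toy sketch below uses hypothetical three-token vocabulary distributions to show that this procedure reproduces the target distribution p exactly.

```python
import random

def speculative_step(p, q, rng):
    # Draft proposes a token x ~ q; target accepts with prob min(1, p[x]/q[x]).
    # On rejection, resample from the residual distribution max(p - q, 0)
    # (renormalized). Net effect: output tokens are distributed exactly as p.
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]

p = [0.6, 0.3, 0.1]   # toy target-model distribution
q = [0.4, 0.4, 0.2]   # toy draft-model distribution
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(20000):
    counts[speculative_step(p, q, rng)] += 1
freqs = [c / 20000 for c in counts]
```

The empirical frequencies converge to p, not q; the acceptance rate, and hence the speedup, depends on how closely the draft distribution tracks the target, which is why draft model selection matters.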