distributed systems · expert
Tensor Parallelism Implementation: AllReduce Patterns and Efficiency
Detailed analysis of tensor parallelism for multi-GPU inference, including column/row splitting strategies, AllReduce optimization, and practical implementation considerations.
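The column/row splitting pattern mentioned above can be sketched in a few lines. The following is a minimal single-host simulation using NumPy, with the AllReduce stood in for by a local sum; the shard count and layer sizes are illustrative assumptions, not taken from any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)
TP = 4                    # tensor-parallel degree (number of simulated "GPUs")
d_model, d_ff = 8, 16     # toy dimensions; d_ff must divide evenly by TP

x  = rng.normal(size=(2, d_model))      # batch of input activations
W1 = rng.normal(size=(d_model, d_ff))   # first linear layer
W2 = rng.normal(size=(d_ff, d_model))   # second linear layer

# Column parallelism: each rank holds a slice of W1's output columns.
W1_shards = np.split(W1, TP, axis=1)
# Row parallelism: each rank holds the matching slice of W2's input rows.
W2_shards = np.split(W2, TP, axis=0)

# Local compute per rank: no communication is needed between the two
# matmuls, because the column split of W1 lines up with the row split of W2.
partials = [(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# One AllReduce (sum) combines the per-rank partial outputs; on a single
# host this is just a sum over the shard axis.
y_tp = np.sum(partials, axis=0)

# Reference: the unsharded computation gives the same result.
y_ref = (x @ W1) @ W2
assert np.allclose(y_tp, y_ref)
```

The key property this demonstrates is why the pattern needs only one collective per layer pair: the intermediate activations stay partitioned across ranks, and a single sum-AllReduce at the end reconstructs the full output.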
distributed systems · intermediate
Request Routing for LLM Inference: Load Balancing Strategies
Analysis of load balancing algorithms for multi-replica LLM serving, including least-connections, weighted routing, and queue-depth-aware strategies.
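The difference between least-connections and queue-depth-aware routing can be shown with a small sketch. The replica fields and policy functions below are illustrative assumptions rather than any serving framework's API; the weight field stands in for the weighted-routing idea (e.g. a larger-capacity replica gets a higher weight).

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    active: int = 0          # in-flight requests (what least-connections sees)
    queue_depth: int = 0     # requests waiting in the replica's queue
    weight: float = 1.0      # relative capacity weight

def least_connections(replicas):
    """Pick the replica with the fewest in-flight requests."""
    return min(replicas, key=lambda r: r.active)

def queue_aware(replicas):
    """Pick the replica with the smallest weighted backlog:
    (active + queued) scaled down by the replica's capacity weight."""
    return min(replicas, key=lambda r: (r.active + r.queue_depth) / r.weight)

replicas = [
    Replica("a", active=3, queue_depth=0, weight=1.0),
    Replica("b", active=1, queue_depth=9, weight=1.0),
    Replica("c", active=4, queue_depth=1, weight=2.0),
]

# Least-connections ignores the queue and picks "b" (fewest active),
# while the queue-aware policy sees b's backlog and routes to "c",
# whose weighted backlog (4 + 1) / 2.0 = 2.5 is smallest.
print(least_connections(replicas).name)  # "b"
print(queue_aware(replicas).name)        # "c"
```

This illustrates the failure mode that motivates queue-depth-aware strategies for LLM serving: with long, variable-length generations, in-flight connection counts alone say little about how much work a replica still has queued.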