GPU Programming · Expert
Habana Gaudi2 Memory Subsystem: Optimization Strategies for LLM Inference
A deep dive into Gaudi2's HBM architecture, on-chip SRAM hierarchy, and TPC memory access patterns, with practical techniques for maximizing memory bandwidth utilization in transformer workloads.
GPU Programming · Advanced
CUDA Graphs for Inference: Eliminating CPU Launch Overhead
A deep dive into CUDA graph capture and replay, and the specific challenges of applying graphs to dynamic LLM inference workloads, including capture strategies and performance measurements.
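The capture-and-replay pattern the article covers can be sketched with the standard CUDA runtime API: record a fixed sequence of kernel launches into a graph via stream capture, instantiate it once, then replay the whole sequence with a single `cudaGraphLaunch` call instead of paying per-kernel CPU launch overhead. This is a minimal sketch, not the article's code; the `scale` kernel and iteration counts are illustrative placeholders.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a short sequence of kernel launches into a graph.
    // Work enqueued on the stream between Begin/EndCapture is recorded,
    // not executed.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 8; ++k)
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 1.0001f, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then replay many times: one launch call per
    // iteration replaces eight individual kernel launches.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return 0;
}
```

The catch for dynamic LLM inference, as the article's framing suggests, is that a captured graph bakes in launch parameters, so varying batch sizes or sequence lengths require re-capture, graph updates, or padding to a fixed set of shapes.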
GPU Programming · Advanced
Writing Efficient CUDA Kernels: From Naive to Optimized
A step-by-step optimization of a CUDA kernel, progressing through memory coalescing, shared memory, occupancy tuning, and instruction-level parallelism.
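The first two steps in that progression are captured by the classic matrix-transpose example (a sketch under my own choice of kernel, not necessarily the one the article uses): the naive version issues strided global stores, while the tiled version stages a block in shared memory so both the load and the store are coalesced.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Naive transpose: global loads are coalesced, but each warp's stores
// stride through memory, touching 32 separate cache lines.
__global__ void transpose_naive(const float* in, float* out, int n) {
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) out[x * n + y] = in[y * n + x];
}

// Tiled transpose: stage a TILE x TILE block in shared memory so both
// the global load and the global store are coalesced. The +1 padding
// avoids shared-memory bank conflicts on the transposed access.
__global__ void transpose_tiled(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;  // swap block roles for the write
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

Occupancy tuning and instruction-level parallelism then build on this baseline, e.g. by adjusting `TILE` and block shape, or having each thread transpose several elements per iteration.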