gpu programming intermediate
Showcase: Interactive Deep-Dives on Fridays with Faraday
A demonstration of the new high-performance technical blog features, including interactive GPU analysis and rigorous mathematical proofs.
gpu programming expert
Habana Gaudi2 Memory Subsystem: Optimization Strategies for LLM Inference
Deep dive into Gaudi2's HBM architecture, SRAM hierarchy, and TPC memory access patterns. Practical optimization techniques for maximizing memory bandwidth utilization in transformer workloads.
gpu programming advanced
CUDA Graphs for Inference: Eliminating CPU Launch Overhead
Deep dive into CUDA graph capture, replay, and the specific challenges of applying graphs to dynamic LLM inference workloads. Includes capture strategies and performance measurements.
gpu programming advanced
Writing Efficient CUDA Kernels: From Naive to Optimized
Step-by-step optimization of a CUDA kernel using memory coalescing, shared memory, occupancy tuning, and instruction-level parallelism.