Showcase: Interactive Deep-Dives on Fridays with Faraday

A demonstration of the new high-performance technical blog features, including interactive GPU analysis and rigorous mathematical proofs.

January 6, 2026 (Updated: January 7, 2026)

Stanley Phoong

📓 Run in Google Colab

Welcome to the new Fridays with Faraday. We have upgraded the platform to support the rigorous needs of systems performance engineering.

1. Interactive Performance Analysis

Determining whether a kernel is memory-bound or compute-bound is easier with interactive tools. Use the slider below to see how arithmetic intensity (FLOPs performed per byte of memory traffic) moves the “performance dot” along the roofline curve.

Interactive widget: Native Roofline Model (slider "Adjust Intensity", shown at 10 FLOP/B)
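For readers without the interactive widget, here is a minimal Python sketch of what the slider explores, assuming the standard roofline bound (attainable performance is the smaller of peak compute and bandwidth times intensity). The function names and the device figures (10 TFLOP/s, 1 TB/s) are hypothetical, chosen for illustration; they are not the widget's actual implementation or parameters.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/B) where the two roofline segments meet."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline bound in FLOP/s for a kernel with the given intensity (FLOP/B)."""
    return min(peak_flops, peak_bandwidth * intensity)


def classify(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Label a kernel by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_flops, peak_bandwidth):
        return "memory-bound"
    return "compute-bound"


# Hypothetical device, expressed in base units so the ratio is in FLOP/B.
PEAK_FLOPS = 10e12       # 10 TFLOP/s (assumed)
PEAK_BANDWIDTH = 1e12    # 1 TB/s (assumed)

if __name__ == "__main__":
    # Sweep intensity the way the slider does and report the bound at each step.
    for intensity in (0.25, 1.0, 10.0, 100.0):
        perf = attainable_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
        label = classify(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
        print(f"{intensity:7.2f} FLOP/B -> {perf / 1e12:6.2f} TFLOP/s ({label})")
```

Keeping peak compute and bandwidth in base units (FLOP/s and B/s) avoids the factor-of-1000 mistakes that creep in when TFLOP/s and GB/s are mixed.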

2. Rigorous Mathematical Notation

For complex performance proofs, we now use specialized Theorem and Proof containers to distinguish theory from implementation.

Σ Theorem: The Roofline Intersection

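For a processor with peak performance $\pi$ (FLOP/s) and peak memory bandwidth $\beta$ (B/s), the ridge point $I_{ridge}$ (FLOP/B), where an algorithm transitions from memory-bound to compute-bound, is: $I_{ridge} = \frac{\pi}{\beta}$. When $\pi$ is quoted in TFLOPS and $\beta$ in GB/s, convert both to these base units first; otherwise the ratio is off by a factor of 1000.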
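A short derivation sketch, assuming the standard roofline bound in which attainable performance at arithmetic intensity $I$ is capped by both compute and memory traffic. The symbol $P$ and the device figures in the example are illustrative assumptions, not values from this post.

$$P(I) = \min(\pi,\ \beta I)$$

The memory-bound branch $\beta I$ grows linearly with $I$, while the compute-bound branch $\pi$ is constant. They meet where

$$\beta I_{ridge} = \pi \quad\Longrightarrow\quad I_{ridge} = \frac{\pi}{\beta}$$

For a hypothetical device with $\pi = 10$ TFLOP/s $= 10^{13}$ FLOP/s and $\beta = 1000$ GB/s $= 10^{12}$ B/s:

$$I_{ridge} = \frac{10^{13}\ \text{FLOP/s}}{10^{12}\ \text{B/s}} = 10\ \text{FLOP/B}$$

Kernels with $I < 10$ FLOP/B on such a device are limited by $\beta I$ (memory-bound); kernels with $I \geq 10$ FLOP/B are limited by $\pi$ (compute-bound).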

3. Reproducibility & Colab Integration

Notice the “Run in Google Colab” button at the top of this post. It dynamically links to the Jupyter Notebook associated with this technical analysis, allowing you to run the benchmarks yourself.

4. Reading Progress

As you scroll through this post, watch the accent-blue bar at the very top of your browser window. It shows how far through the article you are, which is especially useful for long-form deep-dives such as our 25+ minute “expert” series.

5. Multi-Author Support

At the bottom of this page, you will see a dynamic Author Bio. By moving authors to a separate data collection, we now support guest contributions from the wider performance engineering community.

What’s next? Try the new Search feature in the header to find other posts involving CUDA or Registers.

Stanley Phoong

Performance engineer obsessed with every microsecond. Specializing in vLLM internals and bare-metal microcontroller optimization.
