Part of Series: GPU Hardware & AI Accelerators (19 of 30)

Intel Gaudi (designed by Habana Labs, acquired by Intel in 2019 for $2 billion) is the third major AI accelerator architecture after NVIDIA GPUs and Google TPUs. Gaudi's defining characteristic is architectural separation: the Matrix Math Engine (MME) handles GEMM operations, while the Tensor Processing Cores (TPCs; 24 on Gaudi 2, 64 on Gaudi 3) handle everything else: elementwise operations, reductions, data transformations, and custom operators. Both units operate independently with dedicated memory interfaces, allowing full overlap of matrix and non-matrix computation.

The other distinguishing feature is integrated networking. Each Gaudi 2 chip includes 24 on-die 100-Gbps RDMA (Remote Direct Memory Access) NICs (Gaudi 3 doubles each port to 200 Gbps), so no external InfiniBand HCA or NVLink cables are required. A Gaudi server connects its accelerators and its inter-server fabric through the same on-chip network interfaces, eliminating a major cost and power component of GPU training clusters.

This post covers the Gaudi hardware architecture (Gaudi 1, 2, and 3), the TPC programmable core, the Graph Compiler execution model, the integrated networking approach, the Synapse AI software stack, and a quantitative ROI comparison against NVIDIA GPUs.

Architecture Overview

📊 Gaudi Generational Comparison

| Specification | Gaudi 1 | Gaudi 2 | Gaudi 3 |
|---|---|---|---|
| Release year | 2019 | 2022 | 2024 |
| Process | 16nm (TSMC) | 7nm (TSMC) | 5nm (TSMC) |
| MME count | 1 | 2 | 8 |
| TPC count | 8 | 24 | 64 |
| BF16 GEMM TFLOPS | N/A | 432 | 1,835 |
| FP8 GEMM TFLOPS | N/A | 865 | 3,670 |
| HBM type | HBM2 (32 GB) | HBM2e (96 GB) | HBM2e (128 GB) |
| HBM bandwidth | 1,000 GB/s | 2,450 GB/s | 3,700 GB/s |
| SRAM (on-chip) | 24 MB | 48 MB | 96 MB |
| Network | 10x 100G RoCE | 24x 100G RoCE | 24x 200G RoCE |
| Network BW per chip | 125 GB/s | 300 GB/s | 600 GB/s |
| TDP | ~300 W | ~600 W | ~900 W |
Note: Gaudi 3 closes the gap with H100 on compute (1,835 vs 990 BF16 TFLOPS) and bandwidth (3,700 vs 3,350 GB/s). The 24x200G integrated networking provides 600 GB/s, about two-thirds of NVLink 4.0's 900 GB/s.

The Matrix Math Engine (MME)

Architecture

The MME is Gaudi's dedicated matrix multiplication engine. Gaudi 2 has 2 MME units; Gaudi 3 scales up to 8:

// Gaudi 2 MME:
// 2 MMEs, each a systolic-array-style matrix engine
// Supported precisions: FP32, TF32, BF16, FP16, FP8
// Peak throughput (per chip, both MMEs):
//   BF16: 432 TFLOPS
//   FP8: 865 TFLOPS
//
// Gaudi 3 MME:
// 8 MMEs, ~4.2x the aggregate throughput of Gaudi 2
// BF16: 1,835 TFLOPS
// FP8: 3,670 TFLOPS
//
// MME operation:
// 1. Graph compiler tiles the GEMM into MME-sized tiles
// 2. DMA engine loads A and B tiles from HBM into MME input buffers
// 3. MME computes tile product C += A × B
// 4. DMA engine stores C tile back to HBM or SRAM
// 5. MME and DMA overlap: while computing tile N, loading tile N+1
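Step 5 is classic double buffering. A toy Python timing model (invented tile latencies, not measured Gaudi numbers) shows the payoff: with overlap, each tile costs max(compute, transfer) rather than their sum.

```python
# Toy model of MME/DMA tile pipelining. Timings are illustrative only.

def gemm_time_us(n_tiles, compute_us, dma_us, overlap=True):
    """Estimated time to push n_tiles through the MME, in microseconds."""
    if overlap:
        # The first tile must be loaded before compute can start; afterwards
        # the DMA engine prefetches tile N+1 while the MME computes tile N.
        return dma_us + n_tiles * max(compute_us, dma_us)
    # Serial: every tile pays load + compute back to back.
    return n_tiles * (compute_us + dma_us)

serial = gemm_time_us(100, compute_us=5.0, dma_us=4.0, overlap=False)     # 900.0
pipelined = gemm_time_us(100, compute_us=5.0, dma_us=4.0, overlap=True)   # 504.0
```

With these numbers, pipelining hides almost all of the DMA time; the GEMM becomes compute-bound, which is the goal of the tiling schedule.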

MME vs GPU Tensor Core

// Key architectural difference:
// GPU: tensor cores are part of the SM, share register file and schedulers
// Gaudi: MME is a completely independent unit with its own memory interface
//
// Advantage: MME can run at 100% utilization while TPCs do other work
// Disadvantage: data must be explicitly moved between MME and TPC address spaces
//
// GPU analogy:
// Imagine if the H100's tensor cores were a separate chip with their own HBM ports
// GEMM runs on that chip while CUDA cores do softmax on the GPU proper
// No resource contention between matrix and non-matrix operations

The Tensor Processing Core (TPC)

Programmable VLIW-SIMD Core

The TPC is Gaudi’s programmable compute core — the equivalent of CUDA cores, but designed specifically for tensor operations:

// TPC architecture:
// - VLIW (Very Long Instruction Word): issues multiple operations per cycle
// - SIMD: 2048-bit data path (256 lanes at 8-bit, 64 lanes at FP32)
// - Each TPC has:
//   - Scalar unit: address computation, loop control
//   - Vector unit (256-wide): elementwise operations
//   - Load/Store unit: HBM and SRAM access
//   - Special function unit: exp, log, rsqrt, tanh
//
// TPC instruction format (VLIW packet):
// [Scalar op] [Vector op 1] [Vector op 2] [Load/Store] [SFU op]
// All 5 slots can execute simultaneously in one cycle

// Gaudi 2: 24 TPCs × 256-wide = 6,144 parallel lanes
// Gaudi 3: 64 TPCs × 256-wide = 16,384 parallel lanes

// Example TPC operation (pseudocode):
// Each TPC processes a 256-element chunk:
//   for each 256-element chunk in tensor:
//     vec = load(chunk)        // LD/ST slot
//     vec = vec * scale        // Vector slot 1
//     vec = vec + bias         // Vector slot 2
//     vec = exp(vec)           // SFU slot
//     idx = idx + 256          // Scalar slot (loop increment)
//     store(out_chunk, vec)    // LD/ST slot (next cycle)

Custom TPC Kernels

Unlike TPU (which requires XLA compilation for all operations), Gaudi allows custom TPC kernels:

// TPC kernel example: custom activation function
// Written in TPC-C (a C-like language for TPC)
void custom_activation_kernel(
    tensor input,   // Input tensor (in HBM or SRAM)
    tensor output,  // Output tensor
    float alpha     // Parameter
) {
    // TPC processes 256 elements per cycle
    int chunk_size = 256;
    int total = get_dim_size(input, 0);

    for (int i = get_index_space_offset(); i < total; i += get_index_space_size()) {
        // Load 256 elements
        float256 vec = v_f32_ld_tnsr(i, input);

        // Custom computation: leaky-swish hybrid
        // sigmoid(x) = 1 / (1 + exp(-x)): reciprocal, not rsqrt
        // (reciprocal intrinsic name shown for illustration)
        float256 sigmoid_vec = v_f32_recip(v_f32_add(1.0f, v_f32_exp(v_f32_neg(vec))));
        float256 positive = v_f32_mul(vec, sigmoid_vec);
        float256 negative = v_f32_mul(vec, alpha);

        // Select based on sign
        bool256 mask = v_f32_cmp(GT, vec, 0.0f);
        float256 result = v_f32_sel(mask, positive, negative);

        // Store 256 elements
        v_f32_st_tnsr(i, output, result);
    }
}
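A host-side reference makes kernels like this testable before they ever run on hardware. A NumPy sketch of the same leaky-swish hybrid (the function name and default alpha are mine, not a Synapse API):

```python
import numpy as np

def leaky_swish_ref(x, alpha=0.01):
    """Reference for the TPC kernel above: swish (x * sigmoid(x)) for
    positive inputs, alpha * x for non-positive inputs."""
    sigmoid = 1.0 / (1.0 + np.exp(-x))
    return np.where(x > 0, x * sigmoid, alpha * x)

x = np.array([-2.0, 0.0, 2.0], dtype=np.float32)
y = leaky_swish_ref(x)
# y[0] = -0.02 (leaky branch), y[2] ≈ 2 * sigmoid(2) ≈ 1.7616 (swish branch)
```

Comparing the device kernel's output against this reference over random inputs is the usual way to catch lane-indexing or intrinsic-selection bugs.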
ℹ️ TPC Programmability Is Gaudi's Differentiator

The TPC bridges the gap between GPU programmability and TPU efficiency. On a GPU, custom operations are written in CUDA and share the SM with GEMM (causing resource contention). On a TPU, custom operations must go through XLA (limiting flexibility). On Gaudi, custom operations run on dedicated TPC cores while GEMM runs independently on the MME — full parallelism with full programmability.

Graph Compiler Execution Model

How the Graph Compiler Works

Gaudi’s Graph Compiler takes a computation graph (from PyTorch or TensorFlow) and produces an optimized execution plan:

// Graph Compiler pipeline:
// 1. CAPTURE: intercept PyTorch operations to build a computation graph
//    Linear → GEMM node
//    ReLU → TPC elementwise node
//    LayerNorm → TPC reduction + elementwise nodes
//    Softmax → TPC exp + reduce + div nodes

// 2. OPTIMIZE: graph-level transformations
//    - Operator fusion: merge consecutive TPC operations into single kernels
//    - Layout optimization: transpose/permute tensors for optimal memory access
//    - Memory planning: determine HBM ↔ SRAM allocation for intermediates
//    - Scheduling: assign nodes to MME or TPC, overlap execution

// 3. SCHEDULE: determine execution order
//    - MME operations form one pipeline
//    - TPC operations form another pipeline
//    - DMA transfers form a third pipeline
//    - All three execute concurrently with explicit synchronization points

// 4. CODEGEN: generate device code for MME, TPC, and DMA
//    - MME: tile sizes, accumulation order
//    - TPC: fused kernel code (TPC-C compiled to TPC binary)
//    - DMA: transfer descriptors (source, destination, size)
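The fusion pass in stage 2 can be sketched as a toy graph transformation. Node labels like "tpc_elementwise" are illustrative, not Graph Compiler IR:

```python
# Minimal sketch of operator fusion: runs of adjacent TPC elementwise nodes
# collapse into one fused kernel, so intermediates never round-trip through HBM.

def fuse_elementwise(graph):
    """Collapse consecutive ("tpc_elementwise", name) nodes into single
    ("tpc_fused", (names...)) nodes; other nodes pass through unchanged."""
    fused, run = [], []
    for node in graph:
        if node[0] == "tpc_elementwise":
            run.append(node[1])
        else:
            if run:
                fused.append(("tpc_fused", tuple(run)))
                run = []
            fused.append(node)
    if run:
        fused.append(("tpc_fused", tuple(run)))
    return fused

graph = [("mme", "qkv_gemm"),
         ("tpc_elementwise", "scale"),
         ("tpc_elementwise", "bias"),
         ("tpc_elementwise", "gelu"),
         ("mme", "out_gemm")]
# fuse_elementwise(graph) ->
# [("mme", "qkv_gemm"), ("tpc_fused", ("scale", "bias", "gelu")), ("mme", "out_gemm")]
```

The real compiler applies many such passes (fusion, layout, memory planning) over the same graph representation; this shows only the shape of the idea.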

Overlapping MME and TPC

The key scheduling optimization: overlap GEMM (MME) with non-GEMM operations (TPC):

// Transformer layer execution on Gaudi:
//
// Timeline:
// MME: [QKV GEMM]───[Attn Score GEMM]───[AttnV GEMM]───[Output GEMM]───[FFN1 GEMM]───[FFN2 GEMM]
// TPC:         [Q,K,V reshape]──[Softmax]──[O reshape]──[LayerNorm]──────[GeLU]──────[LayerNorm]
// DMA: [Load W_qkv]──────[Load W_out]──────[Load W_ffn1]──────[Load W_ffn2]
//
// MME is always busy computing the next GEMM
// TPC processes the previous GEMM's output simultaneously
// DMA prefetches the next GEMM's weights
// All three engines overlap — zero idle time in the ideal case
//
// On a GPU:
// GEMM (tensor cores) and softmax (CUDA cores) share the SM
// Running softmax on CUDA cores can reduce tensor core utilization
// FlashAttention solves this by fusing attention into a single kernel
// But the fundamental resource contention exists
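The value of the overlap can be put in rough numbers. Per-stage times below are invented for illustration; the point is the structure of the calculation, not the figures:

```python
# Illustrative per-stage times (microseconds) for one transformer layer:
# six GEMMs on the MME, the non-GEMM ops on the TPCs.
mme_us = [120, 40, 40, 110, 300, 300]   # QKV, scores, attn*V, out-proj, FFN1, FFN2
tpc_us = [15, 25, 10, 20, 30, 20]       # reshape, softmax, reshape, LN, GeLU, LN

# Shared-engine model (GPU-style contention): stages queue behind each other.
serial_us = sum(mme_us) + sum(tpc_us)            # 1030
# Independent-engine model (Gaudi): TPC work hides entirely under MME work.
overlapped_us = max(sum(mme_us), sum(tpc_us))    # 910
```

When TPC time is small relative to MME time, perfect overlap makes the layer effectively GEMM-bound; the benefit grows as the non-GEMM fraction grows.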

Engine Utilization: Gaudi 2 vs H100 on Transformer Layer

(% of peak)
Gaudi 2 MME utilization: 88% — independent engine, no contention
Gaudi 2 TPC utilization: 72% — overlaps with MME
H100 tensor core utilization: 78% — shares SM with other ops
H100 CUDA core utilization: 35% — underutilized during GEMM

Integrated Networking

On-Die RDMA NICs

Each Gaudi chip has 24 independent 100 Gbps (Gaudi 2) or 200 Gbps (Gaudi 3) RDMA NICs integrated directly on the accelerator die:

// Gaudi 2 networking:
// 24 × 100 Gbps RoCE (RDMA over Converged Ethernet) v2 ports
// Total: 2,400 Gbps = 300 GB/s per chip
//
// Gaudi 3 networking:
// 24 × 200 Gbps RoCE v2 ports
// Total: 4,800 Gbps = 600 GB/s per chip
//
// Configuration flexibility:
// 8-chip server (Gaudi 2):
//   21 ports for intra-server (all-to-all between 8 chips)
//   3 ports for inter-server (to network switch)
//   Intra-server BW per chip: 21 × 100 Gbps = 2,100 Gbps = 262.5 GB/s
//   Inter-server BW per chip: 3 × 100 Gbps = 300 Gbps = 37.5 GB/s
//
// Alternatively, for scale-out:
//   All 24 ports to network switch
//   Per-chip: 300 GB/s to fabric
//   Requires top-of-rack switch with sufficient ports
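The port arithmetic above generalizes to any intra/inter split (assuming 100 Gbps ports and decimal gigabytes, GB/s = Gbps / 8):

```python
# Bandwidth split for a Gaudi 2 chip's 24 on-die RoCE ports.

def split_bandwidth_gbs(intra_ports, total_ports=24, gbps_per_port=100):
    """Return (intra-server, inter-server) bandwidth in GB/s for one chip."""
    inter_ports = total_ports - intra_ports
    return intra_ports * gbps_per_port / 8, inter_ports * gbps_per_port / 8

intra, inter = split_bandwidth_gbs(21)   # hybrid config: (262.5, 37.5)
scale_out = split_bandwidth_gbs(0)       # all ports to the switch: (0.0, 300.0)
```

The same helper with `gbps_per_port=200` reproduces the Gaudi 3 figures (600 GB/s total).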

Networking Cost Advantage

// GPU server networking cost:
// 8x H100 GPUs: NVLink + NVSwitch (included in DGX price)
// Inter-server: 8x ConnectX-7 HCAs (~$2,000 each = $16,000)
// InfiniBand switch: ~$30,000-60,000 per 40-port switch
// Cables: ~$500 each × 8 = $4,000
// Total networking: ~$50,000-80,000 per 8-GPU server

// Gaudi server networking cost:
// 8x Gaudi chips: RDMA NICs included on-die (no separate HCA)
// Ethernet switch: ~$10,000-20,000 per 48-port switch
// Cables: ~$200 each × 24 = $4,800
// Total networking: ~$15,000-25,000 per 8-chip server
//
// Savings: $35,000-55,000 per server (40-70% reduction in networking cost)
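As a sanity check on those totals, here is the same bill of materials summed at midpoints of the quoted ranges (the switch-share midpoints are my reading of the ranges, not vendor quotes):

```python
# Per-server networking bill of materials, midpoints of the ranges above (USD).
gpu_server   = {"hcas": 8 * 2_000, "switch_share": 45_000, "cables": 8 * 500}
gaudi_server = {"hcas": 0,         "switch_share": 15_000, "cables": 24 * 200}

gpu_total   = sum(gpu_server.values())    # 65,000
gaudi_total = sum(gaudi_server.values())  # 19,800
savings     = gpu_total - gaudi_total     # 45,200 per server at these midpoints
```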
ℹ️ Integrated Networking Eliminates a Bottleneck

On GPU servers, the InfiniBand HCA connects to the GPU via PCIe, so GPU-to-NIC data must traverse the PCIe bus (~64 GB/s per direction for Gen5 x16), creating a bottleneck for scale-out training. On Gaudi, the RDMA NIC reads directly from the accelerator's HBM with no PCIe traversal, eliminating the GPU-NIC bottleneck and letting the NICs drive their full line rate straight from HBM.

Synapse AI Software Stack

Architecture

// Synapse AI stack (bottom to top):
// 1. Device drivers and firmware (HAL)
// 2. Graph Compiler (optimization, scheduling, code generation)
// 3. Runtime (memory management, execution, synchronization)
// 4. Collective communication library (HCCL — Habana CCL)
// 5. Framework bridges:
//    - PyTorch: Habana HPU backend (torch.hpu)
//    - Hugging Face Optimum-Habana
//    - DeepSpeed integration

# PyTorch on Gaudi (lazy mode):
import torch
import habana_frameworks.torch.core as htcore

# Move model and data to HPU (Habana Processing Unit)
model = model.to("hpu")
data, target = data.to("hpu"), target.to("hpu")

# Forward pass: operations are recorded into a graph, not executed eagerly
output = model(data)
loss = criterion(output, target)

# Backward pass
loss.backward()
optimizer.step()

# mark_step() triggers Graph Compiler compilation and execution
# of the accumulated graph
htcore.mark_step()

Software Maturity Comparison

📊 Software Ecosystem Maturity: Gaudi vs NVIDIA GPU

| Capability | NVIDIA CUDA | Gaudi Synapse | Gap |
|---|---|---|---|
| Framework support | PyTorch, TF, JAX, Triton | PyTorch (primary), TF | Moderate |
| Custom kernels | CUDA C++, Triton, CUTLASS | TPC-C (limited docs) | Large |
| Profiling | Nsight Compute (detailed) | Synapse Profiler (basic) | Large |
| Model coverage | Nearly all models | Major models (LLaMA, BERT, GPT) | Moderate |
| Quantization tools | TensorRT, NVIDIA Quant Toolkit | Synapse Quantization Toolkit | Moderate |
| Distributed training | NCCL, Megatron-LM, DeepSpeed | HCCL, DeepSpeed integration | Small |
| Inference serving | TensorRT, vLLM, TGI | vLLM (Gaudi backend), TGI | Moderate |
| Community support | Massive (millions of devs) | Small (thousands) | Very large |
Note: The software gap is Gaudi's primary competitive disadvantage. Hardware performance is competitive with H100 on supported models; software breadth and depth significantly lag CUDA.
⚠️ Software Is Gaudi's Biggest Challenge

Gaudi hardware is competitive with H100 on paper. In practice, the Graph Compiler must optimize each model architecture specifically. New model architectures (e.g., Mixture of Experts with dynamic routing, State Space Models) may not be well-optimized until Habana engineers update the compiler. On NVIDIA, researchers can write custom CUDA kernels for day-one support of new architectures. This software velocity gap is the primary reason Gaudi has not achieved broader adoption.

ROI Analysis

Hardware Cost

// Server pricing (approximate, 2024):
// DGX H100 (8x H100 SXM): ~$200,000 - $300,000
// Intel Gaudi 2 server (8x Gaudi 2): ~$65,000 - $80,000
// Cost ratio: Gaudi 2 is ~3-4x cheaper per server

// Cloud pricing (Google Cloud, AWS):
// NVIDIA H100 (on-demand): ~$11-13/hr per GPU
// NVIDIA A100 (on-demand): ~$4-5/hr per GPU
// Intel Gaudi (AWS dl1, first-generation): ~$4-5/hr per accelerator (comparable to A100)
// Intel Gaudi 2 (Intel Developer Cloud): roughly similar per-accelerator rates
// Intel Gaudi 3 (not yet widely available): ~$6-8/hr per accelerator (est.)

Training Cost Comparison

// Workload: LLaMA-2 7B fine-tuning, 1 epoch, 100K samples
//
// H100 (FP8, 1 GPU):
//   Throughput: ~3,500 tokens/second
//   Time: ~8 hours
//   Cost: 8 hrs × $11/hr = $88
//
// Gaudi 2 (BF16, 1 accelerator):
//   Throughput: ~2,800 tokens/second (80% of H100)
//   Time: ~10 hours
//   Cost: 10 hrs × $4.50/hr = $45
//
// Cost ratio: Gaudi 2 is ~49% cheaper despite being ~20% slower
//
// LLaMA-2 70B full training (1T tokens):
// 8x H100 cluster: ~30 days × 24 hrs × $88/hr = $63,360
// 8x Gaudi 2 cluster: ~40 days × 24 hrs × $36/hr = $34,560
// Cost ratio: Gaudi 2 is ~45% cheaper, but 33% slower
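The cluster figures above reduce to a one-line cost model:

```python
# Training cost = days of cluster time x 24 hours x chips x hourly rate per chip.

def training_cost_usd(days, n_chips, usd_per_chip_hour):
    return days * 24 * n_chips * usd_per_chip_hour

h100_cost   = training_cost_usd(30, 8, 11.00)   # 63,360
gaudi2_cost = training_cost_usd(40, 8, 4.50)    # 34,560
saving_frac = 1 - gaudi2_cost / h100_cost       # ~0.45 -> ~45% cheaper
```

Plugging in your own throughput-derived day counts and negotiated rates is the honest version of this comparison; the published numbers are estimates.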

Inference Cost Comparison

// Workload: LLaMA-2 7B serving, 100 req/s, 512 output tokens each
//
// H100 (FP8):
//   Throughput: ~4,000 tokens/s (batched)
//   GPUs needed: ceil(100 × 512 / 4000) = 13 GPUs
//   Monthly cost: 13 × $11/hr × 720 hrs = $102,960
//
// Gaudi 2 (BF16):
//   Throughput: ~2,500 tokens/s (batched)
//   Accelerators needed: ceil(100 × 512 / 2500) = 21 accelerators
//   Monthly cost: 21 × $4.50/hr × 720 hrs = $68,040
//
// Cost ratio: Gaudi 2 is 34% cheaper despite needing 62% more chips
// The lower per-chip cost more than compensates for lower throughput
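The sizing arithmetic above as a reusable helper (720 hours/month, chip count rounded up):

```python
import math

def monthly_serving_cost(req_per_s, tokens_per_req, chip_tokens_per_s, usd_per_hr):
    """Chips needed to sustain the aggregate token rate, and their monthly cost."""
    chips = math.ceil(req_per_s * tokens_per_req / chip_tokens_per_s)
    return chips, chips * usd_per_hr * 720

h100   = monthly_serving_cost(100, 512, 4000, 11.00)   # (13, 102960.0)
gaudi2 = monthly_serving_cost(100, 512, 2500, 4.50)    # (21, 68040.0)
```

Note the model assumes sustained batched throughput; latency SLOs or bursty traffic would change the chip counts.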

Monthly Inference Cost: LLaMA-2 7B at 100 req/s

($/month)
H100 (FP8, 13 GPUs): $102,960/month — high perf, high cost
Gaudi 2 (BF16, 21 chips): $68,040/month — 34% cheaper
A100 (FP16, 22 GPUs): $79,200/month — between the two

When Gaudi ROI Is Positive

  1. Supported model architectures: If your model (BERT, GPT, LLaMA, Stable Diffusion) is well-optimized in Synapse, Gaudi delivers 30-50% cost savings vs H100.

  2. Large-scale training where time-to-completion is flexible: If you can tolerate 20-30% longer training for 40-50% cost reduction, Gaudi is the better investment.

  3. Inference workloads with predictable models: Serving BERT for search ranking, T5 for text generation, or LLaMA for chatbots — all well-supported.

When Gaudi ROI Is Negative

  1. Cutting-edge research with new architectures: If you need to implement custom attention variants, novel loss functions, or new parallelism strategies, the CUDA ecosystem is dramatically faster to develop in.

  2. Time-sensitive training: When you need a model trained by a deadline (product launch), the 20-30% speed deficit matters more than the cost savings.

  3. Ecosystem lock-in concerns: NVIDIA’s CUDA ecosystem has decades of investment. Gaudi’s Synapse may face uncertain long-term support given Intel’s financial challenges.

🚨 Intel's Financial Situation and Gaudi's Future

Intel has faced significant financial pressures (2023-2025), including layoffs, foundry delays, and strategic restructuring. The Gaudi product line’s long-term future depends on Intel’s commitment to the AI accelerator market. Enterprises considering multi-year investments should factor this uncertainty into their ROI calculations. A 40% cost savings is meaningless if the platform is discontinued or support degrades.

Gaudi 3: Closing the Gap

Specifications

// Gaudi 3 key improvements over Gaudi 2:
// Compute: 1,835 BF16 TFLOPS (4.2x Gaudi 2, 1.85x H100)
// FP8: 3,670 TFLOPS (4.2x Gaudi 2, 1.85x H100)
// HBM: 128 GB HBM2e at 3,700 GB/s (1.5x Gaudi 2, ~1.1x H100)
// Networking: 24 × 200G = 600 GB/s (2x Gaudi 2)
// TPC count: 64 (2.7x Gaudi 2)
// Process: 5nm (vs 7nm Gaudi 2)
//
// Gaudi 3 is competitive with H100 on paper:
// Compute: 1.85x H100 (surprising given Intel's track record)
// Bandwidth: 1.1x H100
// Networking: ~67% of NVLink 4.0 (900 GB/s)
// HBM capacity: 1.6x H100

The Real Question: Software

// Gaudi 3 hardware matches or exceeds H100
// But hardware is necessary, not sufficient
//
// The question: can the Graph Compiler extract the hardware performance?
// On Gaudi 2, achieved throughput is typically 70-85% of theoretical
// On H100, CUDA libraries achieve 80-95% of theoretical
//
// If Gaudi 3 Graph Compiler achieves 80% efficiency:
//   1,835 × 0.80 = 1,468 TFLOPS effective
// If H100 CUDA achieves 90% efficiency:
//   990 × 0.90 = 891 TFLOPS effective
// Gaudi 3 effective advantage: 1.65x
//
// But if Gaudi 3 Graph Compiler only achieves 70%:
//   1,835 × 0.70 = 1,285 TFLOPS effective
// Advantage shrinks to 1.44x — still significant but less compelling
// And software development velocity (time to optimize new models) matters
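The same arithmetic, swept over efficiency assumptions:

```python
# Sensitivity of the Gaudi 3 vs H100 comparison to achieved compiler efficiency.
# Peak BF16 TFLOPS: Gaudi 3 = 1,835, H100 = 990; H100 efficiency held at 90%.

def effective_advantage(gaudi_eff, gaudi_peak=1835, h100_peak=990, h100_eff=0.90):
    return (gaudi_peak * gaudi_eff) / (h100_peak * h100_eff)

ratios = {eff: round(effective_advantage(eff), 2) for eff in (0.70, 0.80, 0.90)}
# {0.7: 1.44, 0.8: 1.65, 0.9: 1.85}
```

The advantage stays above 1.4x across the whole plausible range; the open question is how long the compiler takes to reach the higher efficiencies on each new model.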

Summary

Intel Gaudi represents a viable alternative to NVIDIA GPUs for specific workload profiles. The architecture’s strengths — independent MME/TPC engines for overlapped execution, integrated RDMA networking for lower infrastructure cost, and competitive raw compute throughput on Gaudi 3 — address real bottlenecks in AI training and inference infrastructure.

The ROI calculation favors Gaudi for cost-conscious organizations running well-supported models at scale, with 30-50% cost savings typical for training and inference of standard architectures. The calculation disfavors Gaudi for organizations that need cutting-edge research flexibility, custom kernel development, or the certainty of the CUDA ecosystem’s long-term support.

The honest assessment: Gaudi hardware is good enough to win on price-performance for many workloads. The software ecosystem is the bottleneck — narrower model coverage, less profiling capability, smaller community, and uncertain long-term investment from Intel. If Intel resolves the software and commitment questions, Gaudi could capture significant market share. If not, it remains a niche cost-optimization play.