Intel’s Habana Gaudi2 presents a fundamentally different memory architecture than NVIDIA GPUs. Understanding these differences is essential for optimizing LLM inference workloads on Gaudi hardware.
Gaudi2 Memory Architecture Overview
Figure: Gaudi2 memory hierarchy. L0: 256KB per TPC; L1: 24MB total; SRAM: 48MB; HBM: 96GB.
Key architectural differences from NVIDIA:
- 24 Tensor Processor Cores (TPCs) vs NVIDIA’s streaming multiprocessors
- Large on-chip SRAM (48MB) vs the A100’s 40MB L2 cache
- Different programming model: Graph-based compilation vs CUDA kernels
Memory Bandwidth Characteristics
Gaudi2 has a theoretical peak HBM bandwidth of 2.45 TB/s, but sustained bandwidth depends heavily on access patterns:
Gaudi2 Memory Bandwidth by Access Pattern
| Access Pattern | Achieved BW | Efficiency | Notes |
|---|---|---|---|
| Sequential 256B | 2.38 TB/s | 97% | Optimal |
| Sequential 64B | 2.21 TB/s | 90% | Good |
| Strided 256B | 1.84 TB/s | 75% | Common in attention |
| Random 64B | 0.73 TB/s | 30% | Avoid |
| Random 32B | 0.42 TB/s | 17% | Very bad |
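The efficiency column is simply achieved bandwidth divided by the 2.45 TB/s theoretical peak. A small sketch that reproduces it from the table's numbers (the constant and function names are illustrative, not part of any Habana tool):

```python
PEAK_HBM_TBPS = 2.45  # theoretical peak quoted above

# Achieved bandwidth figures copied from the table (TB/s)
ACHIEVED_TBPS = {
    "Sequential 256B": 2.38,
    "Sequential 64B": 2.21,
    "Strided 256B": 1.84,
    "Random 64B": 0.73,
    "Random 32B": 0.42,
}


def efficiency_percent(achieved_tbps: float, peak_tbps: float = PEAK_HBM_TBPS) -> int:
    """Return achieved bandwidth as a rounded percentage of peak."""
    return round(100 * achieved_tbps / peak_tbps)


if __name__ == "__main__":
    for pattern, bw in ACHIEVED_TBPS.items():
        print(f"{pattern:16s} {bw:.2f} TB/s  {efficiency_percent(bw)}%")
```

The same function can be pointed at your own profiler measurements to see which row of the table a workload most resembles.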
TPC Memory Access Optimization
The TPC is Gaudi’s programmable compute unit. Each TPC has:
- 256KB local SRAM (comparable to CUDA shared memory)
- Vector and scalar execution units
- Hardware prefetch capabilities
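// Illustrative pseudocode for a TPC matrix-vector kernel. Helper names such as
// get_tpc_id and async_load_256b_aligned are assumed, not verified TPC-C intrinsics.
// Key: keep accesses 256-byte aligned and stage the vector in local SRAM.
#define VEC_CHUNK 256  // floats per cached vector chunk (1KB of local SRAM)

void tpc_optimized_matvec(
    __global__ const float* __restrict__ matrix,  // [M, K], row-major
    __global__ const float* __restrict__ vector,  // [K]
    __global__ float* __restrict__ output,        // [M]
    int M, int K
) {
    // TPC local memory allocation: cache one vector chunk in SRAM
    __local__ float vec_cache[VEC_CHUNK];

    // Each TPC owns a contiguous block of rows (ROWS_PER_TPC is a build-time constant)
    int row_start = get_tpc_id() * ROWS_PER_TPC;
    int row_end = min(row_start + ROWS_PER_TPC, M);

    // Zero the outputs first, because partial sums are accumulated chunk by chunk
    for (int row = row_start; row < row_end; row++) {
        output[row] = 0.0f;
    }

    // Process the vector in chunks that fit in local memory
    for (int k_base = 0; k_base < K; k_base += VEC_CHUNK) {
        int k_len = min(VEC_CHUNK, K - k_base);

        // Collaborative load: all TPCs load the same vector chunk,
        // so the HW broadcast optimization kicks in
        async_load_256b_aligned(&vec_cache[0], &vector[k_base], k_len);
        barrier();  // wait until the chunk has landed in vec_cache

        // Compute partial sums using the cached vector
        for (int row = row_start; row < row_end; row++) {
            float sum = 0.0f;
            #pragma unroll 8
            for (int k = 0; k < k_len; k++) {
                // Row accesses are 256-byte aligned only if K * sizeof(float)
                // is a multiple of 256; pad rows otherwise
                sum += matrix[row * K + k_base + k] * vec_cache[k];
            }
            output[row] += sum;
        }
        barrier();  // do not overwrite vec_cache while it is still being read
    }
}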
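The chunking logic of the kernel can be checked on the host with a plain-Python reference. This is a minimal sketch, not TPC code: `chunked_matvec` is a hypothetical helper, and the list slice `vec_cache` only stands in for the local-SRAM copy of the vector chunk.

```python
from typing import List, Sequence


def chunked_matvec(
    matrix: Sequence[Sequence[float]],
    vector: Sequence[float],
    chunk: int = 256,
) -> List[float]:
    """Matrix-vector product that walks the vector in fixed-size chunks."""
    num_rows = len(matrix)
    output = [0.0] * num_rows  # zero-initialise before accumulating partial sums
    for k_base in range(0, len(vector), chunk):
        # Stands in for the chunk of the vector staged in local SRAM
        vec_cache = vector[k_base:k_base + chunk]
        for row in range(num_rows):
            row_slice = matrix[row][k_base:k_base + chunk]
            output[row] += sum(m * v for m, v in zip(row_slice, vec_cache))
    return output


if __name__ == "__main__":
    m = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    v = [1.0, 1.0, 1.0]
    print(chunked_matvec(m, v, chunk=2))
```

Because the result does not depend on `chunk`, comparing a small chunk size against a single large chunk is a quick way to catch accumulation bugs such as an uninitialised output buffer.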
Gaudi2’s HBM controller achieves peak bandwidth only with 256-byte aligned accesses. Misaligned accesses can reduce effective bandwidth by 40%.
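One common way to satisfy the alignment requirement is to pad each row so its byte length is a multiple of 256. The helpers below are a sketch under two assumptions: the base address of the buffer is itself 256-byte aligned, and the element size divides 256 evenly. The function names are illustrative.

```python
ALIGNMENT_BYTES = 256  # access granularity quoted above


def align_up(n_bytes: int, alignment: int = ALIGNMENT_BYTES) -> int:
    """Round n_bytes up to the next multiple of alignment."""
    return ((n_bytes + alignment - 1) // alignment) * alignment


def padded_row_elems(num_cols: int, dtype_bytes: int,
                     alignment: int = ALIGNMENT_BYTES) -> int:
    """Elements per row after padding so every row starts on an aligned boundary."""
    if alignment % dtype_bytes != 0:
        raise ValueError("element size must divide the alignment")
    return align_up(num_cols * dtype_bytes, alignment) // dtype_bytes


if __name__ == "__main__":
    # A row of 1000 FP32 values occupies 4000 bytes and is padded to 4096 bytes
    print(padded_row_elems(1000, dtype_bytes=4))
```

For example, a matrix with 1000 FP32 columns would be stored with a row stride of 1024 elements, at the cost of about 2.4% extra memory.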
SRAM Utilization for KV Cache
Gaudi2’s 48MB on-chip SRAM can hold part of the KV cache, reducing HBM pressure. How much fits depends strongly on model size:
# Calculate optimal KV cache SRAM allocation
def calculate_sram_kv_allocation(
    num_layers: int,
    num_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
    seq_len: int = 2048,   # context length used for the per-layer estimate
) -> dict:
    """
    Determine how much KV cache can fit in Gaudi2 SRAM.
    """
    sram_budget = 48 * 1024 * 1024  # 48MB
    reserved_for_compute = 8 * 1024 * 1024  # 8MB for intermediate buffers
    available_sram = sram_budget - reserved_for_compute

    # KV cache entry size per token per layer (the factor of 2 covers K and V)
    kv_per_token_per_layer = 2 * num_heads * head_dim * dtype_bytes

    # Total KV per token across all layers
    kv_per_token_total = kv_per_token_per_layer * num_layers

    # Tokens that fit in SRAM
    tokens_in_sram = available_sram // kv_per_token_total

    # For Llama-70B (80 layers, 64 heads, 128 dim):
    #   kv_per_token = 2 * 64 * 128 * 2 * 80 = 2,621,440 bytes (2.5 MiB) per token
    #   tokens_in_sram = 40 MiB / 2.5 MiB = 16 tokens
    # With grouped-query attention, pass the number of KV heads as num_heads.

    # Alternative: cache a subset of layers, each holding seq_len tokens
    layers_to_cache = available_sram // (kv_per_token_per_layer * seq_len)

    return {
        'full_cache_tokens': tokens_in_sram,
        'partial_cache_layers': layers_to_cache,
        'recommendation': 'partial' if tokens_in_sram < 256 else 'full'
    }
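To make the arithmetic concrete, the sketch below works through the Llama-70B case twice: once treating all 64 heads as KV heads, and once assuming 8 KV heads, as in Llama 2 70B's grouped-query attention. The helper names and the 40 MiB budget (48MB minus the 8MB reserve used above) are illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * num_kv_heads * head_dim * dtype_bytes * num_layers


def tokens_that_fit(budget_bytes: int, per_token_bytes: int) -> int:
    """Whole tokens whose KV entries fit in the given byte budget."""
    return budget_bytes // per_token_bytes


AVAILABLE_SRAM = 40 * 1024 * 1024  # assumed budget: 48MB minus 8MB reserved

if __name__ == "__main__":
    for label, kv_heads in (("64 KV heads", 64), ("8 KV heads (GQA)", 8)):
        per_token = kv_bytes_per_token(80, kv_heads, 128)
        fit = tokens_that_fit(AVAILABLE_SRAM, per_token)
        print(f"{label}: {per_token} bytes/token, {fit} tokens fit")
```

Even under the grouped-query assumption only 128 tokens fit across all 80 layers, which is why caching a subset of the data in SRAM is the more realistic option for models of this size.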
Prefetch Engine Configuration
Gaudi2’s hardware prefetcher requires explicit hints for optimal performance:
// Configure TPC prefetch for attention computation
void configure_attention_prefetch(
    AttentionConfig* config,
    int batch_size,
    int seq_len
) {
    // Set prefetch distance based on access pattern
    // Attention has predictable sequential K,V access
    PrefetchConfig pf_config = {
        .distance_lines = 8,  // Prefetch 8 cache lines ahead
        .stride = config->head_dim * sizeof(float16),
        .direction = PREFETCH_FORWARD,
        .enable_cross_tpc = true,  // Enable for multi-TPC cooperation
    };
    tpc_set_prefetch_config(&pf_config);

    // For V matrix access (different pattern)
    PrefetchConfig v_config = {
        .distance_lines = 4,
        .stride = seq_len * config->head_dim * sizeof(float16),
        .direction = PREFETCH_STRIDED,
        .enable_cross_tpc = true,
    };
    tpc_set_prefetch_config_secondary(&v_config);
}
Graph Compilation Optimization
Gaudi2 uses graph-based compilation. Memory optimization happens at graph level:
import habana_frameworks.torch as ht
import torch


def optimize_inference_graph(model, sample_input):
    """
    Configure graph compilation for optimal memory usage.
    """
    # Enable graph compilation with memory optimization
    ht.hpu.enable_inference_mode()

    # Configure memory optimization level
    # Level 2: Aggressive tensor reuse
    # Level 3: Cross-layer optimization (may increase compile time)
    ht.core.hpu_set_env("PT_HPU_LAZY_MEMORY_OPTIMIZATION_LEVEL", "3")

    # Enable SRAM offload for large tensors
    ht.core.hpu_set_env("PT_HPU_ENABLE_SRAM_OFFLOAD", "1")

    # Set maximum SRAM allocation per graph
    ht.core.hpu_set_env("PT_HPU_MAX_SRAM_PER_GRAPH_MB", "40")

    # Compile and cache the graph
    with torch.no_grad():
        # First run triggers compilation
        model(sample_input)
        ht.hpu.synchronize()

        # Second run uses cached graph
        model(sample_input)
        ht.hpu.synchronize()

    return model
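Because the first call pays the compilation cost and later calls reuse the cached graph, it helps to time them separately. The harness below is a device-agnostic sketch: `time_calls` is a hypothetical helper, and the optional `sync` callback is where a device synchronize would go so that queued work is included in the measurement.

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 3,
               sync: Callable[[], None] = lambda: None) -> List[float]:
    """Return wall-clock seconds for each of `repeats` consecutive calls to fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        sync()  # e.g. a device synchronize, so asynchronous work is counted
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace with a model call on real hardware
    timings = time_calls(lambda: sum(range(100_000)))
    print([f"{t * 1e3:.2f} ms" for t in timings])
```

On Gaudi this might be called as `time_calls(lambda: model(sample_input), sync=ht.hpu.synchronize)`. If the first entry is not clearly larger than the rest, the graph was probably already cached; if later entries stay large, the graph may be recompiling on every call.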
Memory Bandwidth Profiling
Use Habana’s profiler to analyze memory behavior:
# Enable detailed memory profiling
export HABANA_PROFILE=1
export HABANA_PROFILE_EVENTS="memory,tpc,dma"
# Run inference with profiling
python inference.py --model llama-70b --batch-size 8
# Analyze results
hl-prof-tools analyze --input profile_output/ --report memory
Figure: Memory Bandwidth Utilization Across Inference Phases (TB/s)
Comparison with A100
Memory Subsystem Comparison: Gaudi2 vs A100
| Metric | Gaudi2 | A100-80GB | Winner |
|---|---|---|---|
| HBM Capacity | 96GB | 80GB | Gaudi2 |
| HBM Bandwidth | 2.45 TB/s | 2.0 TB/s | Gaudi2 |
| On-chip SRAM | 48MB | 40MB L2 | Gaudi2 |
| SRAM Bandwidth | ~2.4 TB/s | ~5 TB/s | A100 |
| Memory Access Flexibility | Graph-constrained | Dynamic | A100 |
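To see what the HBM bandwidth gap means in practice, consider a memory-bound decode step in which every weight is streamed from HBM once. The sketch below gives a lower bound on step time from the table's bandwidth figures. It ignores KV cache traffic and compute time, and the 13B-parameter FP16 model is a hypothetical example chosen only because it fits on either device.

```python
def min_decode_step_ms(weight_bytes: float, bandwidth_tbps: float,
                       utilization: float = 1.0) -> float:
    """Lower bound (ms) on one decode step if all weights are read from HBM once."""
    bytes_per_second = bandwidth_tbps * 1e12 * utilization
    return weight_bytes / bytes_per_second * 1e3


WEIGHT_BYTES = 13e9 * 2  # hypothetical 13B-parameter model in FP16

if __name__ == "__main__":
    for name, bw in (("Gaudi2", 2.45), ("A100-80GB", 2.0)):
        peak = min_decode_step_ms(WEIGHT_BYTES, bw)
        at_60 = min_decode_step_ms(WEIGHT_BYTES, bw, utilization=0.60)
        print(f"{name}: {peak:.2f} ms at peak, {at_60:.2f} ms at 60% utilization")
```

The `utilization` parameter matters as much as the headline number: at 60% utilization, Gaudi2's effective bandwidth (about 1.47 TB/s) falls below the A100's theoretical peak.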
Practical Optimization Checklist
- Ensure 256-byte alignment for all tensor allocations
- Use SRAM for hot data (attention scores, small activations)
- Configure prefetch hints for predictable access patterns
- Enable graph memory optimization level 2 or 3
- Profile memory events to identify bandwidth bottlenecks
- Batch appropriately: Gaudi2 prefers larger batches for memory efficiency
Simply ensuring tensor alignment and enabling SRAM offload can improve memory-bound workload performance by 20-30% on Gaudi2.
Conclusion
Gaudi2’s memory subsystem offers compelling advantages for LLM inference: larger HBM, higher bandwidth, and substantial on-chip SRAM. However, realizing these advantages requires understanding the graph compilation model and explicitly optimizing memory access patterns. The techniques presented here can raise memory bandwidth utilization from a typical 60% to 85% or more.