Part of series: GPU Hardware & AI Accelerators (20 of 30)
GPU Power Efficiency: Performance per Watt, Dynamic Voltage Scaling, and Datacenter Power Budgets

The B200 consumes 1,000 watts. A single NVL72 rack containing 72 B200 GPUs draws over 120 kW — enough to power 40 average US homes. A 100,000-GPU training cluster requires 100-150 megawatts of power, comparable to a small city. At $0.05-0.10 per kWh, the electricity cost alone for a frontier model training run (3-6 months) reaches $15-45 million. Power is no longer a secondary concern — it is a primary cost driver and a hard physical constraint on system design.

The industry trend is clear: each GPU generation delivers more TFLOPS but at higher absolute power. The H100 at 700W was the first GPU to require liquid cooling as a standard configuration. The B200 at 1,000W makes liquid cooling mandatory. Datacenter operators are running into transformer capacity limits, grid interconnection queues, and physical space constraints for cooling infrastructure.

This post analyzes GPU power consumption at every level — from transistor-level DVFS to rack-level power budgets — and quantifies the performance-per-watt trends that determine the economics of AI computing.

Power Consumption Fundamentals

Dynamic and Static Power

GPU power consumption has two components:

P_{total} = P_{dynamic} + P_{static}

Dynamic power is consumed when transistors switch states:

P_{dynamic} = \alpha \cdot C \cdot V_{dd}^2 \cdot f

Where $\alpha$ is the activity factor (fraction of transistors switching per cycle), $C$ is the total switched capacitance, $V_{dd}$ is the supply voltage, and $f$ is the clock frequency.

Static power (leakage) flows through transistors even when they are not switching:

P_{static} = V_{dd} \cdot I_{leak}

Leakage current $I_{leak}$ increases exponentially with temperature and decreases with higher threshold voltage (which also reduces performance).

// Power breakdown for a typical data center GPU (H100):
// Dynamic power: ~70% of TDP (~490W out of 700W)
//   - Compute units (SMs): ~45% (~315W)
//   - Memory subsystem (HBM, L2): ~15% (~105W)
//   - Interconnect (NVLink): ~5% (~35W)
//   - Other (clock trees, IO): ~5% (~35W)
//
// Static power (leakage): ~30% of TDP (~210W)
//   - Increases with temperature
//   - Higher on advanced process nodes (smaller transistors leak more)
//   - Present even when GPU is idle but powered on
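The two equations above can be turned into a minimal power model. This is a sketch, not H100 data: the activity factor, switched capacitance, and leakage current below are hypothetical values chosen so the totals land near the ~490 W dynamic / ~210 W static split quoted for a 700 W TDP.

```python
# Toy model of P_total = P_dynamic + P_static.
# alpha, c_farads, and i_leak_amps are illustrative, not measured values.

def dynamic_power(alpha, c_farads, vdd, f_hz):
    """P_dynamic = alpha * C * Vdd^2 * f (watts)."""
    return alpha * c_farads * vdd**2 * f_hz

def static_power(vdd, i_leak_amps):
    """P_static = Vdd * I_leak (watts)."""
    return vdd * i_leak_amps

p_dyn = dynamic_power(0.15, 2.0e-6, 0.95, 1.83e9)  # ~495 W
p_stat = static_power(0.95, 220.0)                  # ~209 W
print(round(p_dyn), round(p_stat), round(p_dyn + p_stat))
```

Note how the V² term dominates: a small supply-voltage reduction cuts dynamic power quadratically, which is the lever DVFS exploits below.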

Dennard Scaling Breakdown

Classical Dennard scaling predicted that as transistors shrink, voltage and current scale proportionally, keeping power density constant. This broke down around the 28nm node:


Power Density Across Process Nodes

| Process Node | GPU Example | Die Area | TDP | Power Density (W/mm²) |
|---|---|---|---|---|
| 28nm | K80 (per die) | 561 mm² | 150 W | 0.27 |
| 16nm | P100 | 610 mm² | 300 W | 0.49 |
| 12nm | V100 | 815 mm² | 300 W | 0.37 |
| 7nm | A100 | 826 mm² | 400 W | 0.48 |
| 4nm | H100 | 814 mm² | 700 W | 0.86 |
| 4nm | B200 (total) | ~1,060 mm² | 1,000 W | 0.94 |

Note: Power density has more than tripled from K80 to B200. Post-Dennard, smaller transistors do not reduce voltage proportionally, so cramming more transistors into the same area increases power density.
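As a quick sanity check, the power-density column can be recomputed from the die areas and TDPs (values copied straight from the table; a small Python sketch):

```python
# Recompute power density (TDP / die area) for each entry in the table.
dies = {
    "K80 (per die)": (561, 150),   # (die area mm², TDP W)
    "P100": (610, 300),
    "V100": (815, 300),
    "A100": (826, 400),
    "H100": (814, 700),
    "B200 (total)": (1060, 1000),
}
density = {name: round(tdp / area_mm2, 2) for name, (area_mm2, tdp) in dies.items()}
growth = density["B200 (total)"] / density["K80 (per die)"]  # ~3.5x K80 → B200
print(density, round(growth, 1))
```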

Dynamic Voltage and Frequency Scaling (DVFS)

How DVFS Works on GPUs

NVIDIA GPUs continuously adjust voltage and frequency based on workload, temperature, and power limits:

// DVFS operating points (H100, approximate):
// Base clock: 1,095 MHz at ~0.75V
// Boost clock: 1,830 MHz at ~0.95V
// Maximum (overclocked): ~2,000 MHz at ~1.05V
//
// Power scales with V² × f:
// At base (1,095 MHz, 0.75V): P ∝ 0.75² × 1.095 = 0.616 (relative)
// At boost (1,830 MHz, 0.95V): P ∝ 0.95² × 1.830 = 1.652 (relative)
// Ratio: 2.68x more power for 1.67x more frequency
//
// Performance scales linearly with frequency (approximately):
// Performance increase: 1.67x
// Power increase: 2.68x
// Efficiency loss: 1.67 / 2.68 = 0.623 → 37.7% less efficient at boost clock
//
// This is the fundamental tradeoff:
// Higher frequency = more performance but disproportionately more power

GPU Boost Behavior

# Monitor GPU clock and power in real-time
nvidia-smi dmon -d 1 -s pc
# pwr  clk  mclk
# 680  1830  2619   # Full boost: high power, high clock
# 672  1815  2619   # Slight thermal throttle
# 450  1530  2619   # Power-capped: reduced clock to stay under TDP
# 85   210   2619   # Idle: minimal clock

# Query current clocks and power limits
nvidia-smi -q -d CLOCK,POWER
# Current GPU clock: 1830 MHz
# Max GPU clock: 1830 MHz
# Current memory clock: 2619 MHz
# Power limit: 700 W
# Enforced power limit: 700 W
# Current power draw: 672 W

The Power-Frequency Curve

\text{perf\_per\_watt}(f) = \frac{f}{\alpha \cdot C \cdot V(f)^2 \cdot f} = \frac{1}{\alpha \cdot C \cdot V(f)^2}

Since $V(f)$ increases with $f$ (higher voltage is needed to sustain higher frequency), performance per watt decreases at higher clocks. The most efficient operating point is the lowest voltage that sustains a given frequency.

H100 Performance per Watt vs Clock Frequency (% of peak efficiency):

- 1,100 MHz (base): 100% (baseline)
- 1,300 MHz: 93%
- 1,500 MHz: 84%
- 1,700 MHz: 73%
- 1,830 MHz (boost): 63% (a 37% perf/watt loss)

The Optimal Clock Is Not the Maximum Clock

For datacenter deployments where total cost includes electricity, the optimal clock frequency is below the maximum boost. Running at 80% of boost clock typically achieves 90% of peak performance at 70% of peak power — a significantly better perf/watt operating point. NVIDIA provides power capping features to enforce this.
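The boost-vs-base tradeoff can be reproduced directly from P ∝ V² · f, using the approximate voltage/frequency pairs quoted earlier (a sketch with those assumed operating points, not measured data):

```python
# Relative dynamic power and perf/watt at the base and boost DVFS points.
def relative_power(vdd, f_mhz):
    """P ∝ V² · f (arbitrary units)."""
    return vdd**2 * (f_mhz / 1000.0)

base = relative_power(0.75, 1095)   # base clock at ~0.75 V
boost = relative_power(0.95, 1830)  # boost clock at ~0.95 V
power_ratio = boost / base          # ~2.68x more power at boost
perf_ratio = 1830 / 1095            # ~1.67x more frequency
efficiency = perf_ratio / power_ratio
print(round(power_ratio, 2), round(perf_ratio, 2), round(efficiency, 2))
```

The efficiency figure (~0.62) is exactly the "63% efficiency at boost" point on the curve above.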

Power Capping

Setting Power Limits

# Set GPU power limit (requires root/admin)
nvidia-smi -i 0 -pl 500
# GPU 0: power limit set to 500 W (default: 700 W)

# The GPU will reduce clock frequency to stay within 500 W
# Performance impact: empirically ≈ sqrt of the power reduction
# 500/700 = 71.4% of power → ~85% of performance
# (P ∝ V²f and performance ∝ f; voltage drops along with frequency,
#  so performance falls more slowly than power)

# Verify the power limit
nvidia-smi -q -d POWER -i 0
# Power Limit: 500 W
# Default Power Limit: 700 W
# Enforced Power Limit: 500 W
# Min Power Limit: 200 W
# Max Power Limit: 700 W
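The sqrt heuristic from the comments above can be written as a one-liner. It is an approximation for compute-bound work, not a guarantee — memory-bound kernels degrade far less, as the next table shows.

```python
# Estimated performance fraction under a power cap (sqrt heuristic).
import math

def capped_perf_fraction(cap_w, default_w=700):
    """Approximate perf retained when capping from default_w to cap_w."""
    return math.sqrt(cap_w / default_w)

print(round(capped_perf_fraction(500), 2))  # ~0.85 of full performance at 500 W
```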

Power Capping Impact on Workloads


Power Cap Impact on H100 Workload Performance

Power CapGEMM ThroughputLLM Inference (decode)Memory-Bound KernelEfficiency (TFLOPS/W)
700 W (default) 990 TFLOPS 100% 100% 1.41
600 W (86%) 920 TFLOPS 97% 99% 1.53
500 W (71%) 820 TFLOPS 93% 98% 1.64
400 W (57%) 680 TFLOPS 85% 96% 1.70
300 W (43%) 520 TFLOPS 72% 93% 1.73
200 W (29%) 340 TFLOPS 55% 85% 1.70
Note: Memory-bound kernels (LLM decode, attention) are less affected by power capping because the bottleneck is HBM bandwidth, not compute frequency. Compute-bound kernels (GEMM) scale more directly with clock reduction. Peak efficiency is at 300-400W — well below the 700W TDP.
// Programmatic power management via NVML
#include <stdio.h>
#include <nvml.h>

void set_power_limit(unsigned int gpu_id, unsigned int watts) {
    nvmlDevice_t device;
    if (nvmlDeviceGetHandleByIndex(gpu_id, &device) != NVML_SUCCESS) return;

    // Set power limit (NVML expects milliwatts)
    nvmlReturn_t ret = nvmlDeviceSetPowerManagementLimit(device, watts * 1000);
    if (ret == NVML_SUCCESS) {
        printf("GPU %u: power limit set to %u W\n", gpu_id, watts);
    }

    // Query actual power draw (reported in milliwatts)
    unsigned int power_mw;
    if (nvmlDeviceGetPowerUsage(device, &power_mw) == NVML_SUCCESS) {
        printf("GPU %u: current power draw: %u mW\n", gpu_id, power_mw);
    }
}

// Set all GPUs to 500W for maximum efficiency
void optimize_fleet_power(unsigned int num_gpus) {
    nvmlInit();
    for (unsigned int i = 0; i < num_gpus; i++) {
        set_power_limit(i, 500);
    }
    nvmlShutdown();
}

ℹ️ Memory-Bound Workloads Benefit Most from Power Capping

LLM inference decode is memory-bandwidth-bound. The GPU’s compute units are mostly idle, waiting for HBM data. Reducing clock frequency (via power cap) barely affects HBM bandwidth (memory clock is independent of GPU clock). A 30% power reduction may only cause a 5-7% throughput reduction for decode — yielding a massive efficiency improvement. Compute-bound workloads (GEMM, prefill) see more proportional performance loss.

Performance per Watt Across Generations

The Historical Trend


AI Performance per Watt Across GPU Generations

GPUYearFP8/FP16 TFLOPSTDP (W)TFLOPS/WImprovement
P100 (FP16) 2016 21.2 300 0.071 Baseline
V100 (FP16 Tensor) 2017 125 300 0.417 5.9x
A100 (FP16 Tensor) 2020 312 400 0.780 11.0x
H100 (FP8 Tensor) 2022 1,979 700 2.827 39.8x
B200 (FP8 Tensor) 2024 4,500 1,000 4.500 63.4x
B200 (FP4 Tensor) 2024 9,000 1,000 9.000 126.8x
Note: Performance per watt has improved 127x from P100 to B200 FP4 over 8 years. This improvement comes from tensor cores (dedicated matrix hardware), lower precision (FP16 → FP8 → FP4), and process node shrinks.
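The TFLOPS/W and improvement columns follow directly from the peak-throughput and TDP figures (values copied from the table above; a quick Python check):

```python
# Recompute efficiency and improvement-over-P100 from the table data.
gens = [
    ("P100 (FP16)", 21.2, 300),       # (name, peak TFLOPS, TDP W)
    ("V100 (FP16 Tensor)", 125, 300),
    ("A100 (FP16 Tensor)", 312, 400),
    ("H100 (FP8 Tensor)", 1979, 700),
    ("B200 (FP8 Tensor)", 4500, 1000),
    ("B200 (FP4 Tensor)", 9000, 1000),
]
baseline = gens[0][1] / gens[0][2]  # P100 TFLOPS/W
for name, tflops, tdp in gens:
    eff = tflops / tdp
    print(f"{name}: {eff:.3f} TFLOPS/W ({eff / baseline:.1f}x vs P100)")
```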

TFLOPS per Watt (FP8-Equivalent) Across Generations:

- P100 (2016, FP16): 0.07 TFLOPS/W
- V100 (2017, FP16 TC): 0.42 TFLOPS/W
- A100 (2020, FP16 TC): 0.78 TFLOPS/W
- H100 (2022, FP8 TC): 2.83 TFLOPS/W
- B200 (2024, FP8 TC): 4.50 TFLOPS/W

Where the Efficiency Gains Come From

// Efficiency improvement decomposition (P100 → B200):
//
// 1. Tensor cores: ~10x
//    P100: FP32 CUDA cores, 2 FLOPs per core per cycle
//    V100+: Tensor cores, 128-1024+ FLOPs per core per cycle
//    Same silicon area, 10x more useful work
//
// 2. Lower precision: ~4x
//    FP32 (P100) → FP16 (V100) → FP8 (H100) → FP4 (B200)
//    Each halving of precision doubles throughput at constant power
//    (half the bits = half the data movement + half the multiply energy)
//
// 3. Process node improvement: ~2x
//    16nm (P100) → 12nm (V100) → 7nm (A100) → 4nm (H100/B200)
//    Smaller transistors switch with less energy
//    ~30-40% energy reduction per node generation
//
// 4. Architecture optimization: ~1.5x
//    Better data reuse (larger on-chip buffers)
//    TMA (Tensor Memory Accelerator) reduces data movement
//    Transformer Engine automates precision selection
//
// Combined: 10 × 4 × 2 × 1.5 ≈ 120x (matches observed 127x)

Datacenter Power and Cooling

Power Hierarchy

// Datacenter power flow for an AI training cluster:
// Utility power (medium voltage) → Transformer → UPS → PDU → Server → GPU
//
// Losses at each stage:
// Transformer: ~2% loss
// UPS (online double-conversion): ~5-8% loss
// PDU (power distribution unit): ~2-3% loss
// Server PSU (AC-DC conversion): ~5-8% loss
// Voltage regulation modules (VRM): ~3-5% loss
//
// Power Usage Effectiveness (PUE):
// PUE = Total facility power / IT equipment power
// Best-in-class datacenter: PUE = 1.1 (10% overhead for cooling + infrastructure)
// Average datacenter: PUE = 1.3-1.5 (30-50% overhead)
//
// For a 100 MW IT load at PUE 1.2:
// Total facility power: 120 MW
// Cooling: ~15 MW
// Infrastructure (lighting, networking, storage): ~5 MW
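The PUE arithmetic above is simple enough to capture in two functions — a sketch using the same assumed $0.07/kWh rate and 8,760 hours/year:

```python
# Facility power and annual electricity cost from IT load and PUE.
def facility_mw(it_mw, pue):
    """Total facility power = IT load × PUE."""
    return it_mw * pue

def annual_cost_musd(total_mw, usd_per_kwh=0.07):
    """Annual electricity cost in $ millions (MW → kW, × rate, × hours/year)."""
    return total_mw * 1000 * usd_per_kwh * 8760 / 1e6

total = facility_mw(100, 1.2)                      # 120 MW facility power
print(total, round(annual_cost_musd(total), 1))    # ~$73.6M/year
```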

Per-Rack Power Budgets


AI Server Rack Power Density

SystemGPUs per RackGPU PowerTotal Rack PowerCooling
DGX A100 (4 servers) 32 A100 12.8 kW ~25 kW Air-cooled
DGX H100 (4 servers) 32 H100 22.4 kW ~40 kW Liquid-cooled (GPU)
NVL72 (single rack) 72 B200 72 kW ~120 kW Full liquid cooling
TPU v4 pod slice (rack) 64 TPU v4 ~11.2 kW ~20 kW Liquid-cooled
Note: NVL72 at 120 kW per rack is 3x higher than the traditional 40 kW datacenter rack limit. Purpose-built facilities with rear-door heat exchangers or direct-to-chip liquid cooling are required.

Liquid Cooling Requirements

// Air cooling capacity:
// Traditional datacenter rack: 10-20 kW
// High-density air cooling: 30-40 kW (with hot/cold aisle containment)
// Beyond 40 kW: air cooling is impractical (fan power and airflow become prohibitive)

// Liquid cooling options:
// 1. Rear-door heat exchanger (RDHx):
//    Water loop attached to rack rear door
//    Capacity: 40-80 kW per rack
//    No modifications to servers — air inside rack, water at boundary
//    Used by: DGX H100 (initial deployments)

// 2. Direct-to-chip (cold plate):
//    Water-cooled cold plates bolted to GPU and CPU
//    Capacity: 100+ kW per rack
//    Requires plumbing inside the server
//    Used by: NVL72, DGX GB200

// 3. Immersion cooling:
//    Entire server submerged in dielectric fluid
//    Capacity: 200+ kW per rack
//    Extreme but effective for highest densities
//    Used by: some HPC installations, experimental AI clusters

// Cost comparison (per rack):
// Air cooling: ~$5,000-10,000 (fans, ducting, CRAC)
// RDHx: ~$15,000-30,000 (water distribution, heat exchanger)
// Direct-to-chip: ~$30,000-60,000 (plumbing, CDU, manifolds)
// Immersion: ~$50,000-100,000 (tank, fluid, CDU)

The Cooling Efficiency Opportunity

Liquid cooling not only handles more heat — it is more efficient than air cooling:

// Air cooling PUE contribution: ~0.3-0.5 (adds 30-50% power for cooling)
// Water cooling PUE contribution: ~0.05-0.15 (adds 5-15% for cooling)
//
// For a 100 MW IT load:
// Air-cooled datacenter (PUE 1.4): 140 MW total, 40 MW for cooling
// Water-cooled datacenter (PUE 1.1): 110 MW total, 10 MW for cooling
// Savings: 30 MW → at $0.07/kWh = $18.4 million/year
//
// The capital cost of liquid cooling infrastructure is repaid
// by electricity savings within 1-2 years for large AI clusters
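The savings calculation above generalizes to a small function; the $30M retrofit capex in the example is a hypothetical figure for illustration, in line with the per-rack cost ranges listed earlier:

```python
# Annual electricity savings from a PUE reduction, plus rough payback period.
def annual_savings_usd(it_mw, pue_air, pue_liquid, usd_per_kwh=0.07):
    """Savings from running the same IT load at a lower PUE."""
    saved_kw = it_mw * 1000 * (pue_air - pue_liquid)
    return saved_kw * usd_per_kwh * 8760

savings = annual_savings_usd(100, 1.4, 1.1)  # ~$18.4M/year for a 100 MW IT load
payback = 30e6 / savings                     # hypothetical ~$30M retrofit capex
print(round(savings / 1e6, 1), round(payback, 1))
```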

Power Optimization Strategies

Strategy 1: Workload-Aware Power Capping

# Dynamic power management for inference serving
# Reduce power during low-traffic periods

import pynvml
pynvml.nvmlInit()

def adjust_power_for_load(gpu_id, request_rate, max_rate):
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
    default_limit = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)

    if request_rate < max_rate * 0.3:
        # Low load: reduce to 60% power
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, int(default_limit * 0.6))
    elif request_rate < max_rate * 0.7:
        # Medium load: reduce to 80% power
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, int(default_limit * 0.8))
    else:
        # High load: full power
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, default_limit)

Strategy 2: Precision Selection for Efficiency

// Lower precision = fewer bit operations = less power per operation
//
// Energy per operation (approximate):
// FP32 multiply: ~3.7 pJ
// FP16 multiply: ~1.1 pJ
// FP8 multiply:  ~0.4 pJ
// FP4 multiply:  ~0.15 pJ
// INT8 multiply: ~0.2 pJ
//
// Moving from FP16 to FP8 reduces compute energy by ~2.75x
// Plus: half the data movement (half the bits through memory hierarchy)
// Total energy savings: ~3-4x for FP8 vs FP16
//
// This is why FP8 inference is not just faster — it's fundamentally more
// energy-efficient. The same result with ~3-4x less power per operation.
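Using the pJ/op figures above, the relative compute-energy savings between precisions reduce to simple ratios:

```python
# Relative multiply energy per operation (pJ figures from the list above).
energy_pj = {"FP32": 3.7, "FP16": 1.1, "FP8": 0.4, "FP4": 0.15, "INT8": 0.2}
fp16_to_fp8 = energy_pj["FP16"] / energy_pj["FP8"]
fp32_to_fp8 = energy_pj["FP32"] / energy_pj["FP8"]
print(round(fp16_to_fp8, 2), round(fp32_to_fp8, 2))  # 2.75x and 9.25x less energy
```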

Strategy 3: Batch Size Optimization

// Batching amortizes fixed power costs (leakage, memory refresh, clock trees)
// across more useful work
//
// H100 inference power breakdown (batch=1):
// Static (leakage + memory refresh): ~250 W (fixed regardless of work)
// Dynamic (compute): ~50 W (very low — GPU mostly idle, waiting on memory)
// Total: ~300 W for 1 token's work
// Energy per token: 300 W / 25 tokens/s = 12 J/token
//
// H100 inference power breakdown (batch=64):
// Static: ~250 W (same)
// Dynamic: ~400 W (compute units heavily utilized)
// Total: ~650 W for 64 tokens' work
// Per-token throughput: ~1,600 tokens/s
// Energy per token: 650 W / 1,600 tokens/s = 0.41 J/token
//
// Batching reduces energy per token by 29x (12 / 0.41)
// This is the single most important power optimization for inference

Energy per Token vs Batch Size (H100, LLaMA-70B FP8; joules per token, lower is better):

- Batch 1: 12.0 J/token (GPU mostly idle)
- Batch 4: 4.2 J/token (some amortization)
- Batch 16: 1.5 J/token (efficient)
- Batch 64: 0.41 J/token (near-optimal)
- Batch 256: 0.35 J/token (diminishing returns)
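The batch-1 and batch-64 points follow directly from the power breakdown above: static power is fixed while dynamic power and throughput grow with batch size (a toy model using those same figures):

```python
# Energy per token = total power / token throughput.
def joules_per_token(static_w, dynamic_w, tokens_per_s):
    return (static_w + dynamic_w) / tokens_per_s

batch1 = joules_per_token(250, 50, 25)      # 12.0 J/token at batch=1
batch64 = joules_per_token(250, 400, 1600)  # ~0.41 J/token at batch=64
print(batch1, round(batch64, 2), round(batch1 / batch64, 1))  # ~29.5x reduction
```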

Strategy 4: Clock Frequency Optimization

# Find the efficiency-optimal clock frequency
# Run a sweep: reduce max clock, measure throughput and power

for FREQ in 1830 1600 1400 1200 1000; do
    nvidia-smi -i 0 -lgc 210,$FREQ  # Lock GPU clock range
    # Run benchmark
    ./benchmark --model llama-70b --batch 64 --tokens 1000 > /dev/null 2>&1
    # Record power and throughput
    nvidia-smi -i 0 -q -d POWER | grep "Power Draw"
    nvidia-smi -i 0 -q -d CLOCK | grep "Graphics"
done
nvidia-smi -i 0 -rgc  # Reset clock to default

# Typical result:
# 1830 MHz: 1,600 tok/s, 680W → 2.35 tok/s/W
# 1600 MHz: 1,480 tok/s, 550W → 2.69 tok/s/W  ← best efficiency
# 1400 MHz: 1,320 tok/s, 440W → 3.00 tok/s/W  ← even better
# 1200 MHz: 1,100 tok/s, 350W → 3.14 tok/s/W  ← diminishing returns
# 1000 MHz:   880 tok/s, 280W → 3.14 tok/s/W  ← plateau

Cluster-Level Power Economics

10,000-GPU Training Cluster

// Cluster specification: 10,000 H100 GPUs
// Server configuration: 1,250 DGX H100 servers (8 GPUs each)
//
// GPU power: 10,000 × 700W = 7 MW
// Server overhead (CPU, memory, fans, PSU losses): ~3 MW
// Total IT power: ~10 MW
//
// Infrastructure (PUE 1.2):
// Cooling: ~1.5 MW
// Power distribution losses: ~0.5 MW
// Total facility power: ~12 MW
//
// Cost analysis:
// Electricity: 12 MW × $0.07/kWh × 8,760 hrs/yr = $7.36 million/year
// Hardware (GPUs): 1,250 × $200,000 = $250 million
// Hardware (networking): ~$50 million
// Datacenter (space, cooling): ~$50 million/year (lease + operations)
//
// 3-year TCO:
// Hardware: $300 million (depreciated over 3 years)
// Electricity: $22 million
// Datacenter: $150 million
// Total: $472 million
//
// Electricity is ~5% of 3-year TCO for GPU clusters
// (Hardware dominates, but electricity scales with duration)
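The 3-year TCO arithmetic above can be packaged into a function, using the same assumed inputs (12 MW facility power, $300M hardware, $50M/year datacenter, $0.07/kWh):

```python
# 3-year TCO in $ millions: hardware + datacenter + electricity.
def three_year_tco_musd(facility_mw, hw_musd, dc_musd_per_year, usd_per_kwh=0.07):
    electricity = facility_mw * 1000 * usd_per_kwh * 8760 * 3 / 1e6
    total = hw_musd + dc_musd_per_year * 3 + electricity
    return total, electricity

total, elec = three_year_tco_musd(12, 300, 50)
print(round(total), round(elec), f"{100 * elec / total:.0f}%")  # electricity ~5%
```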

Power Capping the Fleet

// What if we cap all 10,000 GPUs to 500W instead of 700W?
//
// Performance: ~83% of peak (memory-bound workloads barely affected)
// GPU power: 10,000 × 500W = 5 MW (vs 7 MW)
// Power saved: 2 MW
// Electricity savings: 2 MW × $0.07/kWh × 8,760 = $1.23 million/year
//
// Training time increase: ~20% (for compute-bound training)
// Training cost increase: 20% more GPU-hours
// GPU-hour cost at $2/hr: 20% × 10,000 GPUs × 2,000 training hours × $2 = $8M extra
//
// Net result: save $1.23M electricity, spend $8M more on training time
// Power capping is NOT cost-effective for training if you own the hardware
//
// BUT: if the datacenter cannot provide 10 MW (physical constraint):
// Power capping to 500W enables 40% more GPUs in the same power envelope
// 14,000 GPUs × 500W = 7 MW (same as 10,000 × 700W)
// 14,000 GPUs at 83% ≈ 11,620 "effective GPUs"
// 16.2% more effective compute in the same power budget
Power Capping Is a Physical Constraint Tool, Not a Cost Tool

For training clusters, the economics rarely favor power capping — the opportunity cost of slower training exceeds the electricity savings. However, when the datacenter has a fixed power budget (transformer capacity, utility contract), power capping allows fitting more GPUs and more total compute into the same power envelope. This is the correct framing: power capping trades per-GPU performance for total-cluster performance under a power constraint.
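That framing reduces to one calculation: GPUs that fit in a fixed envelope at a given cap, times the per-GPU performance fraction at that cap (~83% at 500 W, from the numbers above):

```python
# Effective compute under a fixed power envelope with per-GPU power capping.
def gpus_in_envelope(envelope_mw, cap_w):
    """How many GPUs fit when each is capped to cap_w watts."""
    return int(envelope_mw * 1e6 // cap_w)

n = gpus_in_envelope(7, 500)  # 14,000 GPUs in a 7 MW envelope
effective = n * 0.83          # ~11,620 "effective GPUs" vs 10,000 uncapped
print(n, round(effective))
```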

The Inference Power Equation

For inference serving (not training), the economics are different:

// Inference is always-on (24/7/365), unlike training (burst workload)
// Electricity costs compound over years
//
// 1,000 H100 inference cluster:
// Full power (700W): 1,000 × 700W = 700 kW
// Optimized (500W, serving LLM decode): 500 kW
// Performance impact: ~5% throughput reduction (memory-bound decode)
// Electricity savings: 200 kW × $0.07/kWh × 8,760 = $122,640/year
// Throughput cost: ~5% → need ~50 more GPUs = 50 × $30,000 = $1.5M one-time
// Payback: $1.5M / $122,640 = 12.2 years (NOT worth it for electricity alone)
//
// BUT with liquid cooling savings (lower PUE):
// Air-cooled at PUE 1.4: 700 kW × 1.4 = 980 kW facility power
// Liquid-cooled at PUE 1.1 + power cap: 500 kW × 1.1 = 550 kW facility power
// Savings: 430 kW → $264K/year
// Payback for liquid cooling + power cap: ~2-3 years
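The payback arithmetic for the capped inference fleet is worth having as a function, since it flips sign depending on what is counted as savings:

```python
# Payback period for one-time capex recovered through electricity savings.
def payback_years(capex_usd, kw_saved, usd_per_kwh=0.07):
    return capex_usd / (kw_saved * usd_per_kwh * 8760)

# Electricity alone (200 kW saved, $1.5M in extra GPUs): not worth it.
print(round(payback_years(1_500_000, 200), 1))  # ~12.2 years
```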

Future: The Power Wall

What Limits GPU Power Growth

// B200: 1,000 W per GPU, 120 kW per NVL72 rack
// Next generation (hypothetical "R100"): ~1,200-1,500 W per GPU?
//
// Physical limits:
// 1. Power delivery: current VRM technology handles ~1,500W per module
//    Beyond that: need novel power delivery (integrated voltage regulators)
// 2. Cooling: direct-to-chip liquid cooling handles ~2 kW per chip
//    Beyond that: need phase-change cooling or novel materials
// 3. Datacenter infrastructure: most facilities top out at 100-150 kW per rack
//    NVL72 at 120 kW is already at this limit
//    Next generation may need 200+ kW per rack → custom facilities
//
// The industry response:
// - More efficient architectures (domain-specific, less general-purpose waste)
// - Lower precision (FP4, eventually binary/ternary?)
// - Better data reuse (larger on-chip buffers, smarter caching)
// - Chiplet designs (distribute power across multiple smaller dies)
// - New process nodes (2nm, 1.4nm from TSMC) with better Vdd scaling

Summary

GPU power efficiency has improved 127x from P100 to B200 (FP4), driven by tensor cores (10x), lower precision (4x), process shrinks (2x), and architectural improvements (1.5x). Despite this, absolute power consumption has increased from 300W to 1,000W per GPU, because each generation fills the power budget with more transistors.

The practical implications: (1) power capping is a tool for fitting more GPUs into a fixed power budget, not primarily for reducing electricity costs; (2) memory-bound workloads (LLM decode) benefit most from power capping because HBM bandwidth is clock-independent; (3) batching is the most effective power optimization, reducing per-token energy by 30x; (4) liquid cooling is mandatory for current-generation hardware and pays for itself through PUE reduction within 2-3 years; and (5) precision reduction (FP32 to FP8 to FP4) is simultaneously a performance optimization and a power optimization, providing 3-4x energy savings per precision halving.

For datacenter operators, the key equation is not TFLOPS/W per GPU — it is total useful work per megawatt at the facility level, including cooling, power distribution, and infrastructure. Optimizing that metric requires co-design across hardware selection, workload scheduling, power management, and cooling infrastructure.