The B200 consumes 1,000 watts. A single NVL72 rack containing 72 B200 GPUs draws over 120 kW — enough to power 40 average US homes. A 100,000-GPU training cluster requires 100-150 megawatts of power, comparable to a small city, with an annual electricity bill on the order of $15-45 million depending on utilization and local rates. Power is no longer a secondary concern: it is a primary cost driver and a hard physical constraint on system design.
The industry trend is clear: each GPU generation delivers more TFLOPS but at higher absolute power. The H100 at 700W pushed many deployments to liquid cooling as a standard configuration; the B200 at 1,000W makes it effectively mandatory at rack scale. Datacenter operators are running into transformer capacity limits, grid interconnection queues, and physical space constraints for cooling infrastructure.
This post analyzes GPU power consumption at every level — from transistor-level DVFS to rack-level power budgets — and quantifies the performance-per-watt trends that determine the economics of AI computing.
Power Consumption Fundamentals
Dynamic and Static Power
GPU power consumption has two components:
Dynamic power is consumed when transistors switch states:

P_dynamic = α · C · V² · f

Where α is the activity factor (fraction of transistors switching per cycle), C is the total switched capacitance, V is the supply voltage, and f is the clock frequency.
Static power (leakage) flows through transistors even when they are not switching:

P_static = V · I_leakage
Leakage current increases exponentially with temperature and decreases with higher threshold voltage (which also reduces performance).
// Power breakdown for a typical data center GPU (H100):
// Dynamic power: ~70% of TDP (~490W out of 700W)
// - Compute units (SMs): ~45% (~315W)
// - Memory subsystem (HBM, L2): ~15% (~105W)
// - Interconnect (NVLink): ~5% (~35W)
// - Other (clock trees, IO): ~5% (~35W)
//
// Static power (leakage): ~30% of TDP (~210W)
// - Increases with temperature
// - Higher on advanced process nodes (smaller transistors leak more)
// - Present even when GPU is idle but powered on
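The dynamic-power relationship above can be sketched numerically. The activity factor and effective capacitance below are illustrative placeholders, not measured H100 values:

```python
def dynamic_power(alpha, capacitance_f, voltage_v, freq_hz):
    """Switching power: P_dynamic = alpha * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v**2 * freq_hz

# Illustrative (NOT measured) values: 10% activity, 2 uF effective
# switched capacitance, 0.95 V supply, 1.83 GHz boost clock.
p = dynamic_power(0.1, 2e-6, 0.95, 1.83e9)
print(f"~{p:.0f} W dynamic power under these assumptions")
```

Note that voltage enters squared, which is why DVFS (covered below) has such leverage over power draw.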
Dennard Scaling Breakdown
Classical Dennard scaling predicted that as transistors shrink, voltage and current scale proportionally, keeping power density constant. That scaling faltered in the mid-2000s as leakage stopped tracking feature size, and the consequence — steadily rising power density — is visible across modern nodes:
Power Density Across Process Nodes
| Process Node | GPU Example | Die Area | TDP | Power Density (W/mm²) |
|---|---|---|---|---|
| 28nm | K80 (per die) | 561 mm² | 150 W | 0.27 |
| 16nm | P100 | 610 mm² | 300 W | 0.49 |
| 12nm | V100 | 815 mm² | 300 W | 0.37 |
| 7nm | A100 | 826 mm² | 400 W | 0.48 |
| 4nm | H100 | 814 mm² | 700 W | 0.86 |
| 4nm | B200 (total) | ~1,060 mm² | 1,000 W | 0.94 |
Dynamic Voltage and Frequency Scaling (DVFS)
How DVFS Works on GPUs
NVIDIA GPUs continuously adjust voltage and frequency based on workload, temperature, and power limits:
// DVFS operating points (H100, approximate):
// Base clock: 1,095 MHz at ~0.75V
// Boost clock: 1,830 MHz at ~0.95V
// Maximum (overclocked): ~2,000 MHz at ~1.05V
//
// Power scales with V² × f:
// At base (1,095 MHz, 0.75V): P ∝ 0.75² × 1.095 = 0.616 (relative)
// At boost (1,830 MHz, 0.95V): P ∝ 0.95² × 1.830 = 1.652 (relative)
// Ratio: 2.68x more power for 1.67x more frequency
//
// Performance scales linearly with frequency (approximately):
// Performance increase: 1.67x
// Power increase: 2.68x
// Efficiency loss: 1.67 / 2.68 = 0.623 → 37.7% less efficient at boost clock
//
// This is the fundamental tradeoff:
// Higher frequency = more performance but disproportionately more power
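The relative-power arithmetic above is easy to reproduce; a minimal sketch using the approximate H100 operating points:

```python
def relative_power(voltage_v, freq_ghz):
    """Unnormalized relative power, since P is proportional to V^2 * f."""
    return voltage_v**2 * freq_ghz

base = relative_power(0.75, 1.095)   # base clock operating point
boost = relative_power(0.95, 1.830)  # boost clock operating point
print(f"{boost / base:.2f}x power for {1.830 / 1.095:.2f}x frequency")
```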
GPU Boost Behavior
# Monitor GPU clock and power in real-time
nvidia-smi dmon -d 1 -s pc
# pwr clk mclk
# 680 1830 2619 # Full boost: high power, high clock
# 672 1815 2619 # Slight thermal throttle
# 450 1530 2619 # Power-capped: reduced clock to stay under TDP
# 85 210 2619 # Idle: minimal clock
# Query current clock and power limits
nvidia-smi -q -d PERFORMANCE
# Current GPU clock: 1830 MHz
# Max GPU clock: 1830 MHz
# Current memory clock: 2619 MHz
# Power limit: 700 W
# Enforced power limit: 700 W
# Current power draw: 672 W
The Power-Frequency Curve
Since V increases with f (higher voltage is needed to sustain higher frequency), performance per watt decreases at higher clocks. The most efficient operating point is the lowest voltage that sustains a given frequency.
H100 Performance per Watt vs Clock Frequency (% of peak efficiency)

For datacenter deployments where total cost includes electricity, the optimal clock frequency is below the maximum boost. Running at 80% of boost clock typically achieves 90% of peak performance at 70% of peak power — a significantly better perf/watt operating point. NVIDIA provides power capping features to enforce this.
Power Capping
Setting Power Limits
# Set GPU power limit (requires root/admin)
nvidia-smi -i 0 -pl 500
# GPU 0: power limit set to 500 W (default: 700 W)
# The GPU will reduce clock frequency to stay within 500 W
# Performance impact: approximately proportional to sqrt of power reduction
# 500/700 = 71.4% of power → ~85% of performance
# (because P ∝ V²f and performance ∝ f)
# Verify the power limit
nvidia-smi -q -d POWER -i 0
# Power Limit: 500 W
# Default Power Limit: 700 W
# Enforced Power Limit: 500 W
# Min Power Limit: 200 W
# Max Power Limit: 700 W
Power Capping Impact on Workloads
Power Cap Impact on H100 Workload Performance
| Power Cap | GEMM Throughput | LLM Inference (decode) | Memory-Bound Kernel | Efficiency (TFLOPS/W) |
|---|---|---|---|---|
| 700 W (default) | 990 TFLOPS | 100% | 100% | 1.41 |
| 600 W (86%) | 920 TFLOPS | 97% | 99% | 1.53 |
| 500 W (71%) | 820 TFLOPS | 93% | 98% | 1.64 |
| 400 W (57%) | 680 TFLOPS | 85% | 96% | 1.70 |
| 300 W (43%) | 520 TFLOPS | 72% | 93% | 1.73 |
| 200 W (29%) | 340 TFLOPS | 55% | 85% | 1.70 |
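Using the table's GEMM numbers, a small helper can pick the cap that maximizes TFLOPS per watt (the data pairs below are copied from the table above, not measured):

```python
# (cap_watts, gemm_tflops) pairs from the power-cap table
caps = [(700, 990), (600, 920), (500, 820), (400, 680), (300, 520), (200, 340)]

def best_efficiency_cap(caps):
    """Return the (cap, tflops) pair with the highest TFLOPS per watt."""
    return max(caps, key=lambda c: c[1] / c[0])

cap, tflops = best_efficiency_cap(caps)
print(f"{cap} W cap: {tflops / cap:.2f} TFLOPS/W")  # 300 W is the sweet spot
```

Below 300 W, efficiency falls off again: static power becomes a larger fraction of the total while throughput keeps dropping.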
// Programmatic power management via NVML
#include <nvml.h>
#include <stdio.h>

void set_power_limit(unsigned int gpu_id, unsigned int watts) {
    nvmlDevice_t device;
    if (nvmlDeviceGetHandleByIndex(gpu_id, &device) != NVML_SUCCESS) {
        fprintf(stderr, "GPU %u: failed to get device handle\n", gpu_id);
        return;
    }
    // Set power limit (NVML takes milliwatts)
    nvmlReturn_t ret = nvmlDeviceSetPowerManagementLimit(device, watts * 1000);
    if (ret == NVML_SUCCESS) {
        printf("GPU %u: power limit set to %u W\n", gpu_id, watts);
    }
    // Query actual power draw (also reported in milliwatts)
    unsigned int power_mw;
    if (nvmlDeviceGetPowerUsage(device, &power_mw) == NVML_SUCCESS) {
        printf("GPU %u: current power draw: %u mW\n", gpu_id, power_mw);
    }
}

// Set all GPUs to 500W for maximum efficiency
void optimize_fleet_power(int num_gpus) {
    nvmlInit();
    for (int i = 0; i < num_gpus; i++) {
        set_power_limit(i, 500);
    }
    nvmlShutdown();
}
LLM inference decode is memory-bandwidth-bound. The GPU’s compute units are mostly idle, waiting for HBM data. Reducing clock frequency (via power cap) barely affects HBM bandwidth (memory clock is independent of GPU clock). A 30% power reduction may only cause a 5-7% throughput reduction for decode — yielding a massive efficiency improvement. Compute-bound workloads (GEMM, prefill) see more proportional performance loss.
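This asymmetry can be captured in a toy model; the 30% power cut and ~6% throughput loss below are illustrative values consistent with the ranges quoted above:

```python
def decode_efficiency_gain(power_frac, throughput_frac):
    """Relative tokens-per-watt change when capping power.

    power_frac: capped power / full power
    throughput_frac: capped throughput / full throughput
    """
    return throughput_frac / power_frac

# 30% power cut, 6% throughput loss (memory-bound decode)
gain = decode_efficiency_gain(0.70, 0.94)
print(f"~{gain:.2f}x tokens per watt after capping")
```

For a compute-bound workload where throughput tracks power more closely (say 0.85 at 0.70 power), the same formula yields a much smaller gain.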
Performance per Watt Across Generations
The Historical Trend
AI Performance per Watt Across GPU Generations
| GPU | Year | FP8/FP16 TFLOPS | TDP (W) | TFLOPS/W | Improvement |
|---|---|---|---|---|---|
| P100 (FP16) | 2016 | 21.2 | 300 | 0.071 | Baseline |
| V100 (FP16 Tensor) | 2017 | 125 | 300 | 0.417 | 5.9x |
| A100 (FP16 Tensor) | 2020 | 312 | 400 | 0.780 | 11.0x |
| H100 (FP8 Tensor) | 2022 | 1,979 | 700 | 2.827 | 39.8x |
| B200 (FP8 Tensor) | 2024 | 4,500 | 1,000 | 4.500 | 63.4x |
| B200 (FP4 Tensor) | 2024 | 9,000 | 1,000 | 9.000 | 126.8x |
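The TFLOPS/W and improvement columns follow directly from the first two; a quick sketch recomputing them from the table's specs:

```python
# (peak TFLOPS at the listed precision, TDP in watts) from the table above
gpus = {
    "P100 (FP16)": (21.2, 300),
    "V100 (FP16)": (125, 300),
    "A100 (FP16)": (312, 400),
    "H100 (FP8)": (1979, 700),
    "B200 (FP8)": (4500, 1000),
    "B200 (FP4)": (9000, 1000),
}
baseline = 21.2 / 300  # P100 TFLOPS/W

for name, (tflops, tdp) in gpus.items():
    eff = tflops / tdp
    print(f"{name}: {eff:.3f} TFLOPS/W ({eff / baseline:.1f}x vs P100)")
```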
TFLOPS per Watt (FP8-Equivalent) Across Generations (TFLOPS/W)

Where the Efficiency Gains Come From
// Efficiency improvement decomposition (P100 → B200):
//
// 1. Tensor cores: ~10x
// P100: FP32 CUDA cores, 2 FLOPs per core per cycle
// V100+: Tensor cores, 128-1024+ FLOPs per core per cycle
// Same silicon area, 10x more useful work
//
// 2. Lower precision: ~4x
// FP32 (P100) → FP16 (V100) → FP8 (H100) → FP4 (B200)
// Each halving of precision doubles throughput at constant power
// (half the bits = half the data movement + half the multiply energy)
//
// 3. Process node improvement: ~2x
// 16nm (P100) → 12nm (V100) → 7nm (A100) → 4nm (H100/B200)
// Smaller transistors switch with less energy
// ~30-40% energy reduction per node generation
//
// 4. Architecture optimization: ~1.5x
// Better data reuse (larger on-chip buffers)
// TMA (Tensor Memory Accelerator) reduces data movement
// Transformer Engine automates precision selection
//
// Combined: 10 × 4 × 2 × 1.5 ≈ 120x (matches observed 127x)
Datacenter Power and Cooling
Power Hierarchy
// Datacenter power flow for an AI training cluster:
// Utility power (medium voltage) → Transformer → UPS → PDU → Server → GPU
//
// Losses at each stage:
// Transformer: ~2% loss
// UPS (online double-conversion): ~5-8% loss
// PDU (power distribution unit): ~2-3% loss
// Server PSU (AC-DC conversion): ~5-8% loss
// Voltage regulation modules (VRM): ~3-5% loss
//
// Power Usage Effectiveness (PUE):
// PUE = Total facility power / IT equipment power
// Best-in-class datacenter: PUE = 1.1 (10% overhead for cooling + infrastructure)
// Average datacenter: PUE = 1.3-1.5 (30-50% overhead)
//
// For a 100 MW IT load at PUE 1.2:
// Total facility power: 120 MW
// Cooling: ~15 MW
// Infrastructure (lighting, networking, storage): ~5 MW
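The PUE arithmetic is a one-liner; a sketch of the 100 MW example above:

```python
def facility_power(it_load_mw, pue):
    """Total facility power implied by a PUE figure (PUE = facility / IT)."""
    return it_load_mw * pue

total = facility_power(100, 1.2)
overhead = total - 100  # cooling + infrastructure
print(f"total: {total:.0f} MW, non-IT overhead: {overhead:.0f} MW")
```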
Per-Rack Power Budgets
AI Server Rack Power Density
| System | GPUs per Rack | GPU Power | Total Rack Power | Cooling |
|---|---|---|---|---|
| DGX A100 (4 servers) | 32 A100 | 12.8 kW | ~25 kW | Air-cooled |
| DGX H100 (4 servers) | 32 H100 | 22.4 kW | ~40 kW | Liquid-cooled (GPU) |
| NVL72 (single rack) | 72 B200 | 72 kW | ~120 kW | Full liquid cooling |
| TPU v4 pod slice (rack) | 64 TPU v4 | ~11.2 kW | ~20 kW | Liquid-cooled |
Liquid Cooling Requirements
// Air cooling capacity:
// Traditional datacenter rack: 10-20 kW
// High-density air cooling: 30-40 kW (with hot/cold aisle containment)
// Beyond 40 kW: air cooling is impractical (fan power grows prohibitively)
// Liquid cooling options:
// 1. Rear-door heat exchanger (RDHx):
// Water loop attached to rack rear door
// Capacity: 40-80 kW per rack
// No modifications to servers — air inside rack, water at boundary
// Used by: DGX H100 (initial deployments)
// 2. Direct-to-chip (cold plate):
// Water-cooled cold plates bolted to GPU and CPU
// Capacity: 100+ kW per rack
// Requires plumbing inside the server
// Used by: NVL72, DGX GB200
// 3. Immersion cooling:
// Entire server submerged in dielectric fluid
// Capacity: 200+ kW per rack
// Extreme but effective for highest densities
// Used by: some HPC installations, experimental AI clusters
// Cost comparison (per rack):
// Air cooling: ~$5,000-10,000 (fans, ducting, CRAC)
// RDHx: ~$15,000-30,000 (water distribution, heat exchanger)
// Direct-to-chip: ~$30,000-60,000 (plumbing, CDU, manifolds)
// Immersion: ~$50,000-100,000 (tank, fluid, CDU)
The Cooling Efficiency Opportunity
Liquid cooling not only handles more heat — it is more efficient than air cooling:
// Air cooling PUE contribution: ~0.3-0.5 (adds 30-50% power for cooling)
// Water cooling PUE contribution: ~0.05-0.15 (adds 5-15% for cooling)
//
// For a 100 MW IT load:
// Air-cooled datacenter (PUE 1.4): 140 MW total, 40 MW for cooling
// Water-cooled datacenter (PUE 1.1): 110 MW total, 10 MW for cooling
// Savings: 30 MW → at $0.07/kWh = $18.4 million/year
//
// The capital cost of liquid cooling infrastructure is repaid
// by electricity savings within 1-2 years for large AI clusters
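The savings figure above follows directly from the PUE delta; a sketch (assuming $0.07/kWh and 8,760 hours per year, as elsewhere in this post):

```python
def annual_savings_usd(it_load_mw, pue_air, pue_liquid, usd_per_kwh=0.07):
    """Yearly electricity saved by moving from air to liquid cooling."""
    delta_mw = it_load_mw * (pue_air - pue_liquid)
    return delta_mw * 1000 * usd_per_kwh * 8760  # kW * $/kWh * hours/year

print(f"${annual_savings_usd(100, 1.4, 1.1) / 1e6:.1f}M/year saved")
```

Against the per-rack liquid-cooling capital costs listed above, savings at this scale explain the 1-2 year payback claim.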
Power Optimization Strategies
Strategy 1: Workload-Aware Power Capping
# Dynamic power management for inference serving
# Reduce power during low-traffic periods
import pynvml

pynvml.nvmlInit()

def adjust_power_for_load(gpu_id, request_rate, max_rate):
    """Scale the GPU power cap with serving load (NVML limits are in mW)."""
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
    default_limit = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
    if request_rate < max_rate * 0.3:
        # Low load: reduce to 60% power
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, int(default_limit * 0.6))
    elif request_rate < max_rate * 0.7:
        # Medium load: reduce to 80% power
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, int(default_limit * 0.8))
    else:
        # High load: full power
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, default_limit)
Strategy 2: Precision Selection for Efficiency
// Lower precision = fewer bit operations = less power per operation
//
// Energy per operation (approximate):
// FP32 multiply: ~3.7 pJ
// FP16 multiply: ~1.1 pJ
// FP8 multiply: ~0.4 pJ
// FP4 multiply: ~0.15 pJ
// INT8 multiply: ~0.2 pJ
//
// Moving from FP16 to FP8 reduces compute energy by ~2.75x
// Plus: half the data movement (half the bits through memory hierarchy)
// Total energy savings: ~3-4x for FP8 vs FP16
//
// This is why FP8 inference is not just faster — it's fundamentally more
// energy-efficient. The same result with ~3-4x less power per operation.
Strategy 3: Batch Size Optimization
// Batching amortizes fixed power costs (leakage, memory refresh, clock trees)
// across more useful work
//
// H100 inference power breakdown (batch=1):
// Static (leakage + memory refresh): ~250 W (fixed regardless of work)
// Dynamic (compute): ~50 W (very low — GPU mostly idle, waiting on memory)
// Total: ~300 W for 1 token's work
// Energy per token: 300 W / 25 tokens/s = 12 J/token
//
// H100 inference power breakdown (batch=64):
// Static: ~250 W (same)
// Dynamic: ~400 W (compute units heavily utilized)
// Total: ~650 W for 64 tokens' work
// Per-token throughput: ~1,600 tokens/s
// Energy per token: 650 W / 1,600 tokens/s = 0.41 J/token
//
// Batching reduces energy per token by 29x (12 / 0.41)
// This is the single most important power optimization for inference
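The per-token energy arithmetic above is straightforward to reproduce; the wattage and throughput figures below are the illustrative values from the breakdown, not measurements:

```python
def energy_per_token(static_w, dynamic_w, tokens_per_s):
    """Joules per token = total power draw / token throughput."""
    return (static_w + dynamic_w) / tokens_per_s

e1 = energy_per_token(250, 50, 25)      # batch=1: compute mostly idle
e64 = energy_per_token(250, 400, 1600)  # batch=64: compute well utilized
print(f"batch=1: {e1:.1f} J/tok, batch=64: {e64:.2f} J/tok, {e1/e64:.0f}x better")
```

The fixed ~250 W of static power is paid either way; batching simply divides it across far more tokens.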
Energy per Token vs Batch Size (H100, LLaMA-70B FP8) (joules per token; lower is better)

Strategy 4: Clock Frequency Optimization
# Find the efficiency-optimal clock frequency
# Run a sweep: reduce max clock, measure throughput and power
for FREQ in 1830 1600 1400 1200 1000; do
nvidia-smi -i 0 -lgc 210,$FREQ # Lock GPU clock range
# Run benchmark
./benchmark --model llama-70b --batch 64 --tokens 1000 > /dev/null 2>&1
# Record power and throughput
nvidia-smi -i 0 -q -d POWER | grep "Power Draw"
nvidia-smi -i 0 -q -d CLOCK | grep "Graphics"
done
nvidia-smi -i 0 -rgc # Reset clock to default
# Typical result:
# 1830 MHz: 1,600 tok/s, 680W → 2.35 tok/s/W
# 1600 MHz: 1,480 tok/s, 550W → 2.69 tok/s/W ← best efficiency
# 1400 MHz: 1,320 tok/s, 440W → 3.00 tok/s/W ← even better
# 1200 MHz: 1,100 tok/s, 350W → 3.14 tok/s/W ← diminishing returns
# 1000 MHz: 880 tok/s, 280W → 3.14 tok/s/W ← plateau
Cluster-Level Power Economics
10,000-GPU Training Cluster
// Cluster specification: 10,000 H100 GPUs
// Server configuration: 1,250 DGX H100 servers (8 GPUs each)
//
// GPU power: 10,000 × 700W = 7 MW
// Server overhead (CPU, memory, fans, PSU losses): ~3 MW
// Total IT power: ~10 MW
//
// Infrastructure (PUE 1.2):
// Cooling: ~1.5 MW
// Power distribution losses: ~0.5 MW
// Total facility power: ~12 MW
//
// Cost analysis:
// Electricity: 12 MW × $0.07/kWh × 8,760 hrs/yr = $7.36 million/year
// Hardware (GPUs): 1,250 × $200,000 = $250 million
// Hardware (networking): ~$50 million
// Datacenter (space, cooling): ~$50 million/year (lease + operations)
//
// 3-year TCO:
// Hardware: $300 million (depreciated over 3 years)
// Electricity: $22 million
// Datacenter: $150 million
// Total: $472 million
//
// Electricity is ~5% of 3-year TCO for GPU clusters
// (Hardware dominates, but electricity scales with duration)
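The 3-year TCO sum is simple to recompute; a sketch using the figures above ($0.07/kWh, PUE already folded into the 12 MW facility load):

```python
def three_year_tco_musd(hardware_musd, facility_mw, dc_musd_per_year,
                        usd_per_kwh=0.07):
    """3-year TCO in $M: hardware + electricity + datacenter lease/ops."""
    elec_per_year = facility_mw * 1000 * usd_per_kwh * 8760 / 1e6  # $M/year
    return hardware_musd + 3 * elec_per_year + 3 * dc_musd_per_year

tco = three_year_tco_musd(300, 12, 50)
print(f"${tco:.0f}M over 3 years")
```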
Power Capping the Fleet
// What if we cap all 10,000 GPUs to 500W instead of 700W?
//
// Performance: ~83% of peak (memory-bound workloads barely affected)
// GPU power: 10,000 × 500W = 5 MW (vs 7 MW)
// Power saved: 2 MW
// Electricity savings: 2 MW × $0.07/kWh × 8,760 = $1.23 million/year
//
// Training time increase: ~20% (for compute-bound training)
// Training cost increase: 20% more GPU-hours
// GPU-hour cost at $2/hr: 20% × 10,000 GPUs × 2,000 training hours × $2 = $8M extra
//
// Net result: save $1.23M electricity, spend $8M more on training time
// Power capping is NOT cost-effective for training if you own the hardware
//
// BUT: if the datacenter cannot provide 10 MW (physical constraint):
// Power capping to 500W enables 40% more GPUs in the same power envelope
// 14,000 GPUs × 500W = 7 MW (same as 10,000 × 700W)
// 14,000 GPUs at 83% ≈ 11,620 "effective GPUs"
// 16.2% more effective compute in the same power budget
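The fixed-envelope tradeoff can be expressed as a small function; a sketch using the 7 MW GPU power budget from above:

```python
def gpus_in_envelope(envelope_mw, watts_per_gpu, perf_frac):
    """GPU count and effective compute under a fixed GPU power budget."""
    count = int(envelope_mw * 1e6 // watts_per_gpu)
    return count, count * perf_frac

full = gpus_in_envelope(7, 700, 1.0)     # uncapped fleet
capped = gpus_in_envelope(7, 500, 0.83)  # capped to 500 W, ~83% perf each
print(f"capped fleet: {capped[0]} GPUs, ~{capped[1]:.0f} effective GPUs")
```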
For training clusters, the economics rarely favor power capping — the opportunity cost of slower training exceeds the electricity savings. However, when the datacenter has a fixed power budget (transformer capacity, utility contract), power capping allows fitting more GPUs and more total compute into the same power envelope. This is the correct framing: power capping trades per-GPU performance for total-cluster performance under a power constraint.
The Inference Power Equation
For inference serving (not training), the economics are different:
// Inference is always-on (24/7/365), unlike training (burst workload)
// Electricity costs compound over years
//
// 1,000 H100 inference cluster:
// Full power (700W): 1,000 × 700W = 700 kW
// Optimized (500W, serving LLM decode): 500 kW
// Performance impact: ~5% throughput reduction (memory-bound decode)
// Electricity savings: 200 kW × $0.07/kWh × 8,760 = $122,640/year
// Throughput cost: ~5% → need ~50 more GPUs = 50 × $30,000 = $1.5M one-time
// Payback: $1.5M / $122,640 = 12.2 years (NOT worth it for electricity alone)
//
// BUT with liquid cooling savings (lower PUE):
// Air-cooled at PUE 1.4: 700 kW × 1.4 = 980 kW facility power
// Liquid-cooled at PUE 1.1 + power cap: 500 kW × 1.1 = 550 kW facility power
// Savings: 430 kW → $264K/year
// Payback for liquid cooling + power cap: ~2-3 years
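The payback arithmetic above, as a sketch (a simple payback period, ignoring discounting and GPU depreciation):

```python
def payback_years(one_time_cost_usd, annual_savings_usd):
    """Years until cumulative savings cover the one-time cost."""
    return one_time_cost_usd / annual_savings_usd

# Power cap alone: $1.5M in extra GPUs vs $122,640/yr electricity saved
print(f"power cap alone: {payback_years(1_500_000, 122_640):.1f} years")
```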
Future: The Power Wall
What Limits GPU Power Growth
// B200: 1,000 W per GPU, 120 kW per NVL72 rack
// Next generation (hypothetical "R100"): ~1,200-1,500 W per GPU?
//
// Physical limits:
// 1. Power delivery: current VRM technology handles ~1,500W per module
// Beyond that: need novel power delivery (integrated voltage regulators)
// 2. Cooling: direct-to-chip liquid cooling handles ~2 kW per chip
// Beyond that: need phase-change cooling or novel materials
// 3. Datacenter infrastructure: most facilities top out at 100-150 kW per rack
// NVL72 at 120 kW is already at this limit
// Next generation may need 200+ kW per rack → custom facilities
//
// The industry response:
// - More efficient architectures (domain-specific, less general-purpose waste)
// - Lower precision (FP4, eventually binary/ternary?)
// - Better data reuse (larger on-chip buffers, smarter caching)
// - Chiplet designs (distribute power across multiple smaller dies)
// - New process nodes (2nm, 1.4nm from TSMC) with better Vdd scaling
Summary
GPU power efficiency has improved 127x from P100 to B200 (FP4), driven by tensor cores (10x), lower precision (4x), process shrinks (2x), and architectural improvements (1.5x). Despite this, absolute power consumption has increased from 300W to 1,000W per GPU, because each generation fills the power budget with more transistors.
The practical implications: (1) power capping is a tool for fitting more GPUs into a fixed power budget, not primarily for reducing electricity costs; (2) memory-bound workloads (LLM decode) benefit most from power capping because HBM bandwidth is clock-independent; (3) batching is the most effective power optimization, reducing per-token energy by 30x; (4) liquid cooling is mandatory for current-generation hardware and pays for itself through PUE reduction within 2-3 years; and (5) precision reduction (FP32 to FP8 to FP4) is simultaneously a performance optimization and a power optimization, providing 3-4x energy savings per precision halving.
For datacenter operators, the key equation is not TFLOPS/W per GPU — it is total useful work per megawatt at the facility level, including cooling, power distribution, and infrastructure. Optimizing that metric requires co-design across hardware selection, workload scheduling, power management, and cooling infrastructure.