Part 26 of 30 in the series GPU Hardware & AI Accelerators

A single B200 GPU dissipates 1,000 watts. Eight B200s in a DGX rack dissipate 8,000 watts—equivalent to running five residential space heaters at full blast in a box the size of a server chassis. Air cooling can theoretically handle this if you blast 2,000 cubic feet per minute of airflow through the chassis, but the acoustic noise exceeds 85 dB (industrial hearing protection required) and the hot exhaust raises ambient temperature in the datacenter row by 12-15°C, creating cooling cascade failures. This is why NVIDIA’s reference B200 design mandates liquid cooling—not for marketing reasons, but because the thermodynamics of air cooling at 1 kW per device break down in standard rack densities. The question is not whether to use liquid cooling, but which variant: direct-to-chip cold plates, single-phase immersion, or two-phase evaporative immersion.

This post examines the three primary cooling technologies with concrete thermal calculations, infrastructure costs, power usage effectiveness (PUE) impact, and deployment trade-offs for H100 and B200 clusters.

Every GPU generation since Volta has increased TDP. The consequence is that rack-level power density has grown faster than datacenter cooling capacity.


GPU TDP Evolution (Datacenter SKUs)

| GPU | Year | TDP (W) | Architecture | Form Factor | Cooling Method (Reference) |
|-----|------|---------|--------------|-------------|----------------------------|
| V100 SXM2 | 2017 | 300 | Volta | SXM | Air |
| A100 SXM4 | 2020 | 400 | Ampere | SXM | Air |
| H100 SXM5 | 2022 | 700 | Hopper | SXM | Air or Liquid |
| H200 SXM | 2024 | 700 | Hopper | SXM | Air or Liquid |
| B200 SXM | 2024 | 1000 | Blackwell | SXM | Liquid required |
| GB200 NVL72 | 2024 | 2700 (Grace + 2x B200) | Blackwell | NVL rack | Liquid required |
Note: B200 at 1000W per GPU makes air cooling impractical in standard rack densities. GB200 NVL72 ships as a liquid-cooled rack unit.

The trend is clear: NVIDIA’s reference design for Blackwell assumes liquid cooling. Air-cooled B200 variants exist (the B200A at lower clock speeds), but they sacrifice 10-15% of peak performance to stay within the air-cooled thermal envelope.

[Chart: Per-GPU TDP growth for datacenter accelerators — V100 (2017) 300 W, A100 (2020) 400 W, H100 (2022) 700 W, B200 (2024) 1,000 W, GB200 tray (Grace + 2x B200) 2,700 W]

Air Cooling Fundamentals

Air cooling removes heat through forced convection. Fans push ambient air across finned heatsinks attached to the GPU die. The thermal resistance chain is: die surface, thermal interface material (TIM), heatsink base, heatsink fins, airflow boundary layer, exhaust air.

Heat Transfer Equation

The steady-state heat transfer for a heatsink is:

Q = \frac{T_{\text{junction}} - T_{\text{inlet}}}{R_{\text{total}}}

where R_{\text{total}} = R_{\text{TIM}} + R_{\text{heatsink}} + R_{\text{air}}. For a well-designed server heatsink, R_{\text{TIM}} \approx 0.02 K/W, R_{\text{heatsink}} \approx 0.03 K/W, and R_{\text{air}} depends on airflow rate.

The volumetric airflow required to remove QQ watts with a temperature rise ΔT\Delta T is:

\dot{V} = \frac{Q}{\rho \, c_p \, \Delta T}

For air at sea level, \rho \approx 1.2 kg/m^3 and c_p \approx 1005 J/(kg K). With a 15 K rise (inlet 25 C to exhaust 40 C):

\dot{V}_{700\text{W}} = \frac{700}{1.2 \times 1005 \times 15} = 0.0387 \text{ m}^3/\text{s} \approx 82 \text{ CFM per GPU}

For 8 GPUs in a DGX H100: 8 \times 82 = 656 CFM. This is achievable, but the acoustic output exceeds 80 dBA and fan power consumption reaches 1-2 kW per server.

For a B200 at 1000 W:

\dot{V}_{1000\text{W}} = \frac{1000}{1.2 \times 1005 \times 15} = 0.0553 \text{ m}^3/\text{s} \approx 117 \text{ CFM per GPU}

Eight B200s need about 937 CFM per server. This is at the physical limit of what server fans can deliver through a standard-depth rack-mount chassis.

# Air cooling calculation
def airflow_cfm(power_watts, delta_t_kelvin=15.0):
    """Calculate required airflow in CFM for given heat dissipation."""
    rho = 1.2       # kg/m^3, air density at sea level
    cp = 1005.0     # J/(kg*K), specific heat of air
    vol_flow_m3s = power_watts / (rho * cp * delta_t_kelvin)
    cfm = vol_flow_m3s * 2118.88  # Convert m^3/s to CFM
    return cfm

# Per-GPU airflow requirements
for gpu, tdp in [("V100", 300), ("A100", 400), ("H100", 700), ("B200", 1000)]:
    cfm = airflow_cfm(tdp)
    print(f"{gpu}: {tdp}W -> {cfm:.0f} CFM/GPU, "
          f"{cfm * 8:.0f} CFM/server (8-GPU)")
# V100: 300W -> 35 CFM/GPU, 281 CFM/server
# A100: 400W -> 47 CFM/GPU, 375 CFM/server
# H100: 700W -> 82 CFM/GPU, 656 CFM/server
# B200: 1000W -> 117 CFM/GPU, 937 CFM/server

Rack Density Limits with Air Cooling

A standard datacenter rack is provisioned for 10-20 kW. High-density racks go to 30-40 kW. A DGX H100 server consumes approximately 10.2 kW (8 GPUs at 700W + CPUs + networking + fans). Four DGX H100 servers in a rack hit 40.8 kW — the upper limit of air-cooled infrastructure.

A hypothetical DGX B200 server at 8 x 1000W = 8 kW (GPUs alone) plus 2-3 kW overhead reaches 10-11 kW per server. Four servers per rack = 40-44 kW. But the roughly 3,750 CFM of airflow required per rack demands hot-aisle containment, in-row cooling units, and high-static-pressure fans that add substantial infrastructure cost.

⚠️ Air Cooling Wall

Above 40 kW per rack, air cooling requires raised-floor plenums with 6+ inches of static pressure, hot-aisle containment with dedicated CRAC units, and fan speeds that generate 85+ dBA. Many colocation facilities cannot provide this infrastructure. This is the practical air cooling wall for GPU clusters.
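The rack-level version of the earlier airflow equation makes the wall concrete. A minimal sketch, assuming a 15 K air temperature rise and a 5,000 CFM practical ceiling for a standard rack fan wall (the ceiling is an illustrative assumption, not a vendor spec):

```python
def rack_airflow_cfm(rack_kw, delta_t_k=15.0):
    """Required rack airflow in CFM for a given IT load and air temperature rise."""
    rho, cp = 1.2, 1005.0                # air at sea level: kg/m^3, J/(kg*K)
    vol_flow_m3s = (rack_kw * 1000) / (rho * cp * delta_t_k)
    return vol_flow_m3s * 2118.88        # m^3/s -> CFM

PRACTICAL_FAN_LIMIT_CFM = 5000           # assumed ceiling for a standard fan wall

for rack_kw in [20, 40, 60, 80]:
    cfm = rack_airflow_cfm(rack_kw)
    verdict = "feasible" if cfm <= PRACTICAL_FAN_LIMIT_CFM else "beyond air cooling"
    print(f"{rack_kw} kW rack -> {cfm:,.0f} CFM ({verdict})")
```

A 40 kW rack needs roughly 4,700 CFM — right at the assumed limit — while 60 kW needs about 7,000 CFM, which is why densities above 40 kW force a move to liquid.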

Direct-to-Chip Liquid Cooling

Direct-to-chip (D2C) liquid cooling replaces the air-cooled heatsink with a cold plate that circulates liquid coolant directly over the GPU die. The liquid absorbs heat and carries it to a Coolant Distribution Unit (CDU) outside the rack, which transfers heat to the building’s chilled water loop.

Thermal Advantage of Liquid

Water has a volumetric heat capacity approximately 3400x higher than air:

\rho_{\text{water}} \cdot c_{p,\text{water}} = 998 \times 4186 = 4.18 \text{ MJ/(m}^3\text{ K)}

\rho_{\text{air}} \cdot c_{p,\text{air}} = 1.2 \times 1005 = 1.21 \text{ kJ/(m}^3\text{ K)}

This means liquid cooling requires roughly 3400x less volumetric flow rate than air to remove the same heat. In practice, a flow rate of 0.5-1.0 L/min per GPU is sufficient for 1000 W dissipation.

def liquid_flow_lpm(power_watts, delta_t_kelvin=10.0):
    """Calculate water flow rate in liters/min for given heat dissipation."""
    rho = 998.0      # kg/m^3, water density
    cp = 4186.0      # J/(kg*K), specific heat of water
    mass_flow = power_watts / (cp * delta_t_kelvin)  # kg/s
    vol_flow = mass_flow / rho  # m^3/s
    lpm = vol_flow * 60000  # Convert to liters/min
    return lpm

for gpu, tdp in [("H100", 700), ("B200", 1000), ("GB200 tray", 2700)]:
    flow = liquid_flow_lpm(tdp)
    print(f"{gpu}: {tdp}W -> {flow:.2f} L/min per module")
# H100: 700W -> 1.01 L/min
# B200: 1000W -> 1.44 L/min
# GB200 tray: 2700W -> 3.88 L/min

Cold Plate Design

The cold plate is a copper or aluminum block with internal microchannels (typically 0.2-0.5 mm wide) that maximize surface area contact with the coolant. Key specifications:

Material:         Copper (k = 385 W/m*K)
Channel width:    0.3 mm
Channel depth:    2.0 mm
Number of channels: 80-120
Thermal resistance: 0.01-0.02 K/W (cold plate only)
Pressure drop:    20-50 kPa at 1 L/min

The total thermal resistance from die to coolant is:

R_{\text{total}} = R_{\text{TIM}} + R_{\text{plate}} = 0.02 + 0.015 = 0.035 \text{ K/W}

For a B200 at 1000 W with coolant inlet at 35 C:

T_{\text{junction}} = T_{\text{coolant}} + Q \times R_{\text{total}} = 35 + 1000 \times 0.035 = 70 \text{ C}

This is well below the 83 C throttling threshold, giving 13 C of thermal headroom. Compare this to air cooling, where junction temperatures routinely hit 78-82 C under sustained load.
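The same formula shows why 1,000 W is out of reach for air. A quick comparison, using 0.035 K/W for the cold plate path as above and assuming roughly 0.08 K/W for a typical air heatsink path:

```python
def junction_temp_c(power_w, r_total_k_per_w, inlet_c):
    """Steady-state junction temperature: T_j = T_inlet + Q * R_total."""
    return inlet_c + power_w * r_total_k_per_w

THROTTLE_ONSET_C = 83.0  # NVIDIA datacenter GPU throttle threshold

for label, r_total, inlet in [
    ("Air heatsink (0.08 K/W, 25 C inlet)", 0.08, 25.0),
    ("Cold plate (0.035 K/W, 35 C coolant)", 0.035, 35.0),
]:
    tj = junction_temp_c(1000, r_total, inlet)
    print(f"{label}: Tj = {tj:.0f} C, "
          f"headroom to throttle = {THROTTLE_ONSET_C - tj:+.0f} C")
```

At 1,000 W the air path lands near 105 C — past the shutdown threshold — while the cold plate sits at 70 C, even though its coolant enters 10 C warmer than the air.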


Thermal Performance: Air vs Liquid Cold Plate (H100 SXM at 700W)

| Metric | Air Cooled | Liquid Cooled | Advantage |
|--------|------------|---------------|-----------|
| Junction temperature (sustained) | 80-83 C | 60-65 C | 15-20 C lower |
| Thermal resistance (die to ambient) | 0.08 K/W | 0.035 K/W | 2.3x lower |
| Fan/pump power overhead | 150-300 W | 20-50 W | 3-6x lower |
| Acoustic output per server | 80-85 dBA | 45-55 dBA | 25-35 dBA lower |
| Max rack density | 40 kW | 100+ kW | 2.5x higher |
| GPU clock speed (thermal throttle) | Base to -5% | Boost +3-5% | 8-10% effective gain |
Note: Liquid cooling enables higher sustained clock speeds because the GPU spends less time near its thermal limit. The 3-5% boost clock improvement translates directly to higher FLOPS.

Coolant Distribution Unit (CDU) Architecture

The CDU is the heat exchanger between the server-side coolant loop and the facility chilled water. A typical CDU for a 200 kW rack:

Primary loop (server side):
  Coolant:      Propylene glycol/water mix (30/70)
  Flow rate:    40-80 L/min per rack
  Supply temp:  30-40 C
  Return temp:  45-55 C
  Pressure:     200-400 kPa

Secondary loop (facility side):
  Coolant:      Chilled water
  Flow rate:    60-120 L/min per rack
  Supply temp:  7-15 C
  Return temp:  15-25 C

The CDU must handle transient thermal loads. When a training job launches on an idle cluster, GPU power consumption ramps from idle (~50 W) to full TDP (700-1000 W) within seconds. The CDU’s control loop must increase pump speed and adjust valve positions to maintain stable coolant temperature during this ramp.

class CDUController:
    """Simplified CDU control loop for rack-level liquid cooling."""

    def __init__(self, max_pump_lpm=80, target_supply_c=35.0):
        self.max_pump_lpm = max_pump_lpm
        self.target_supply_c = target_supply_c
        self.current_pump_lpm = 20.0  # Idle flow

    def compute_required_flow(self, total_power_kw, delta_t_target=12.0):
        """Calculate pump flow rate for given rack power."""
        # Q = m_dot * cp * delta_T
        # m_dot = Q / (cp * delta_T)
        cp = 3900.0  # J/(kg*K), 30% propylene glycol mix
        rho = 1030.0  # kg/m^3
        mass_flow = (total_power_kw * 1000) / (cp * delta_t_target)
        vol_flow_lpm = (mass_flow / rho) * 60000
        return min(vol_flow_lpm, self.max_pump_lpm)

    def update(self, total_power_kw, return_temp_c):
        """PID-like control step."""
        required_flow = self.compute_required_flow(total_power_kw)
        # Ramp pump speed toward required flow
        ramp_rate = 5.0  # L/min per control cycle
        if required_flow > self.current_pump_lpm:
            self.current_pump_lpm = min(
                self.current_pump_lpm + ramp_rate, required_flow
            )
        else:
            self.current_pump_lpm = max(
                self.current_pump_lpm - ramp_rate, required_flow
            )
        return self.current_pump_lpm

# Example: rack ramping from idle (5 kW) to full load (56 kW)
cdu = CDUController()
for power_kw in [5, 20, 40, 56]:
    flow = cdu.update(power_kw, 45.0)
    print(f"Rack power: {power_kw} kW -> Pump: {flow:.1f} L/min")
ℹ️ Leak Detection Is Non-Negotiable

Every liquid cooling deployment requires leak detection sensors at cold plate connections, manifold joints, and CDU internals. A single leak can destroy an entire server. Enterprise systems use conductive fluid sensors on drip trays under each server sled, with automatic pump shutoff within 500 ms of detection.
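A toy sketch of that shutoff logic — the sensor and pump interfaces here are hypothetical stand-ins for real hardware drivers. The key design point is that the poll interval must sit well under the 500 ms shutoff budget:

```python
class LeakMonitor:
    """Sketch of a leak-detection loop (interfaces are hypothetical): any wet
    conductive sensor triggers pump shutoff on the next poll cycle, so the
    poll interval must be well under the 500 ms shutoff budget."""

    def __init__(self, read_sensors, shutoff_pumps, poll_interval_s=0.1):
        self.read_sensors = read_sensors    # callable -> {sensor_id: wet?}
        self.shutoff_pumps = shutoff_pumps  # callable(list of wet sensor ids)
        self.poll_interval_s = poll_interval_s
        self.tripped = False

    def poll_once(self):
        """One control cycle: check all drip-tray sensors, trip once if any are wet."""
        wet = sorted(s for s, is_wet in self.read_sensors().items() if is_wet)
        if wet and not self.tripped:
            self.shutoff_pumps(wet)
            self.tripped = True
        return wet

# Simulated deployment: sled 3's drip tray goes wet on the second poll
readings = iter([
    {"sled1": False, "sled3": False},
    {"sled1": False, "sled3": True},
])
shutoff_events = []
mon = LeakMonitor(lambda: next(readings), shutoff_events.append)
mon.poll_once()
mon.poll_once()
print(shutoff_events)  # [['sled3']]
```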

Single-Phase Immersion Cooling

In single-phase immersion, servers are submerged in a dielectric fluid that remains liquid throughout the cooling process. The fluid absorbs heat from all components simultaneously — GPUs, CPUs, VRMs, memory, NVLink bridges — eliminating the need for individual cold plates.

Dielectric Fluid Properties

Common fluids include synthetic hydrocarbons (3M Novec, Shell Immersion), mineral oils, and engineered fluids. Key properties:


Dielectric Fluid Properties Comparison

| Property | Air | Water | 3M Novec 7100 | Mineral Oil | Shell S5 X |
|----------|-----|-------|---------------|-------------|------------|
| Thermal conductivity (W/m K) | 0.026 | 0.6 | 0.069 | 0.14 | 0.14 |
| Specific heat (J/kg K) | 1005 | 4186 | 1183 | 1670 | 1950 |
| Density (kg/m^3) | 1.2 | 998 | 1510 | 850 | 820 |
| Boiling point (C) | N/A | 100 | 61 | 300+ | 300+ |
| Dielectric strength (kV/mm) | 3 | N/A | 40 | 25 | 30 |
| Viscosity (mPa s at 25 C) | 0.018 | 0.89 | 0.58 | 20-30 | 8.5 |
| GWP (Global Warming Potential) | N/A | N/A | 297 | 0 | 0 |
Note: Novec fluids have high GWP and are being phased out. Hydrocarbon-based fluids (mineral oil, Shell S5 X) have zero GWP but lower thermal conductivity.

Tank Design and Flow Patterns

An immersion tank holds 4-20 server trays submerged vertically or horizontally. Natural convection drives fluid circulation: heated fluid rises from GPU surfaces, reaches the top of the tank, flows across a heat exchanger, cools, and sinks back down. Forced convection (pumps) augments natural convection for higher power densities.

Single-Phase Immersion Tank (typical 100 kW):
  Tank dimensions:    1200 x 600 x 800 mm (L x W x H)
  Fluid volume:       ~400 liters
  Server capacity:    8-12 server trays
  Heat exchanger:     Plate-type, top-mounted
  Flow pattern:       Bottom-up natural convection + top pump
  Coolant supply:     Facility chilled water to heat exchanger
  Max power density:  100 kW per tank (250 kW with forced flow)

The heat transfer coefficient for natural convection in dielectric fluid is:

h = C \cdot (T_{\text{surface}} - T_{\text{fluid}})^{0.25}

where C depends on geometry and fluid properties, typically 50-200 W/(m^2 K) for natural convection in hydrocarbons. For a GPU with a 50 cm^2 exposed surface at 80 C in 40 C fluid:

Q = h \cdot A \cdot \Delta T = 150 \times 0.005 \times 40 = 30 \text{ W}

This is far too low for a 700 W GPU. In practice, immersion relies on the large total wetted surface area of the entire PCB (both sides), VRM heatsinks, and memory modules — plus forced convection from pumps.

def immersion_heat_transfer(
    h_coeff: float,       # W/(m^2*K), convection coefficient
    total_area_m2: float, # Total wetted surface area
    t_surface: float,     # Component surface temperature (C)
    t_fluid: float        # Bulk fluid temperature (C)
) -> float:
    """Calculate heat removal in watts for immersed components."""
    return h_coeff * total_area_m2 * (t_surface - t_fluid)

# Entire server board (both sides) with forced convection
# h ~ 500-1500 W/(m^2*K) with turbulent forced flow
total_area = 0.15   # m^2, total PCB + component surface area
h_forced = 800      # W/(m^2*K), forced convection in hydrocarbon
t_surface = 75      # C, average component temperature
t_fluid = 40        # C, bulk fluid temperature

q_total = immersion_heat_transfer(h_forced, total_area, t_surface, t_fluid)
print(f"Heat removal: {q_total:.0f} W")  # 4200 W -- enough for full server
Immersion Eliminates Hot Spots

Air-cooled servers have thermal gradients of 20-30 C between inlet-side and exhaust-side components. In immersion, the fluid temperature is nearly uniform because the fluid’s thermal mass dampens local hot spots. This means all GPUs in a server run at similar temperatures, eliminating the “last GPU is hottest” problem that causes thermal throttling in air-cooled DGX systems.

Two-Phase Immersion Cooling

Two-phase immersion uses a low-boiling-point dielectric fluid (typically boiling at 49-61 C) that vaporizes on contact with hot components. The phase change absorbs latent heat — approximately 100-150 kJ/kg for engineered fluids — providing extremely efficient cooling.

Phase Change Physics

The latent heat of vaporization provides a massive thermal buffer:

Q_{\text{latent}} = \dot{m} \times h_{fg}

where h_{fg} is the latent heat of vaporization. For 3M Novec 7100, h_{fg} = 112 kJ/kg. To dissipate 700 W:

\dot{m} = \frac{700}{112000} = 0.00625 \text{ kg/s} = 6.25 \text{ g/s}

The vapor rises to a condenser coil at the top of the tank, where it condenses back to liquid and drips down. This creates a self-regulating cycle: hotter components generate more vapor and therefore receive more cooling.

def two_phase_flow_rate(power_watts, latent_heat_j_per_kg):
    """Calculate required fluid vaporization rate."""
    mass_flow_kg_s = power_watts / latent_heat_j_per_kg
    return mass_flow_kg_s * 1000  # g/s

# Different fluids
fluids = {
    "Novec 7100": 112000,   # J/kg
    "Novec 649":  88000,
    "FC-72":      88000,
    "Water (reference)": 2260000,
}

for fluid, hfg in fluids.items():
    rate = two_phase_flow_rate(700, hfg)
    print(f"{fluid}: {rate:.2f} g/s to cool 700W")
# Novec 7100: 6.25 g/s
# Novec 649: 7.95 g/s
# FC-72: 7.95 g/s
# Water (reference): 0.31 g/s

Practical Challenges

Two-phase immersion faces deployment challenges that limit adoption:

  1. Fluid loss: Vapor escaping the tank during maintenance (opening the lid) represents direct fluid loss. Novec 7100 costs $60-100/L. A 400 L tank at $80/L holds $32,000 of fluid alone. Losing 1% per maintenance event adds up.

  2. Condenser sizing: The condenser must handle peak vapor generation from all GPUs simultaneously. Undersized condensers allow vapor to accumulate and pressurize the tank.

  3. Non-condensable gas management: Air ingress during maintenance dissolves in the fluid and later comes out of solution as bubbles, reducing heat transfer effectiveness.
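The first challenge is easy to quantify. Using the 400 L tank and $80/L figures above, and assuming (for illustration) one maintenance event per month:

```python
def fluid_loss_cost(tank_liters=400, cost_per_liter=80.0,
                    loss_fraction_per_event=0.01, events_per_year=12, years=5):
    """Cumulative fluid top-off cost from vapor loss at each maintenance event.
    events_per_year=12 is an assumed maintenance cadence, not a measured figure."""
    liters_lost = tank_liters * loss_fraction_per_event * events_per_year * years
    return liters_lost, liters_lost * cost_per_liter

liters, cost = fluid_loss_cost()
print(f"{liters:.0f} L lost over 5 years -> ${cost:,.0f} in fluid top-off")
```

At 1% loss per event, five years of monthly maintenance evaporates 240 L — roughly $19,000 of fluid, a meaningful fraction of the tank's initial fill cost.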

Cooling Solution Cost per Rack (100 kW rack, 5-year TCO)

| Cooling Solution | 5-Year TCO (USD thousands) | Notes |
|------------------|----------------------------|-------|
| Air cooling (CRAC + containment) | 85 | |
| Direct-to-chip liquid | 65 | Lowest TCO |
| Single-phase immersion | 95 | |
| Two-phase immersion | 140 | Fluid cost dominates |

Rack Density and PUE Impact

The choice of cooling technology directly determines rack density (kW per rack) and Power Usage Effectiveness (PUE).


Rack Density and PUE by Cooling Method

| Cooling Method | Max Rack Density (kW) | Typical PUE | Cooling Overhead | Best For |
|----------------|-----------------------|-------------|------------------|----------|
| Air (standard) | 15-20 | 1.4-1.6 | 30-40% of IT load | Small clusters, edge |
| Air (high-density) | 30-40 | 1.3-1.5 | 25-35% | A100 clusters |
| Direct-to-chip liquid | 60-120 | 1.1-1.2 | 8-15% | H100/B200 clusters |
| Single-phase immersion | 80-150 | 1.03-1.10 | 3-8% | Dense GPU racks |
| Two-phase immersion | 100-200 | 1.02-1.06 | 2-5% | Maximum density |
Note: PUE = Total facility power / IT equipment power. PUE of 1.0 means zero cooling overhead. Industry average is ~1.58.

The PUE improvement from air (1.4) to liquid (1.1) saves significant operating cost. For a 10 MW datacenter:

\text{Cooling power (air)} = 10 \text{ MW} \times (1.4 - 1.0) = 4 \text{ MW}

\text{Cooling power (liquid)} = 10 \text{ MW} \times (1.1 - 1.0) = 1 \text{ MW}

\text{Annual savings} = 3 \text{ MW} \times 8760 \text{ hrs} \times \$0.08/\text{kWh} \approx \$2.1\text{M/year}

def annual_cooling_cost(it_power_mw, pue, electricity_rate_per_kwh=0.08):
    """Calculate annual cooling electricity cost."""
    cooling_power_mw = it_power_mw * (pue - 1.0)
    annual_kwh = cooling_power_mw * 1000 * 8760
    return annual_kwh * electricity_rate_per_kwh

# Compare cooling methods for 10 MW IT load
for method, pue in [("Air", 1.4), ("D2C Liquid", 1.1), ("Immersion", 1.05)]:
    cost = annual_cooling_cost(10.0, pue)
    print(f"{method} (PUE {pue}): ${cost:,.0f}/year cooling cost")
# Air (PUE 1.4): $2,803,200/year cooling cost
# D2C Liquid (PUE 1.1): $876,000/year
# Immersion (PUE 1.05): $438,000/year

GPU Thermal Throttling and Performance Impact

GPUs implement dynamic thermal management that reduces clock speed when junction temperature exceeds a threshold (typically 83 C for NVIDIA datacenter GPUs). The throttling curve is approximately linear between the throttle onset temperature and the shutdown temperature (typically 95 C).

// Simplified GPU thermal throttling model
struct ThermalThrottler {
    float throttle_onset_c = 83.0f;   // Start reducing clocks
    float shutdown_c = 95.0f;         // Emergency shutdown
    float base_clock_mhz = 1410.0f;  // H100 base clock
    float boost_clock_mhz = 1620.0f; // H100 boost clock

    float effective_clock(float junction_temp_c) {
        if (junction_temp_c < throttle_onset_c) {
            return boost_clock_mhz;  // Full boost
        }
        if (junction_temp_c >= shutdown_c) {
            return 0.0f;  // Shutdown
        }
        // Linear throttle between onset and shutdown
        float throttle_fraction =
            (junction_temp_c - throttle_onset_c) /
            (shutdown_c - throttle_onset_c);
        float min_clock = base_clock_mhz * 0.7f;  // 70% of base
        return boost_clock_mhz -
               throttle_fraction * (boost_clock_mhz - min_clock);
    }
};

Training Throughput vs Cooling Method (8x H100 SXM, Llama 70B)

| Cooling Method | Sustained Junction Temp | Throughput (tokens/sec) | vs Air |
|----------------|-------------------------|-------------------------|--------|
| Air | 82 C (intermittent throttling) | 4,200 | baseline |
| D2C liquid | 62 C | 4,650 | +10.7% |
| Immersion | 55 C | 4,750 | +13.1% |

The 10-13% throughput improvement from liquid cooling is not just from avoiding throttling. Lower junction temperatures also improve transistor switching characteristics, reducing gate delay and allowing the GPU to sustain higher boost clocks without voltage increases.

Infrastructure Requirements

Direct-to-Chip Liquid Cooling Infrastructure

Per-rack requirements:
  CDU:              1x rear-door or side-car CDU, 80-120 kW capacity
  Manifolds:        Supply and return manifolds per rack
  Quick disconnects: Dripless QDs at each server sled
  Leak detection:   Conductive tape sensors under each sled
  Facility water:   15-20 C chilled water supply, 40-80 L/min per rack

Per-server requirements:
  Cold plates:      1 per GPU (8 per DGX), 1 per CPU (2 per server)
  Hoses:            Flexible tubing from cold plates to manifold
  Flow balancing:   Orifice or valve per cold plate branch

Immersion Tank Infrastructure

Per-tank requirements:
  Tank:             Sealed steel/aluminum enclosure, 400-600L capacity
  Heat exchanger:   Internal plate HX or external CDU
  Fluid:            400-600L dielectric fluid ($20,000-$50,000)
  Pump (single-phase): Submersible or external, 20-60 L/min
  Condenser (two-phase): Roof-mounted or in-tank condenser coils
  Fluid management: Filtration, dehumidification, top-off system
⚠️ Maintenance Complexity

Immersion cooling complicates server maintenance. Removing a server sled requires draining or displacing fluid, waiting for the board to drip-dry (dielectric fluid is non-conductive but coats all surfaces), and handling fluid-slick components. Average hot-swap time increases from 5 minutes (air-cooled) to 30-45 minutes (immersion). For large clusters with frequent hardware failures, this maintenance overhead is significant.

Decision Framework

Choosing a cooling technology depends on cluster scale, rack density requirements, facility constraints, and operational maturity.


Cooling Technology Decision Matrix

| Factor | Air | D2C Liquid | Single-Phase Immersion | Two-Phase Immersion |
|--------|-----|------------|------------------------|---------------------|
| CapEx per rack | $5-10K | $15-25K | $30-60K | $50-100K |
| OpEx (5yr, 100kW rack) | $175K | $44K | $22K | $15K |
| Max GPU TDP supported | 500W | 1500W+ | 1500W+ | 2000W+ |
| Maintenance complexity | Low | Medium | High | Very High |
| Retrofit to existing DC | N/A | Medium effort | Major renovation | Major renovation |
| Maturity (2025) | Decades | Production-ready | Early production | Pilot stage |
| Best GPU generation | A100 and below | H100/B200 | H100/B200/GB200 | Future >1kW GPUs |
Note: D2C liquid cooling offers the best balance of performance, cost, and operational simplicity for current-generation GPU clusters.

For most organizations deploying H100 or B200 clusters in 2025, direct-to-chip liquid cooling is the recommended path. It delivers 90% of immersion’s thermal benefits at 40% of the infrastructure cost, with established supply chains from vendors like CoolIT, Asetek, and Vertiv. Immersion makes sense for purpose-built facilities optimizing for maximum density and minimum PUE, but the operational complexity and fluid costs limit adoption to hyperscalers and specialized HPC centers.
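As a rough first pass, the decision matrix collapses into a few threshold checks. This is a sketch: the cutoffs are assumptions distilled from the table above, not vendor guidance:

```python
def recommend_cooling(gpu_tdp_w, rack_kw):
    """First-pass cooling pick from assumed thresholds: air tops out near
    500 W/GPU and 40 kW/rack; D2C liquid near 120 kW/rack; single-phase
    immersion near 150 kW/rack; beyond that, two-phase immersion."""
    if gpu_tdp_w <= 500 and rack_kw <= 40:
        return "air"
    if rack_kw <= 120:
        return "direct-to-chip liquid"
    if rack_kw <= 150:
        return "single-phase immersion"
    return "two-phase immersion"

for gpu, tdp, rack in [("A100", 400, 30), ("H100", 700, 60),
                       ("B200", 1000, 100), ("future >1kW GPU", 1500, 160)]:
    print(f"{gpu} ({tdp} W GPU, {rack} kW rack): {recommend_cooling(tdp, rack)}")
```

This reproduces the matrix's bottom row: A100-class stays on air, H100/B200 clusters land on direct-to-chip liquid, and only very dense future racks justify immersion.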

The GB200 NVL72 Reference Design

NVIDIA’s GB200 NVL72 represents the industry’s direction. It ships as a pre-integrated liquid-cooled rack containing 36 Grace CPUs and 72 Blackwell GPUs, consuming up to 120 kW per rack. The cooling system is not an afterthought — it is integral to the product design.

GB200 NVL72 Cooling Specifications:
  Total rack power:     120 kW
  Cooling method:       Direct-to-chip liquid (all GPUs and CPUs)
  Coolant:              Propylene glycol/water
  CDU:                  Integrated rear-door CDU
  Facility water req:   25-30 C supply, 180+ L/min
  Redundancy:           N+1 pump, N+1 CDU
  PUE contribution:     ~1.05 (cooling only)

This design eliminates the cooling technology decision for customers: if you buy GB200 NVL72, you get liquid cooling. The rack arrives with plumbing pre-installed. The only facility requirement is chilled water supply at the specified flow rate and temperature.
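Those facility figures can be sanity-checked with the same heat-balance formula used throughout: 120 kW into 180 L/min of water implies roughly a 10 K temperature rise on the facility loop, consistent with the listed supply and return temperatures.

```python
def water_delta_t(power_kw, flow_lpm):
    """Facility-water temperature rise for a given heat load and flow rate."""
    rho, cp = 998.0, 4186.0                # water: kg/m^3, J/(kg*K)
    mass_flow = (flow_lpm / 60000) * rho   # L/min -> m^3/s -> kg/s
    return (power_kw * 1000) / (mass_flow * cp)

dt = water_delta_t(120, 180)
print(f"120 kW at 180 L/min -> dT = {dt:.1f} K")  # ~9.6 K rise
```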

This is likely the model for future GPU platforms. As TDP continues to climb toward 1500-2000 W per accelerator, liquid cooling transitions from optional to mandatory, and the distinction between “server” and “cooling system” disappears.

Summary

GPU cooling has evolved from a solved problem (bolt on a heatsink, point a fan at it) to a critical infrastructure decision that determines cluster density, operating cost, and even GPU performance. Air cooling hits its practical limit around 40 kW per rack and 500 W per GPU. Direct-to-chip liquid cooling extends the range to 120 kW per rack and 1500+ W per GPU while reducing PUE from 1.4 to 1.1. Immersion cooling pushes further but at higher complexity and cost. For current-generation AI clusters, direct-to-chip liquid cooling is the sweet spot — and it is rapidly becoming the default, not the exception.