A single B200 GPU dissipates 1,000 watts. Eight B200s in a single DGX server dissipate 8,000 watts—equivalent to running five residential space heaters at full blast in a box the size of a server chassis. Air cooling can theoretically handle this if you blast 2,000 cubic feet per minute of airflow through the chassis, but the acoustic noise exceeds 85 dB (industrial hearing protection required) and the hot exhaust raises ambient temperature in the datacenter row by 12-15°C, risking cascading cooling failures. This is why NVIDIA’s reference B200 design mandates liquid cooling—not for marketing reasons, but because the thermodynamics of air cooling at 1 kW per device break down at standard rack densities. The question is not whether to use liquid cooling, but which variant: direct-to-chip cold plates, single-phase immersion, or two-phase evaporative immersion.
This post examines the three primary cooling technologies with concrete thermal calculations, infrastructure costs, power usage effectiveness (PUE) impact, and deployment trade-offs for H100 and B200 clusters.
Thermal Design Power Trends
Every GPU generation since Volta has increased TDP. The consequence is that rack-level power density has grown faster than datacenter cooling capacity.
GPU TDP Evolution (Datacenter SKUs)
| GPU | Year | TDP (W) | Architecture | Form Factor | Cooling Method (Reference) |
|---|---|---|---|---|---|
| V100 SXM2 | 2017 | 300 | Volta | SXM | Air |
| A100 SXM4 | 2020 | 400 | Ampere | SXM | Air |
| H100 SXM5 | 2022 | 700 | Hopper | SXM | Air or Liquid |
| H200 SXM | 2024 | 700 | Hopper | SXM | Air or Liquid |
| B200 SXM | 2024 | 1000 | Blackwell | SXM | Liquid required |
| GB200 NVL72 | 2024 | 2700 (Grace+2xB200) | Blackwell | NVL rack | Liquid required |
The trend is clear: NVIDIA’s reference design for Blackwell assumes liquid cooling. Air-cooled B200 variants exist (the B200A at lower clock speeds), but they sacrifice 10-15% of peak performance to stay within the air-cooled thermal envelope.
(Chart: Per-GPU TDP growth for datacenter accelerators, in watts.)
Air Cooling Fundamentals
Air cooling removes heat through forced convection. Fans push ambient air across finned heatsinks attached to the GPU die. The thermal resistance chain is: die surface, thermal interface material (TIM), heatsink base, heatsink fins, airflow boundary layer, exhaust air.
Heat Transfer Equation
The steady-state heat transfer for a heatsink is:

Q = (T_junction - T_ambient) / R_total

where R_total = R_TIM + R_sink + R_conv. For a well-designed server heatsink, R_TIM is roughly 0.01-0.02 K/W, R_sink is roughly 0.03-0.05 K/W, and R_conv depends on airflow rate.
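The resistance chain can be sketched numerically. The resistance values below are illustrative estimates (summing to roughly 0.08 K/W die-to-ambient), not vendor specifications:

```python
def junction_temp(power_w, t_ambient_c, r_tim=0.015, r_sink=0.035, r_conv=0.03):
    """T_junction = T_ambient + Q * (R_TIM + R_sink + R_conv).
    Resistance values are illustrative, summing to ~0.08 K/W."""
    return t_ambient_c + power_w * (r_tim + r_sink + r_conv)

for gpu, tdp in [("H100", 700), ("B200", 1000)]:
    print(f"{gpu}: {tdp} W -> Tj = {junction_temp(tdp, 25.0):.0f} C")
# H100: 700 W -> Tj = 81 C
# B200: 1000 W -> Tj = 105 C
```

At 0.08 K/W of total resistance, a B200 at a 25 C inlet lands around 105 C, past the typical shutdown point. That is the air-cooling problem in a single line.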
The volumetric airflow V̇ required to remove Q watts with a temperature rise ΔT is:

V̇ = Q / (ρ × c_p × ΔT)

For air at sea level, ρ ≈ 1.2 kg/m³ and c_p ≈ 1005 J/(kg·K). With a 15 K rise (inlet 25 C to exhaust 40 C), an H100 at 700 W requires:

V̇ = 700 / (1.2 × 1005 × 15) ≈ 0.039 m³/s ≈ 82 CFM

For 8 GPUs in a DGX H100: roughly 656 CFM. This is achievable but the acoustic output exceeds 80 dBA and the fan power consumption reaches 1-2 kW per server.

For a B200 at 1000 W:

V̇ = 1000 / (1.2 × 1005 × 15) ≈ 0.055 m³/s ≈ 117 CFM

Eight B200s need roughly 937 CFM per server. This is at the physical limit of what 1U-2U fans can deliver within a standard 42U rack depth.
# Air cooling calculation
def airflow_cfm(power_watts, delta_t_kelvin=15.0):
"""Calculate required airflow in CFM for given heat dissipation."""
rho = 1.2 # kg/m^3, air density at sea level
cp = 1005.0 # J/(kg*K), specific heat of air
vol_flow_m3s = power_watts / (rho * cp * delta_t_kelvin)
cfm = vol_flow_m3s * 2118.88 # Convert m^3/s to CFM
return cfm
# Per-GPU airflow requirements
for gpu, tdp in [("V100", 300), ("A100", 400), ("H100", 700), ("B200", 1000)]:
cfm = airflow_cfm(tdp)
print(f"{gpu}: {tdp}W -> {cfm:.0f} CFM/GPU, "
f"{cfm * 8:.0f} CFM/server (8-GPU)")
# V100: 300W -> 35 CFM/GPU, 281 CFM/server
# A100: 400W -> 47 CFM/GPU, 375 CFM/server
# H100: 700W -> 82 CFM/GPU, 656 CFM/server
# B200: 1000W -> 117 CFM/GPU, 937 CFM/server
Rack Density Limits with Air Cooling
A standard datacenter rack is provisioned for 10-20 kW. High-density racks go to 30-40 kW. A DGX H100 server consumes approximately 10.2 kW (8 GPUs at 700W + CPUs + networking + fans). Four DGX H100 servers in a rack hit 40.8 kW — the upper limit of air-cooled infrastructure.
A DGX B200 server at 8 x 1000W = 8 kW (GPUs alone) plus 2-3 kW overhead reaches 10-11 kW per server. Four servers per rack = 40-44 kW. But the rack-level airflow requirement of roughly 3,750 CFM demands hot-aisle containment, in-row cooling units, and high static pressure fans that add substantial infrastructure cost.
Above 40 kW per rack, air cooling requires raised-floor plenums with 6+ inches of static pressure, hot-aisle containment with dedicated CRAC units, and fan speeds that generate 85+ dBA. Many colocation facilities cannot provide this infrastructure. This is the practical air cooling wall for GPU clusters.
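Moving thousands of CFM against high static pressure is itself a power cost. A rough fan-power estimate, P = V̇ × Δp / η, where the static pressure and fan efficiency below are assumed illustrative values, not measured figures:

```python
def fan_power_w(cfm, static_pressure_pa=1200.0, efficiency=0.45):
    """Fan electrical power: P = V_dot * delta_p / eta.
    Static pressure and efficiency are assumed illustrative values."""
    vol_flow_m3s = cfm / 2118.88  # CFM -> m^3/s
    return vol_flow_m3s * static_pressure_pa / efficiency

# Airflow for a rack of four 8x B200 servers
print(f"Fan power: {fan_power_w(3750):.0f} W")  # roughly 4.7 kW per rack
```

Under these assumptions, fans alone consume several kilowatts per rack, which is pure overhead that liquid cooling largely eliminates.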
Direct-to-Chip Liquid Cooling
Direct-to-chip (D2C) liquid cooling replaces the air-cooled heatsink with a cold plate that circulates liquid coolant directly over the GPU die. The liquid absorbs heat and carries it to a Coolant Distribution Unit (CDU) outside the rack, which transfers heat to the building’s chilled water loop.
Thermal Advantage of Liquid
Water has a volumetric heat capacity approximately 3,400x higher than air:

ρ × c_p (water) = 998 × 4186 ≈ 4.2 × 10⁶ J/(m³·K)
ρ × c_p (air) = 1.2 × 1005 ≈ 1.2 × 10³ J/(m³·K)
This means liquid cooling requires roughly 3400x less volumetric flow rate than air to remove the same heat. In practice, a flow rate of 0.5-1.0 L/min per GPU is sufficient for 1000 W dissipation.
def liquid_flow_lpm(power_watts, delta_t_kelvin=10.0):
"""Calculate water flow rate in liters/min for given heat dissipation."""
rho = 998.0 # kg/m^3, water density
cp = 4186.0 # J/(kg*K), specific heat of water
mass_flow = power_watts / (cp * delta_t_kelvin) # kg/s
vol_flow = mass_flow / rho # m^3/s
lpm = vol_flow * 60000 # Convert to liters/min
return lpm
for gpu, tdp in [("H100", 700), ("B200", 1000), ("GB200 tray", 2700)]:
flow = liquid_flow_lpm(tdp)
print(f"{gpu}: {tdp}W -> {flow:.2f} L/min per module")
# H100: 700W -> 1.01 L/min
# B200: 1000W -> 1.44 L/min
# GB200 tray: 2700W -> 3.88 L/min
Cold Plate Design
The cold plate is a copper or aluminum block with internal microchannels (typically 0.2-0.5 mm wide) that maximize surface area contact with the coolant. Key specifications:
Material: Copper (k = 385 W/m*K)
Channel width: 0.3 mm
Channel depth: 2.0 mm
Number of channels: 80-120
Thermal resistance: 0.01-0.02 K/W (cold plate only)
Pressure drop: 20-50 kPa at 1 L/min
The total thermal resistance from die to coolant is:

R_total = R_TIM + R_plate + R_conv ≈ 0.035 K/W

For a B200 at 1000 W with coolant inlet at 35 C:

T_junction = 35 + 1000 × 0.035 = 70 C

This is well below the 83 C throttling threshold, giving 13 C of thermal headroom. Compare this to air cooling, where junction temperatures routinely hit 78-82 C under sustained load.
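The headroom calculation is sensitive to coolant inlet temperature, which is why supply-water temperature matters so much to facility design. A sketch using an illustrative 0.035 K/W die-to-coolant resistance:

```python
def d2c_junction_c(power_w, inlet_c, r_die_to_coolant=0.035):
    """Tj = T_coolant + Q * R (die-to-coolant). R is an illustrative value."""
    return inlet_c + power_w * r_die_to_coolant

# Headroom vs coolant inlet temperature for a 1000 W B200 (83 C throttle)
for inlet in [25, 35, 45]:
    tj = d2c_junction_c(1000, inlet)
    print(f"Inlet {inlet} C -> Tj {tj:.0f} C, headroom {83 - tj:.0f} C")
```

At a 45 C inlet only a few degrees of headroom remain, which bounds how warm the facility loop can run before the GPU throttles.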
Thermal Performance: Air vs Liquid Cold Plate (H100 SXM at 700W)
| Metric | Air Cooled | Liquid Cooled | Advantage |
|---|---|---|---|
| Junction temperature (sustained) | 80-83 C | 60-65 C | 15-20 C lower |
| Thermal resistance (die to ambient) | 0.08 K/W | 0.035 K/W | 2.3x lower |
| Fan/pump power overhead | 150-300 W | 20-50 W | 3-6x lower |
| Acoustic output per server | 80-85 dBA | 45-55 dBA | 25-35 dBA lower |
| Max rack density | 40 kW | 100+ kW | 2.5x higher |
| GPU clock speed (thermal throttle) | Base to -5% | Boost +3-5% | 8-10% effective gain |
Coolant Distribution Unit (CDU) Architecture
The CDU is the heat exchanger between the server-side coolant loop and the facility chilled water. A typical CDU for a 200 kW rack:
Primary loop (server side):
Coolant: Propylene glycol/water mix (30/70)
Flow rate: 40-80 L/min per rack
Supply temp: 30-40 C
Return temp: 45-55 C
Pressure: 200-400 kPa
Secondary loop (facility side):
Coolant: Chilled water
Flow rate: 60-120 L/min per rack
Supply temp: 7-15 C
Return temp: 15-25 C
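The facility-side flow can be sanity-checked with the same heat balance used for the server loop. The supply and return temperatures below are illustrative picks from the ranges above:

```python
def facility_water_lpm(rack_power_kw, supply_c=10.0, return_c=25.0):
    """Chilled-water flow needed to absorb rack heat: m_dot = Q / (cp * dT).
    Supply/return temperatures are illustrative values."""
    cp, rho = 4186.0, 998.0  # water
    dt = return_c - supply_c
    mass_flow_kg_s = rack_power_kw * 1000 / (cp * dt)
    return mass_flow_kg_s / rho * 60000  # L/min

print(f"100 kW rack: {facility_water_lpm(100):.0f} L/min")  # ~96 L/min
```

A 100 kW rack at a 15 K water-side rise needs roughly 96 L/min, consistent with the 60-120 L/min range quoted above.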
The CDU must handle transient thermal loads. When a training job launches on an idle cluster, GPU power consumption ramps from idle (~50 W) to full TDP (700-1000 W) within seconds. The CDU’s control loop must increase pump speed and adjust valve positions to maintain stable coolant temperature during this ramp.
class CDUController:
"""Simplified CDU control loop for rack-level liquid cooling."""
def __init__(self, max_pump_lpm=80, target_supply_c=35.0):
self.max_pump_lpm = max_pump_lpm
self.target_supply_c = target_supply_c
self.current_pump_lpm = 20.0 # Idle flow
def compute_required_flow(self, total_power_kw, delta_t_target=12.0):
"""Calculate pump flow rate for given rack power."""
# Q = m_dot * cp * delta_T
# m_dot = Q / (cp * delta_T)
cp = 3900.0 # J/(kg*K), 30% propylene glycol mix
rho = 1030.0 # kg/m^3
mass_flow = (total_power_kw * 1000) / (cp * delta_t_target)
vol_flow_lpm = (mass_flow / rho) * 60000
return min(vol_flow_lpm, self.max_pump_lpm)
def update(self, total_power_kw, return_temp_c):
"""PID-like control step."""
required_flow = self.compute_required_flow(total_power_kw)
# Ramp pump speed toward required flow
ramp_rate = 5.0 # L/min per control cycle
if required_flow > self.current_pump_lpm:
self.current_pump_lpm = min(
self.current_pump_lpm + ramp_rate, required_flow
)
else:
self.current_pump_lpm = max(
self.current_pump_lpm - ramp_rate, required_flow
)
return self.current_pump_lpm
# Example: a rack of DGX H100 servers ramping from near-idle to full load
cdu = CDUController()
for power_kw in [5, 20, 40, 56]:  # total rack power (kW) at each control step
flow = cdu.update(power_kw, 45.0)
print(f"Rack power: {power_kw} kW -> Pump: {flow:.1f} L/min")
Every liquid cooling deployment requires leak detection sensors at cold plate connections, manifold joints, and CDU internals. A single leak can destroy an entire server. Enterprise systems use conductive fluid sensors on drip trays under each server sled, with automatic pump shutoff within 500 ms of detection.
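A minimal sketch of that shutoff logic, with hypothetical sensor and pump interfaces (the 500 ms deadline is the only figure taken from the text above):

```python
import time

class LeakMonitor:
    """Illustrative leak-response sketch: isolate the loop within 500 ms
    of a wet-sensor reading. Sensor and pump interfaces are hypothetical."""
    SHUTOFF_DEADLINE_S = 0.5

    def __init__(self, shutoff_pump):
        self.shutoff_pump = shutoff_pump  # callback that isolates the loop

    def poll(self, sensor_wet: bool) -> bool:
        """Check one conductive sensor; trip the pump if it reads wet."""
        if sensor_wet:
            start = time.monotonic()
            self.shutoff_pump()
            assert time.monotonic() - start < self.SHUTOFF_DEADLINE_S
            return True
        return False

tripped = []
monitor = LeakMonitor(shutoff_pump=lambda: tripped.append("pump off"))
print(monitor.poll(sensor_wet=True), tripped)  # True ['pump off']
```

Real controllers run this loop in firmware with redundant sensors; the point is that the response path must be short and local, not routed through datacenter management software.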
Single-Phase Immersion Cooling
In single-phase immersion, servers are submerged in a dielectric fluid that remains liquid throughout the cooling process. The fluid absorbs heat from all components simultaneously — GPUs, CPUs, VRMs, memory, NVLink bridges — eliminating the need for individual cold plates.
Dielectric Fluid Properties
Common fluids include synthetic hydrocarbons (3M Novec, Shell Immersion), mineral oils, and engineered fluids. Key properties:
Dielectric Fluid Properties Comparison
| Property | Air | Water | 3M Novec 7100 | Mineral Oil | Shell S5 X |
|---|---|---|---|---|---|
| Thermal conductivity (W/m K) | 0.026 | 0.6 | 0.069 | 0.14 | 0.14 |
| Specific heat (J/kg K) | 1005 | 4186 | 1183 | 1670 | 1950 |
| Density (kg/m^3) | 1.2 | 998 | 1510 | 850 | 820 |
| Boiling point (C) | N/A | 100 | 61 | 300+ | 300+ |
| Dielectric strength (kV/mm) | 3 | N/A | 40 | 25 | 30 |
| Viscosity (mPa s at 25C) | 0.018 | 0.89 | 0.58 | 20-30 | 8.5 |
| GWP (Global Warming Potential) | N/A | N/A | 297 | 0 | 0 |
Tank Design and Flow Patterns
An immersion tank holds 4-20 server trays submerged vertically or horizontally. Natural convection drives fluid circulation: heated fluid rises from GPU surfaces, reaches the top of the tank, flows across a heat exchanger, cools, and sinks back down. Forced convection (pumps) augments natural convection for higher power densities.
Single-Phase Immersion Tank (typical 100 kW):
Tank dimensions: 1200 x 600 x 800 mm (L x W x H)
Fluid volume: ~400 liters
Server capacity: 8-12 server trays
Heat exchanger: Plate-type, top-mounted
Flow pattern: Bottom-up natural convection + top pump
Coolant supply: Facility chilled water to heat exchanger
Max power density: 100 kW per tank (250 kW with forced flow)
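One side effect of the ~400-liter fluid volume is thermal inertia: the tank can ride through brief cooling interruptions. A rough estimate using the mineral oil properties from the table below (an assumed 15 K allowable bulk-temperature rise):

```python
def ride_through_s(fluid_liters, rack_power_kw, allowed_rise_k=15.0,
                   rho=850.0, cp=1670.0):
    """Seconds until the bulk fluid warms by allowed_rise_k if cooling is lost.
    Default properties approximate mineral oil; the rise limit is assumed."""
    mass_kg = fluid_liters / 1000 * rho
    energy_j = mass_kg * cp * allowed_rise_k
    return energy_j / (rack_power_kw * 1000)

print(f"{ride_through_s(400, 100):.0f} s of buffer")  # ~85 s at full load
```

Air-cooled servers have essentially zero ride-through; a fan or CRAC failure causes throttling within seconds, while an immersion tank buys more than a minute.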
The heat transfer coefficient h for natural convection in dielectric fluid enters through:

Q = h × A × (T_surface - T_fluid)

where h depends on geometry and fluid properties, typically 50-200 W/(m²·K) for natural convection in hydrocarbons. For a GPU with a 50 cm² exposed surface at 80 C in 40 C fluid:

Q = 200 × 0.005 × (80 - 40) = 40 W

This is far too low for a 700 W GPU. In practice, immersion relies on the large total wetted surface area of the entire PCB (both sides), VRM heatsinks, and memory modules — plus forced convection from pumps.
def immersion_heat_transfer(
h_coeff: float, # W/(m^2*K), convection coefficient
total_area_m2: float, # Total wetted surface area
t_surface: float, # Component surface temperature (C)
t_fluid: float # Bulk fluid temperature (C)
) -> float:
"""Calculate heat removal in watts for immersed components."""
return h_coeff * total_area_m2 * (t_surface - t_fluid)
# Entire server board (both sides) with forced convection
# h ~ 500-1500 W/(m^2*K) with turbulent forced flow
total_area = 0.15 # m^2, total PCB + component surface area
h_forced = 800 # W/(m^2*K), forced convection in hydrocarbon
t_surface = 75 # C, average component temperature
t_fluid = 40 # C, bulk fluid temperature
q_total = immersion_heat_transfer(h_forced, total_area, t_surface, t_fluid)
print(f"Heat removal: {q_total:.0f} W") # 4200 W -- enough for full server
Air-cooled servers have thermal gradients of 20-30 C between inlet-side and exhaust-side components. In immersion, the fluid temperature is nearly uniform because the fluid’s thermal mass dampens local hot spots. This means all GPUs in a server run at similar temperatures, eliminating the “last GPU is hottest” problem that causes thermal throttling in air-cooled DGX systems.
Two-Phase Immersion Cooling
Two-phase immersion uses a low-boiling-point dielectric fluid (typically boiling at 49-61 C) that vaporizes on contact with hot components. The phase change absorbs latent heat — roughly 90-130 kJ/kg for engineered fluids — providing extremely efficient cooling.
Phase Change Physics
The latent heat of vaporization provides a massive thermal buffer:

ṁ = Q / h_fg

where h_fg is the latent heat of vaporization. For 3M Novec 7100, h_fg ≈ 112 kJ/kg. To dissipate 700 W:

ṁ = 700 / 112,000 ≈ 6.25 g/s
The vapor rises to a condenser coil at the top of the tank, where it condenses back to liquid and drips down. This creates a self-regulating cycle: hotter components generate more vapor and therefore receive more cooling.
def two_phase_flow_rate(power_watts, latent_heat_j_per_kg):
"""Calculate required fluid vaporization rate."""
mass_flow_kg_s = power_watts / latent_heat_j_per_kg
return mass_flow_kg_s * 1000 # g/s
# Different fluids
fluids = {
"Novec 7100": 112000, # J/kg
"Novec 649": 88000,
"FC-72": 88000,
"Water (reference)": 2260000,
}
for fluid, hfg in fluids.items():
rate = two_phase_flow_rate(700, hfg)
print(f"{fluid}: {rate:.2f} g/s to cool 700W")
# Novec 7100: 6.25 g/s
# Novec 649: 7.95 g/s
# FC-72: 7.95 g/s
# Water (reference): 0.31 g/s
Practical Challenges
Two-phase immersion faces deployment challenges that limit adoption:
- Fluid loss: Vapor escaping the tank during maintenance (opening the lid) represents direct fluid loss. Novec 7100 costs roughly $80/L, so a 400-liter tank holds $32,000 of fluid. Losing 1% per maintenance event adds up.
- Condenser sizing: The condenser must handle peak vapor generation from all GPUs simultaneously. Undersized condensers allow vapor to accumulate and pressurize the tank.
- Non-condensable gas management: Air ingress during maintenance dissolves in the fluid and later comes out of solution as bubbles, reducing heat transfer effectiveness.
(Chart: Cooling solution cost per rack, 100 kW rack, 5-year TCO, in thousands of USD.)
Rack Density and PUE Impact
The choice of cooling technology directly determines rack density (kW per rack) and Power Usage Effectiveness (PUE).
Rack Density and PUE by Cooling Method
| Cooling Method | Max Rack Density (kW) | Typical PUE | Cooling Overhead | Best For |
|---|---|---|---|---|
| Air (standard) | 15-20 | 1.4-1.6 | 30-40% of IT load | Small clusters, edge |
| Air (high-density) | 30-40 | 1.3-1.5 | 25-35% | A100 clusters |
| Direct-to-chip liquid | 60-120 | 1.1-1.2 | 8-15% | H100/B200 clusters |
| Single-phase immersion | 80-150 | 1.03-1.10 | 3-8% | Dense GPU racks |
| Two-phase immersion | 100-200 | 1.02-1.06 | 2-5% | Maximum density |
The PUE improvement from air (1.4) to liquid (1.1) saves significant operating cost. For a 10 MW datacenter:
def annual_cooling_cost(it_power_mw, pue, electricity_rate_per_kwh=0.08):
"""Calculate annual cooling electricity cost."""
cooling_power_mw = it_power_mw * (pue - 1.0)
annual_kwh = cooling_power_mw * 1000 * 8760
return annual_kwh * electricity_rate_per_kwh
# Compare cooling methods for 10 MW IT load
for method, pue in [("Air", 1.4), ("D2C Liquid", 1.1), ("Immersion", 1.05)]:
cost = annual_cooling_cost(10.0, pue)
print(f"{method} (PUE {pue}): ${cost:,.0f}/year cooling cost")
# Air (PUE 1.4): $2,803,200/year cooling cost
# D2C Liquid (PUE 1.1): $700,800/year cooling cost
# Immersion (PUE 1.05): $350,400/year cooling cost
GPU Thermal Throttling and Performance Impact
GPUs implement dynamic thermal management that reduces clock speed when junction temperature exceeds a threshold (typically 83 C for NVIDIA datacenter GPUs). The throttling curve is approximately linear between the throttle onset temperature and the shutdown temperature (typically 95 C).
// Simplified GPU thermal throttling model
struct ThermalThrottler {
float throttle_onset_c = 83.0f; // Start reducing clocks
float shutdown_c = 95.0f; // Emergency shutdown
float base_clock_mhz = 1410.0f; // H100 base clock
float boost_clock_mhz = 1620.0f; // H100 boost clock
float effective_clock(float junction_temp_c) {
if (junction_temp_c < throttle_onset_c) {
return boost_clock_mhz; // Full boost
}
if (junction_temp_c >= shutdown_c) {
return 0.0f; // Shutdown
}
// Linear throttle between onset and shutdown
float throttle_fraction =
(junction_temp_c - throttle_onset_c) /
(shutdown_c - throttle_onset_c);
float min_clock = base_clock_mhz * 0.7f; // 70% of base
return boost_clock_mhz -
throttle_fraction * (boost_clock_mhz - min_clock);
}
};
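The same linear model can be re-expressed in Python to tabulate effective clock against junction temperature (constants mirror the C++ struct above):

```python
def effective_clock_mhz(tj_c, onset=83.0, shutdown=95.0,
                        base=1410.0, boost=1620.0):
    """Linear throttle between onset and shutdown; mirrors the C++ model."""
    if tj_c < onset:
        return boost          # full boost below throttle onset
    if tj_c >= shutdown:
        return 0.0            # emergency shutdown
    frac = (tj_c - onset) / (shutdown - onset)
    min_clock = base * 0.7    # floor at 70% of base clock
    return boost - frac * (boost - min_clock)

for t in [80, 85, 90, 94]:
    print(f"{t} C -> {effective_clock_mhz(t):.0f} MHz")
```

The table makes the stakes concrete: every degree above 83 C costs roughly 50 MHz of sustained clock in this model.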
(Chart: Training throughput vs cooling method, 8x H100 SXM, Llama 70B, in tokens/sec.)
The 10-13% throughput improvement from liquid cooling is not just from avoiding throttling. Lower junction temperatures also improve transistor switching characteristics, reducing gate delay and allowing the GPU to sustain higher boost clocks without voltage increases.
Infrastructure Requirements
Direct-to-Chip Liquid Cooling Infrastructure
Per-rack requirements:
CDU: 1x rear-door or side-car CDU, 80-120 kW capacity
Manifolds: Supply and return manifolds per rack
Quick disconnects: Dripless QDs at each server sled
Leak detection: Conductive tape sensors under each sled
Facility water: 15-20 C chilled water supply, 40-80 L/min per rack
Per-server requirements:
Cold plates: 1 per GPU (8 per DGX), 1 per CPU (2 per server)
Hoses: Flexible tubing from cold plates to manifold
Flow balancing: Orifice or valve per cold plate branch
Immersion Tank Infrastructure
Per-tank requirements:
Tank: Sealed steel/aluminum enclosure, 400-600L capacity
Heat exchanger: Internal plate HX or external CDU
Fluid: 400-600L dielectric fluid ($20,000-$50,000)
Pump (single-phase): Submersible or external, 20-60 L/min
Condenser (two-phase): Roof-mounted or in-tank condenser coils
Fluid management: Filtration, dehumidification, top-off system
Immersion cooling complicates server maintenance. Removing a server sled requires draining or displacing fluid, waiting for the board to drip-dry (dielectric fluid is non-conductive but coats all surfaces), and handling fluid-slick components. Average hot-swap time increases from 5 minutes (air-cooled) to 30-45 minutes (immersion). For large clusters with frequent hardware failures, this maintenance overhead is significant.
Decision Framework
Choosing a cooling technology depends on cluster scale, rack density requirements, facility constraints, and operational maturity.
Cooling Technology Decision Matrix
| Factor | Air | D2C Liquid | Single-Phase Immersion | Two-Phase Immersion |
|---|---|---|---|---|
| CapEx per rack | $5-10K | $15-25K | $30-60K | $50-100K |
| OpEx (5yr, 100kW rack) | $175K | $44K | $22K | $15K |
| Max GPU TDP supported | 500W | 1500W+ | 1500W+ | 2000W+ |
| Maintenance complexity | Low | Medium | High | Very High |
| Retrofit to existing DC | N/A | Medium effort | Major renovation | Major renovation |
| Maturity (2025) | Decades | Production-ready | Early production | Pilot stage |
| Best GPU generation | A100 and below | H100/B200 | H100/B200/GB200 | Future >1kW GPUs |
For most organizations deploying H100 or B200 clusters in 2025, direct-to-chip liquid cooling is the recommended path. It delivers 90% of immersion’s thermal benefits at 40% of the infrastructure cost, with established supply chains from vendors like CoolIT, Asetek, and Vertiv. Immersion makes sense for purpose-built facilities optimizing for maximum density and minimum PUE, but the operational complexity and fluid costs limit adoption to hyperscalers and specialized HPC centers.
The GB200 NVL72 Reference Design
NVIDIA’s GB200 NVL72 represents the industry’s direction. It ships as a pre-integrated liquid-cooled rack containing 36 Grace CPUs and 72 Blackwell GPUs, consuming up to 120 kW per rack. The cooling system is not an afterthought — it is integral to the product design.
GB200 NVL72 Cooling Specifications:
Total rack power: 120 kW
Cooling method: Direct-to-chip liquid (all GPUs and CPUs)
Coolant: Propylene glycol/water
CDU: Integrated rear-door CDU
Facility water req: 25-30 C supply, 180+ L/min
Redundancy: N+1 pump, N+1 CDU
PUE contribution: ~1.05 (cooling only)
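The flow and temperature specs above can be cross-checked with the water-side heat balance; a sketch treating the facility loop as plain water:

```python
def water_delta_t_k(power_kw, flow_lpm):
    """Temperature rise of facility water absorbing power_kw at flow_lpm."""
    cp, rho = 4186.0, 998.0  # water
    mass_flow_kg_s = flow_lpm / 60000 * rho
    return power_kw * 1000 / (mass_flow_kg_s * cp)

print(f"{water_delta_t_k(120, 180):.1f} K rise")  # ~9.6 K across the rack
```

At 180 L/min, the full 120 kW raises the water by roughly 10 K, so a 25-30 C supply returns at roughly 35-40 C, warm enough for economizer (free) cooling in many climates.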
This design eliminates the cooling technology decision for customers: if you buy GB200 NVL72, you get liquid cooling. The rack arrives with plumbing pre-installed. The only facility requirement is chilled water supply at the specified flow rate and temperature.
This is likely the model for future GPU platforms. As TDP continues to climb toward 1500-2000 W per accelerator, liquid cooling transitions from optional to mandatory, and the distinction between “server” and “cooling system” disappears.
Summary
GPU cooling has evolved from a solved problem (bolt on a heatsink, point a fan at it) to a critical infrastructure decision that determines cluster density, operating cost, and even GPU performance. Air cooling hits its practical limit around 40 kW per rack and 500 W per GPU. Direct-to-chip liquid cooling extends the range to 120 kW per rack and 1500+ W per GPU while reducing PUE from 1.4 to 1.1. Immersion cooling pushes further but at higher complexity and cost. For current-generation AI clusters, direct-to-chip liquid cooling is the sweet spot — and it is rapidly becoming the default, not the exception.