An RTX 4090 has 16,384 CUDA cores. An H100 has 16,896 CUDA cores—3% more. The RTX 4090 costs about $1,800; the H100 costs about $30,000—16x more. On paper, the specs look suspiciously similar. In practice, the H100 delivers 3x the inference throughput for LLMs over 13B parameters, and the RTX 4090 cannot run distributed training workloads at all due to its PCIe-only connectivity. The price difference buys five non-negotiable capabilities for production deployment: 80 GB HBM3 vs 24 GB GDDR6X (3.3x capacity), 3.35 TB/s vs 1.01 TB/s bandwidth (3.3x), NVLink 900 GB/s vs PCIe 32 GB/s for multi-GPU (~28x effective inter-GPU bandwidth), ECC memory to prevent silent data corruption in month-long training runs, and MIG virtualization for multi-tenant serving. Whether the 16x premium is justified depends on one question: does your model fit in 24 GB and serve one user at a time, or does it not?
This post covers the architectural differences that justify the price gap, real-world LLM inference benchmarks at batch sizes 1-64, multi-GPU scaling efficiency with and without NVLink, reliability specifications, and the break-even cost analysis for different deployment scenarios.
Hardware Specification Comparison
H100 SXM vs RTX 4090: Full Specification Comparison
| Specification | H100 SXM | RTX 4090 | Ratio (H100/4090) |
|---|---|---|---|
| Architecture | Hopper (SM 9.0) | Ada Lovelace (SM 8.9) | Different |
| CUDA Cores | 16,896 | 16,384 | 1.03x |
| Tensor Cores (4th gen) | 528 | 512 | 1.03x |
| FP16 Tensor TFLOPS | 990 | 330 | 3.0x |
| INT8 Tensor TFLOPS | 1,979 | 660 | 3.0x |
| FP64 TFLOPS | 67 | 1.3 (rate-limited) | 51x |
| Memory | 80 GB HBM3 | 24 GB GDDR6X | 3.3x capacity |
| Memory Bandwidth | 3,350 GB/s | 1,008 GB/s | 3.3x |
| Interconnect | NVLink 4.0 (900 GB/s) | PCIe Gen4 x16 (32 GB/s per direction) | ~28x effective |
| TDP | 700W | 450W | 1.56x |
| ECC Memory | Yes (HBM3 built-in) | No (GDDR6X) | N/A |
| MIG Support | Yes (up to 7 instances) | No | N/A |
| Price (approx) | $30,000 | $1,800 | 16.7x |
Memory: The Defining Difference
HBM3 vs GDDR6X
The memory subsystem is where the H100 and RTX 4090 diverge most significantly:
H100 SXM Memory:
Type: HBM3 (High Bandwidth Memory, 3rd generation)
Capacity: 80 GB (5 stacks of 16 GB)
Bandwidth: 3,350 GB/s
Interface: 5120-bit wide (5 stacks x 1024 bits each)
ECC: Yes, built into HBM3 standard
Technology: 3D-stacked DRAM dies on silicon interposer
Cost: HBM3 alone costs ~$3,000-5,000 per GPU
RTX 4090 Memory:
Type: GDDR6X (Graphics DDR6, enhanced)
Capacity: 24 GB
Bandwidth: 1,008 GB/s
Interface: 384-bit wide (12 channels x 32 bits)
ECC: No (GDDR6X does not include ECC)
Technology: Standard DRAM packages on PCB
Cost: GDDR6X costs ~$100-150 per GPU
Bandwidth comparison:
H100: 3,350 GB/s -> 3.3x more than RTX 4090
Per-GB bandwidth: H100 = 41.9 GB/s/GB, RTX 4090 = 42.0 GB/s/GB
Surprisingly similar per-GB bandwidth! The H100 advantage is capacity,
not bandwidth efficiency.
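These bandwidth numbers set a hard floor on single-stream decode latency, because every weight byte must be streamed from memory once per generated token. A minimal sketch of that bound (spec-sheet bandwidths from the table above; real decoders land below these ceilings due to KV-cache traffic and kernel overhead):
# Lower bound on per-token decode latency: every weight byte is read
# once per token, so time >= model_bytes / memory_bandwidth.
def min_decode_ms(model_gb, bandwidth_gbs):
    return model_gb / bandwidth_gbs * 1000

# Llama-2-7B FP16 (14 GB of weights):
print(min_decode_ms(14, 3350))  # H100:     ~4.2 ms/token (~240 tok/s ceiling)
print(min_decode_ms(14, 1008))  # RTX 4090: ~13.9 ms/token (~72 tok/s ceiling)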
Why Memory Capacity Matters for LLMs
# Memory requirements for LLM inference
def model_memory_requirements(params_B, dtype_bytes, kv_cache_config,
                              kv_dtype_bytes=None):
    """Calculate GPU memory needed (weights + KV cache), in GB."""
    if kv_dtype_bytes is None:
        # KV cache often stays at FP16 even under weight-only quantization
        kv_dtype_bytes = dtype_bytes
    # Model weights
    weight_gb = params_B * 1e9 * dtype_bytes / 1e9
    # KV cache per token: 2 (K and V) x layers x KV heads x head_dim x bytes
    n_layers = kv_cache_config["n_layers"]
    n_kv_heads = kv_cache_config["n_kv_heads"]
    head_dim = kv_cache_config["head_dim"]
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_dtype_bytes
    # KV cache at max context and batch
    max_seq = kv_cache_config["max_seq_len"]
    max_batch = kv_cache_config["max_batch_size"]
    kv_gb = kv_bytes_per_token * max_seq * max_batch / 1e9
    return weight_gb, kv_gb, weight_gb + kv_gb
# Llama-2-7B at FP16:
weights, kv, total = model_memory_requirements(
    7, 2, {"n_layers": 32, "n_kv_heads": 32, "head_dim": 128,
           "max_seq_len": 4096, "max_batch_size": 1}
)
# weights: 14 GB, kv: 2.1 GB, total: 16.1 GB
# Fits on RTX 4090 (24 GB)! Room for BS=1 inference.
# Llama-2-7B at FP16, batch serving (BS=32):
weights_bs32, kv_bs32, total_bs32 = model_memory_requirements(
    7, 2, {"n_layers": 32, "n_kv_heads": 32, "head_dim": 128,
           "max_seq_len": 4096, "max_batch_size": 32}
)
# weights: 14 GB, kv: 68.7 GB, total: 82.7 GB if every sequence hits
# the full 4096-token context. Does NOT fit on RTX 4090 (24 GB);
# even an H100-80GB needs a paged KV cache (e.g. vLLM) or shorter
# average contexts to hold this batch.
# Llama-2-70B at FP16:
weights_70b, kv_70b, total_70b = model_memory_requirements(
    70, 2, {"n_layers": 80, "n_kv_heads": 8, "head_dim": 128,
            "max_seq_len": 4096, "max_batch_size": 1}
)
# weights: 140 GB, kv: 1.3 GB (GQA: only 8 KV heads), total: 141.3 GB
# Needs 2x H100-80GB. Capacity-wise this would take 6+ RTX 4090s,
# but tensor parallelism over PCIe is too slow to be practical
# (see the interconnect section below).
# Llama-2-70B at INT4 (AWQ), KV cache kept at FP16:
weights_70b_q, kv_70b_q, total_70b_q = model_memory_requirements(
    70, 0.5, {"n_layers": 80, "n_kv_heads": 8, "head_dim": 128,
              "max_seq_len": 4096, "max_batch_size": 1},
    kv_dtype_bytes=2,
)
# weights: 35 GB, kv: 1.3 GB, total: 36.3 GB. Still exceeds RTX 4090's 24 GB.
# Even with an INT8 KV cache: ~35.7 GB. Still too large.
# RTX 4090 can only run 70B with 2-3 bit quantization (quality concerns).
The 24 GB limit on the RTX 4090 is a hard constraint: it caps both the maximum model size and the maximum batch size for serving. A single H100-80GB can serve Llama-70B at INT4 with batch sizes up to 16, while an RTX 4090 cannot run 70B at any quantization level that preserves reasonable quality (the INT4 weights alone are 35 GB).
Compute Throughput: The 3x Gap
Why Tensor Core TFLOPS Differ by 3x Despite Similar Core Counts
The H100 and RTX 4090 have nearly identical numbers of CUDA cores and tensor cores. The 3x TFLOPS gap comes from the tensor core architecture and clock speed:
H100 Tensor Core (4th gen Hopper):
Instruction: wgmma (Warp Group MMA), asynchronous
Shape: 64x128x16 (FP16) per warp group (m64nNk16 family)
A warp group = 4 warps = 128 threads
Throughput: 2,048 dense FP16 FMA ops per clock per SM
RTX 4090 Tensor Core (4th gen Ada):
Instruction: mma (same as Ampere-class)
Shape: 16x8x16 (FP16) per warp
Throughput: 512 dense FP16 FMA ops per clock per SM (FP16 accumulate)
(Ada tensor cores are similar to Ampere's, not Hopper's)
Clock speed:
H100: 1,830 MHz boost (lower, optimized for sustained throughput)
RTX 4090: 2,520 MHz boost (higher, consumer gaming optimization)
Calculation:
H100: 132 SMs x 2,048 FMA/clock x 1,830 MHz x 2 FLOPs/FMA = 990 TFLOPS
RTX 4090: 128 SMs x 512 FMA/clock x 2,520 MHz x 2 FLOPs/FMA = 330 TFLOPS
The 3x gap comes from:
- Hopper SMs execute 4x the tensor FMAs per clock of Ada SMs (4x)
- Slightly more SMs (132 vs 128) (1.03x)
- Lower boost clock (1,830 vs 2,520 MHz) (0.73x)
- Net: 4 x 1.03 x 0.73 = 3.0x
- Real-world gap: 2.5-3x in GEMM throughput
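The same derivation as a runnable check (SM counts, per-SM FMA rates, and boost clocks as listed above):
# Peak dense FP16 tensor TFLOPS = SMs x FMA/clock/SM x 2 FLOPs/FMA x clock.
def peak_tflops(sms, fma_per_clock_per_sm, clock_ghz):
    return sms * fma_per_clock_per_sm * 2 * clock_ghz / 1000

print(peak_tflops(132, 2048, 1.83))  # H100 SXM: ~989 TFLOPS
print(peak_tflops(128, 512, 2.52))   # RTX 4090: ~330 TFLOPS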
Real-World GEMM Benchmarks
# GEMM throughput comparison: H100 vs RTX 4090
# cuBLAS, FP16, various problem sizes
gemm_benchmarks = {
# (M, N, K): (H100_TFLOPS, RTX4090_TFLOPS)
(4096, 4096, 4096): (305, 120), # Large square: H100 2.5x
(8192, 8192, 8192): (380, 145), # Very large: H100 2.6x
(1, 4096, 4096): (0.8, 0.35), # GEMV (decode): H100 2.3x
(32, 4096, 4096): (22, 9.5), # Small batch: H100 2.3x
(512, 4096, 4096): (285, 112), # Medium batch: H100 2.5x
(1, 14336, 4096): (1.1, 0.52), # Llama MLP GEMV: H100 2.1x
}
# Key observation: The GEMV (decode, M=1) ratio is only 2.1-2.3x,
# not 3x, because GEMV is memory-bandwidth-bound, not compute-bound.
# H100 bandwidth advantage: 3,350/1,008 = 3.3x
# But GEMV does not saturate bandwidth due to small M.
# Effective bandwidth ratio at M=1: ~2.0-2.5x
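The memory-bound behavior of GEMV falls out of a simple arithmetic-intensity check. A sketch, assuming each operand is read or written exactly once:
# Arithmetic intensity of an MxNxK GEMM, in FLOPs per byte.
def gemm_intensity(M, N, K, dtype_bytes=2):
    flops = 2 * M * N * K
    traffic = dtype_bytes * (M * K + K * N + M * N)  # read A, B; write C
    return flops / traffic

# H100 ridge point: 990 TFLOPS / 3.35 TB/s ~ 295 FLOPs/byte.
print(gemm_intensity(1, 4096, 4096))     # ~1.0: far below ridge, bandwidth-bound
print(gemm_intensity(4096, 4096, 4096))  # ~1365: well above ridge, compute-bound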
[Figure: GEMM Throughput, H100 SXM vs RTX 4090 (cuBLAS FP16), in TFLOPS]
Multi-GPU Scaling: NVLink vs PCIe
The Interconnect Gap
This is where the datacenter GPU advantage is most dramatic:
H100 SXM interconnect:
NVLink 4.0: 900 GB/s bidirectional (18 links x 50 GB/s each)
NVSwitch: enables all-to-all NVLink between 8 GPUs in a node
Inter-node: InfiniBand 400 Gb/s (50 GB/s)
RTX 4090 interconnect:
PCIe Gen4 x16: 32 GB/s each direction (64 GB/s bidirectional) on paper
No P2P support on GeForce: GPU-to-GPU traffic is staged through host
memory, so effective inter-GPU bandwidth is roughly 16 GB/s
No NVLink (removed from consumer cards after the RTX 30-series)
No SLI/NVLink bridge for multi-GPU
Ratio: 450 GB/s per direction vs ~16 GB/s effective = ~28x inter-GPU bandwidth gap
Impact on tensor parallelism:
Tensor parallel GEMM requires all-reduce after each layer.
All-reduce of a 4096-element FP16 vector (8 KB):
- NVLink: 8 KB / 450 GB/s = 0.018 us (effectively free)
- PCIe: 8 KB / 16 GB/s = 0.5 us (still small)
All-reduce of a 8192x8192 FP16 matrix (128 MB):
- NVLink: 128 MB / 450 GB/s = 0.28 ms
- PCIe: 128 MB / 16 GB/s = 8.0 ms (28x slower!)
# Tensor parallel efficiency comparison
def tp_efficiency(gemm_time_ms, comm_time_ms, num_gpus):
"""
Efficiency of tensor parallelism.
Each GPU does 1/num_gpus of the GEMM, then all-reduce.
"""
per_gpu_gemm = gemm_time_ms / num_gpus
total_step = per_gpu_gemm + comm_time_ms
ideal_step = gemm_time_ms / num_gpus
efficiency = ideal_step / total_step
return efficiency
# Llama-70B, single layer GEMM, TP=2
# GEMM: 8192x8192 FP16, ~3.5 ms on single H100
# All-reduce: 128 MB
# H100 NVLink:
eff_h100 = tp_efficiency(3.5, 0.28, 2)
# per_gpu_gemm = 1.75, total = 1.75 + 0.28 = 2.03, eff = 86%
# RTX 4090 PCIe:
# GEMM on 4090: ~8.5 ms for same size
eff_4090 = tp_efficiency(8.5, 8.0, 2)
# per_gpu_gemm = 4.25, total = 4.25 + 8.0 = 12.25, eff = 35%
# RTX 4090 tensor parallelism is catastrophically inefficient
# because PCIe communication time exceeds compute time
Multi-GPU Scaling Efficiency (Llama-70B Inference, TP=2)
| Configuration | Per-GPU GEMM | Communication | Total Step | Efficiency |
|---|---|---|---|---|
| 2x H100 NVLink | 1.75 ms | 0.28 ms | 2.03 ms | 86% |
| 2x A100 NVLink | 2.80 ms | 0.42 ms | 3.22 ms | 87% |
| 2x RTX 4090 PCIe | 4.25 ms | 8.0 ms | 12.25 ms | 35% |
| 2x RTX 3090 NVLink | 5.10 ms | 1.2 ms | 6.30 ms | 81% |
The RTX 3090 was the last consumer GPU with NVLink support (NVLink 3.0, 112.5 GB/s bidirectional). Its multi-GPU scaling is dramatically better than the RTX 4090’s PCIe-only interconnect. NVIDIA removed NVLink from consumer cards specifically to segment the datacenter market.
Reliability and Uptime
ECC Memory
H100 HBM3 with ECC:
- Single-bit errors detected and corrected automatically (SECDED)
- Double-bit errors detected (causes GPU reset, not silent corruption)
- Error rate: ~1 correctable error per GB per year (HBM3 at datacenter temps)
- For 80 GB: ~80 correctable errors per year (all handled transparently)
- Uncorrectable error rate: ~1 per 100 GPU-years
RTX 4090 GDDR6X without ECC:
- No error detection or correction
- Bit-flip rate: ~1 per GB per month at normal temperatures
- For 24 GB: ~24 silent bit-flips per month
- These corrupt model weights, KV cache, or activations silently
- Impact: occasional garbage output, NaN propagation, wrong results
For a training run:
- 1000 H100s for 3 months: ~20,000 correctable errors (all handled)
- 1000 RTX 4090s for 3 months: ~72,000 silent corruptions
- Training would produce incorrect results on consumer GPUs
For inference:
- H100: bit-flips are invisible to the user (ECC handles them)
- RTX 4090: occasional corrupted output, recoverable by re-running
- At low request volume: acceptable (just retry)
- At high request volume (serving): unacceptable SLA risk
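The fleet arithmetic above as a sketch (the per-GB rates are the rough order-of-magnitude figures quoted earlier, not vendor specifications):
# Expected memory-error count for a fleet over a time window.
def fleet_errors(num_gpus, gb_per_gpu, errors_per_gb_month, months):
    return num_gpus * gb_per_gpu * errors_per_gb_month * months

# 1,000 H100s, 3 months, ~1 correctable error/GB/year (~1/12 per month):
print(fleet_errors(1000, 80, 1 / 12, 3))  # ~20,000, all corrected by ECC
# 1,000 RTX 4090s, 3 months, ~1 bit-flip/GB/month, no ECC:
print(fleet_errors(1000, 24, 1, 3))       # ~72,000 silent corruptions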
Thermal Design and Sustained Performance
H100 SXM (datacenter cooling):
TDP: 700W
Thermal solution: cold plate with liquid cooling, or high-airflow datacenter chassis
Sustained performance: ~100% of rated TFLOPS indefinitely
Operating temperature: 45-80C junction (controlled by datacenter cooling)
No thermal throttling under sustained workloads when cooled to spec
RTX 4090 (air-cooled):
TDP: 450W
Thermal solution: 3-slot air cooler
Sustained performance: 85-95% of boost clock under heavy compute
Operating temperature: 65-85C junction (depends on case airflow)
Thermal throttling: reduces clock speed above ~83C junction
Power limit: may reduce clock under sustained 450W loads
Impact on sustained throughput:
H100: rated 990 TFLOPS, achieves 950-980 TFLOPS sustained (96-99%)
RTX 4090: rated 330 TFLOPS, achieves 280-310 TFLOPS sustained (85-94%)
The gap increases under sustained workloads (training, serving)
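In code, the derating is just a multiplier on rated throughput (derate ranges taken from the figures above):
# Sustained throughput range after thermal/power derating.
def sustained_range(rated_tflops, derate_lo, derate_hi):
    return rated_tflops * derate_lo, rated_tflops * derate_hi

print(sustained_range(990, 0.96, 0.99))  # H100:     ~950-980 TFLOPS
print(sustained_range(330, 0.85, 0.94))  # RTX 4090: ~280-310 TFLOPS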
Cost-Performance Analysis
Cost Per Token
# Cost-performance analysis for LLM inference
def cost_per_token_comparison():
# Hardware costs (purchase price, not cloud)
h100_cost = 30000 # USD
rtx4090_cost = 1800 # USD
# Operating costs (power + cooling, per year)
h100_power_cost = 700 * 8760 * 0.08 / 1000 # 700W * hours/year * $/kWh
# = $490/year
rtx4090_power_cost = 350 * 8760 * 0.08 / 1000 # 350W average (throttled)
# = $245/year
# 3-year TCO
h100_tco = h100_cost + 3 * h100_power_cost # $31,470
rtx4090_tco = rtx4090_cost + 3 * rtx4090_power_cost # $2,535
# Throughput for Llama-2-7B FP16 decode (BS=1)
h100_tps = 95 # tokens/sec
rtx4090_tps = 42 # tokens/sec
# Tokens over 3 years (24/7 operation)
h100_total_tokens = h100_tps * 86400 * 365 * 3
rtx4090_total_tokens = rtx4090_tps * 86400 * 365 * 3
# Cost per million tokens
h100_cpm = h100_tco / (h100_total_tokens / 1e6)
rtx4090_cpm = rtx4090_tco / (rtx4090_total_tokens / 1e6)
return {
"H100": {"tco_3yr": h100_tco, "cpm": h100_cpm, "tps": h100_tps},
"RTX4090": {"tco_3yr": rtx4090_tco, "cpm": rtx4090_cpm, "tps": rtx4090_tps},
}
# Results (Llama-2-7B FP16, BS=1):
# H100: TCO=$31,470, throughput=95 tok/s, cost=$3.50/M tokens
# RTX4090: TCO=$2,535, throughput=42 tok/s, cost=$0.64/M tokens
# RTX 4090 is ~5.5x cheaper per token for this specific workload!
# For Llama-2-7B FP16, BS=32 (batch serving):
# H100: throughput=2,800 tok/s, cost=$0.12/M tokens
# RTX4090: cannot hold 32 concurrent 4K contexts in 24 GB (the KV
# cache alone needs ~69 GB; see the memory section). It tops out
# around BS=4, so its batch throughput plateaus while the H100
# keeps scaling. At batch serving, the H100 wins on both cost per
# token and feasibility.
# For Llama-2-70B (any precision):
# H100: runs (1 GPU at INT4), throughput=98 tok/s
# RTX4090: CANNOT RUN (insufficient memory)
# H100 wins by default.
Cost Per Million Tokens (3-Year TCO, 24/7 Operation)
| Model + Config | H100 Cost/M Tokens | RTX 4090 Cost/M Tokens | Winner |
|---|---|---|---|
| Llama-7B FP16, BS=1 | $3.50 | $0.64 | RTX 4090 (5.5x cheaper) |
| Llama-7B INT4, BS=1 | $1.80 | $0.38 | RTX 4090 (4.7x cheaper) |
| Llama-7B FP16, BS=32 | $0.12 | N/A (KV cache OOM at 4K context) | H100 (only option) |
| Llama-13B FP16, BS=1 | $5.20 | $1.50 | RTX 4090 (3.5x cheaper) |
| Llama-70B INT4, BS=1 | $3.40 | N/A (OOM) | H100 (only option) |
| Llama-70B INT4, BS=16 | $0.20 | N/A (OOM) | H100 (only option) |
MIG: Multi-Instance GPU
The H100 supports MIG (Multi-Instance GPU), which partitions a single GPU into up to 7 isolated instances, each with guaranteed memory and compute:
H100 MIG configurations:
1g.10gb: 1/7 of GPU (14.3% compute, 10 GB memory) x 7 instances
2g.20gb: 2/7 of GPU (28.6% compute, 20 GB memory) x 3 instances
3g.40gb: 3/7 of GPU (42.9% compute, 40 GB memory) x 2 instances
7g.80gb: Full GPU (100% compute, 80 GB memory) x 1 instance
Use case: multi-tenant serving
- Run 7 different small models on one H100
- Each instance is isolated (no interference, no memory leaks)
- Each instance has its own error counters, performance counters
- Kubernetes/Docker can assign MIG instances to containers
Example: 7 instances of Llama-7B INT4 (each needs ~4 GB weights + 6 GB KV)
Each instance: 10 GB memory, ~140 TFLOPS FP16
Total: 7 models running in isolation on one $30K GPU
Effective cost per model: $4,300
RTX 4090: No MIG support
- Must run one model per GPU
- No isolation between workloads
- Cannot share a GPU safely between users
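A quick way to see the multi-tenancy economics: divide the card price by the instance count for each profile. A sketch using the approximate $30K price from the spec table:
# Cost per isolated MIG instance, per profile (profile -> instance count).
mig_profiles = {"1g.10gb": 7, "2g.20gb": 3, "3g.40gb": 2, "7g.80gb": 1}
h100_price = 30_000
for profile, count in mig_profiles.items():
    print(f"{profile}: {count} instances, ${h100_price / count:,.0f} per model")
# 1g.10gb: 7 instances, $4,286 per model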
When to Use Each GPU
# Decision framework
def choose_gpu(workload):
# Single-user inference on models <= 13B
if (workload["model_size_B"] <= 13
and workload["batch_size"] <= 4
and workload["users"] == 1):
return "RTX 4090"
# Cheaper per token, sufficient memory, no multi-GPU needed
# Batch serving (multiple concurrent users)
if workload["batch_size"] >= 16 or workload["users"] >= 8:
return "H100"
# Higher batch throughput, more KV cache memory, MIG for isolation
# Large models (>= 30B parameters)
if workload["model_size_B"] >= 30:
if workload["quantization_bits"] >= 4:
return "H100" # Only option with enough memory
else:
return "H100" # Even at 2-bit, 70B = 18 GB, tight on 4090
# Training (any scale)
if workload["type"] == "training":
if workload["multi_gpu"]:
return "H100" # NVLink required for efficient TP/PP/DP
else:
return "RTX 4090 or H100" # Single-GPU fine-tuning: 4090 is fine
# Development and experimentation
if workload["type"] == "development":
return "RTX 4090" # Cost-effective for iteration
# Edge / embedded inference
if workload["power_budget_watts"] < 100:
return "Neither" # Use Jetson, L4, or T4
return "H100" # Default to datacenter for production
[Figure: Where Each GPU Wins, relative performance per dollar (higher is better)]
The RTX 4090 Cluster Fallacy
A common argument: “Buy 16 RTX 4090s for the price of one H100 and get more total FLOPS.” This ignores three critical factors:
1. Interconnect bottleneck:
16x RTX 4090s connected via PCIe: ~16-32 GB/s effective per GPU (no P2P)
1x H100 SXM in an NVLink node: 900 GB/s per GPU
For any workload requiring multi-GPU communication (which includes
ALL models that don't fit on one 24 GB GPU), the RTX 4090 cluster
spends more time communicating than computing.
2. Memory fragmentation:
16x RTX 4090: 16 x 24 GB = 384 GB total, but distributed
1x H100: 80 GB contiguous
A 70B FP16 model (140 GB) requires tensor parallelism.
On RTX 4090s: need 6 GPUs (6x24=144 GB), but TP over PCIe is 35% efficient
On H100: need 2 GPUs (2x80=160 GB), TP over NVLink is 86% efficient
Effective throughput: 2x H100 beats 6x RTX 4090 (sketched after this list)
3. Reliability and operations:
16 consumer GPUs = 16 potential failure points without ECC
Fan failures, thermal throttling, driver instability
No MIG, no hardware monitoring designed for 24/7 operation
Operational overhead far exceeds hardware savings
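To put rough numbers on points 1 and 2, reuse the tensor-parallel efficiency figures from the NVLink section. Illustrative only, applying the TP=2 efficiencies; PCIe efficiency only degrades further as GPU count grows:
# Effective compute = per-GPU peak x GPU count x TP efficiency.
def effective_tflops(per_gpu_tflops, num_gpus, efficiency):
    return per_gpu_tflops * num_gpus * efficiency

print(effective_tflops(990, 2, 0.86))  # 2x H100 NVLink:   ~1,703 effective TFLOPS
print(effective_tflops(330, 6, 0.35))  # 6x RTX 4090 PCIe:   ~693 effective TFLOPS
# Equal raw totals (1,980 TFLOPS each side), but the H100 pair delivers
# ~2.5x the effective throughput once communication is accounted for.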
Building an “RTX 4090 cluster” for production AI serving is a false economy. The lack of NVLink makes multi-GPU workloads 2-4x slower than equivalent H100 configurations, ECC absence risks silent data corruption, and consumer GPUs are not designed for 24/7 datacenter operation. RTX 4090s are excellent for single-GPU workloads (development, personal inference, small model serving) but are not a viable H100 replacement for production multi-GPU deployments.
Summary
The H100 and RTX 4090 serve fundamentally different markets. The RTX 4090 is 3-5x more cost-effective per token for single-GPU inference on models that fit in 24 GB (up to 13B at FP16 or ~30B at INT4). The H100 is necessary for: models that require more than 24 GB, batch serving with more than about 4 concurrent requests, multi-GPU tensor parallelism (NVLink vs PCIe is roughly a 28x effective bandwidth gap), production serving requiring ECC and reliability, and multi-tenant deployment via MIG. The choice is not which GPU is “better” but which workload you are running. For development, personal use, and small-model inference, the RTX 4090 is the rational choice. For production serving, large-model deployment, and training, the H100’s premium is justified by capabilities that the RTX 4090 physically cannot provide.