Part of Series GPU Hardware & AI Accelerators 22 of 30

An RTX 4090 has 16,384 CUDA cores. An H100 has 16,896 CUDA cores—3% more. The RTX 4090 costs $1,800. The H100 costs $30,000—16x more. On paper, the specs look suspiciously similar. In practice, the H100 delivers 3x the inference throughput for LLMs over 13B parameters, and the RTX 4090 cannot run distributed training workloads at all due to PCIe-only connectivity. The price difference buys five non-negotiable capabilities for production deployment: 80 GB HBM3 vs 24 GB GDDR6X (3.3x capacity), 3.35 TB/s vs 1.01 TB/s bandwidth (3.3x), NVLink 900 GB/s vs PCIe 32 GB/s for multi-GPU (28x inter-GPU bandwidth), ECC memory to prevent silent data corruption in month-long training runs, and MIG virtualization for multi-tenant serving. Whether the 16x premium is justified depends on one question: does your model fit in 24 GB and serve one user at a time, or does it not?

This post covers the architectural differences that justify the price gap, real-world LLM inference benchmarks at batch sizes 1-64, multi-GPU scaling efficiency with and without NVLink, reliability specifications, and the break-even cost analysis for different deployment scenarios.

Hardware Specification Comparison

📊

H100 SXM vs RTX 4090: Full Specification Comparison

Specification    H100 SXM    RTX 4090    Ratio (H100/4090)
Architecture Hopper (SM 9.0) Ada Lovelace (SM 8.9) Different
CUDA Cores 16,896 16,384 1.03x
Tensor Cores (4th gen) 528 512 1.03x
FP16 Tensor TFLOPS 990 330 3.0x
INT8 Tensor TFLOPS 1,979 660 3.0x
FP64 TFLOPS 67 1.3 (rate-limited) 51x
Memory 80 GB HBM3 24 GB GDDR6X 3.3x capacity
Memory Bandwidth 3,350 GB/s 1,008 GB/s 3.3x
Interconnect NVLink 4.0 (900 GB/s) PCIe Gen4 x16 (32 GB/s) 28x
TDP 700W 450W 1.56x
ECC Memory Yes (HBM3 built-in) No (GDDR6X) N/A
MIG Support Yes (up to 7 instances) No N/A
Price (approx) $30,000 $1,800 16.7x

Memory: The Defining Difference

HBM3 vs GDDR6X

The memory subsystem is where the H100 and RTX 4090 diverge most significantly:

H100 SXM Memory:
  Type: HBM3 (High Bandwidth Memory, 3rd generation)
  Capacity: 80 GB (5 stacks of 16 GB)
  Bandwidth: 3,350 GB/s
  Interface: 5120-bit wide (5 stacks x 1024 bits each)
  ECC: Yes, built into HBM3 standard
  Technology: 3D-stacked DRAM dies on silicon interposer
  Cost: HBM3 alone costs ~$3,000-5,000 per GPU

RTX 4090 Memory:
  Type: GDDR6X (Graphics DDR6, enhanced)
  Capacity: 24 GB
  Bandwidth: 1,008 GB/s
  Interface: 384-bit wide (12 channels x 32 bits)
  ECC: No (GDDR6X does not include ECC)
  Technology: Standard DRAM packages on PCB
  Cost: GDDR6X costs ~$100-150 per GPU

Bandwidth comparison:
  H100: 3,350 GB/s -> 3.3x more than RTX 4090
  Per-GB bandwidth: H100 = 41.9 GB/s/GB, RTX 4090 = 42.0 GB/s/GB
  Surprisingly similar per-GB bandwidth! The H100 advantage is capacity,
  not bandwidth efficiency.
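The bandwidth numbers translate directly into batch-1 decode speed: each generated token must stream every model weight from memory, so bandwidth sets a hard ceiling on tokens per second. A back-of-envelope sketch — the 70% achievable-bandwidth factor is an assumption here, and real serving stacks land well below this ceiling due to framework overhead:

```python
# Bandwidth-bound ceiling on single-stream (BS=1) decode throughput:
# every token reads all weights once, so tok/s <= achievable BW / weight bytes.
# The 0.7 efficiency factor is an illustrative assumption, not a measured value.

def decode_tokens_per_sec(bandwidth_gbps, weight_gb, efficiency=0.7):
    """Upper bound on batch-1 decode tokens/sec for a weight-streaming workload."""
    return bandwidth_gbps * efficiency / weight_gb

# Llama-2-7B FP16 (14 GB of weights):
h100 = decode_tokens_per_sec(3350, 14)     # ~167 tok/s ceiling
rtx4090 = decode_tokens_per_sec(1008, 14)  # ~50 tok/s ceiling
# The ratio tracks the 3.3x bandwidth gap, independent of TFLOPS.
```

This is why the decode-throughput gap between the two cards follows the bandwidth ratio rather than the 3x compute ratio.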

Why Memory Capacity Matters for LLMs

# Memory requirements for LLM inference

def model_memory_requirements(params_B, dtype_bytes, kv_cache_config):
    """Calculate total GPU memory needed."""
    # Model weights
    weight_gb = params_B * 1e9 * dtype_bytes / 1e9

    # KV cache per token
    n_layers = kv_cache_config["n_layers"]
    n_kv_heads = kv_cache_config["n_kv_heads"]
    head_dim = kv_cache_config["head_dim"]
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

    # KV cache for max context
    max_seq = kv_cache_config["max_seq_len"]
    max_batch = kv_cache_config["max_batch_size"]
    kv_gb = kv_bytes_per_token * max_seq * max_batch / 1e9

    return weight_gb, kv_gb, weight_gb + kv_gb

# Llama-2-7B at FP16:
weights, kv, total = model_memory_requirements(
    7, 2, {"n_layers": 32, "n_kv_heads": 32, "head_dim": 128,
            "max_seq_len": 4096, "max_batch_size": 1}
)
# weights: 14 GB, kv: ~2.1 GB, total: ~16 GB
# Fits on RTX 4090 (24 GB)! Room for BS=1 inference.

# Llama-2-7B at FP16, batch serving (BS=16):
weights_bs16, kv_bs16, total_bs16 = model_memory_requirements(
    7, 2, {"n_layers": 32, "n_kv_heads": 32, "head_dim": 128,
            "max_seq_len": 4096, "max_batch_size": 16}
)
# weights: 14 GB, kv: ~34 GB, total: ~48 GB
# Does NOT fit on RTX 4090 (24 GB)! Needs H100-80GB.

# Llama-2-70B at FP16:
weights_70b, kv_70b, total_70b = model_memory_requirements(
    70, 2, {"n_layers": 80, "n_kv_heads": 8, "head_dim": 128,
            "max_seq_len": 4096, "max_batch_size": 1}
)
# weights: 140 GB, kv: ~1.3 GB, total: ~141 GB
# Needs 2x H100-80GB. Does not fit on any number of RTX 4090s
# (24 GB per card, and PCIe inter-GPU bandwidth is too slow for TP).

# Llama-2-70B at INT4 (AWQ):
weights_70b_q, kv_70b_q, total_70b_q = model_memory_requirements(
    70, 0.5, {"n_layers": 80, "n_kv_heads": 8, "head_dim": 128,
              "max_seq_len": 4096, "max_batch_size": 1}
)
# weights: 35 GB, kv: ~1.3 GB (KV cache still FP16), total: ~36 GB
# Still exceeds the RTX 4090's 24 GB.
# Even with INT4 weights + INT8 KV cache (~0.7 GB): ~36 GB. Still too large.
# RTX 4090 can only run 70B with 2-3 bit quantization (quality concerns).
⚠️ Warning

The 24 GB limit on the RTX 4090 is a hard constraint. It limits both the maximum model size and the maximum batch size for serving. A single H100-80GB can serve Llama-70B at INT4 with batch sizes up to 16, while an RTX 4090 cannot run 70B at any quantization level that preserves reasonable quality (needs at minimum 35 GB even at INT4).

Compute Throughput: The 3x Gap

Why Tensor Core TFLOPS Differ by 3x Despite Similar Core Counts

The H100 and RTX 4090 have nearly identical numbers of CUDA cores and tensor cores. The 3x TFLOPS gap comes from the tensor core architecture and clock speed:

H100 Tensor Core (4th gen Hopper):
  Instruction: wgmma (Warp Group MMA)
  Shape: m64 x nN x k16 (FP16), N up to 256, issued per warp group
  A warp group = 4 warps = 128 threads
  Throughput: 2,048 dense FP16 FMA ops per clock per SM

RTX 4090 Tensor Core (4th gen Ada):
  Instruction: mma (same warp-level form as Ampere)
  Shape: 16x8x16 (FP16) per warp
  Throughput: 512 dense FP16 FMA ops per clock per SM
  (Ada tensor cores are closer to Ampere's than to Hopper's)

Clock speed:
  H100: 1,830 MHz boost (lower, optimized for sustained throughput)
  RTX 4090: 2,520 MHz boost (higher, consumer gaming optimization)

Calculation (FLOPS = FMA ops x 2):
  H100: 132 SMs x 2,048 FMA/clock x 1,830 MHz x 2 = 989 TFLOPS
  RTX 4090: 128 SMs x 512 FMA/clock x 2,520 MHz x 2 = 330 TFLOPS

The 3x gap comes from:
  - Hopper SM does 4x the tensor FMA ops per clock of Ada's (4x)
  - Slightly more SMs (132 vs 128) (1.03x)
  - Lower clock (1,830 vs 2,520 MHz) (0.73x)
  - Net: 4 x 1.03 x 0.73 = 3.0x
  - Real-world gap: 2.5-3x in GEMM throughput (launch and memory
    overheads erode the peak ratio)
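The clock math can be checked in a few lines. The per-SM FMA rates below are the values implied by the published TFLOPS totals (derived here, not official per-SM specs):

```python
# Re-derive the headline TFLOPS figures from SM count, per-SM throughput,
# and boost clock. Per-SM FMA rates are inferred from datasheet totals.

def peak_tflops(sms, fma_per_clock_per_sm, clock_ghz):
    # 1 FMA = 2 FLOPs; result in TFLOPS
    return sms * fma_per_clock_per_sm * clock_ghz * 2 / 1e3

h100 = peak_tflops(132, 2048, 1.830)     # ~989 TFLOPS dense FP16
rtx4090 = peak_tflops(128, 512, 2.520)   # ~330 TFLOPS dense FP16
ratio = h100 / rtx4090                   # ~3.0x
```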

Real-World GEMM Benchmarks

# GEMM throughput comparison: H100 vs RTX 4090
# cuBLAS, FP16, various problem sizes

gemm_benchmarks = {
    # (M, N, K): (H100_TFLOPS, RTX4090_TFLOPS)
    (4096, 4096, 4096): (305, 120),     # Large square: H100 2.5x
    (8192, 8192, 8192): (380, 145),     # Very large: H100 2.6x
    (1, 4096, 4096):    (0.8, 0.35),    # GEMV (decode): H100 2.3x
    (32, 4096, 4096):   (22, 9.5),      # Small batch: H100 2.3x
    (512, 4096, 4096):  (285, 112),     # Medium batch: H100 2.5x
    (1, 14336, 4096):   (1.1, 0.52),    # Llama MLP GEMV: H100 2.1x
}

# Key observation: The GEMV (decode, M=1) ratio is only 2.1-2.3x,
# not 3x, because GEMV is memory-bandwidth-bound, not compute-bound.
# H100 bandwidth advantage: 3,350/1,008 = 3.3x
# But GEMV does not saturate bandwidth due to small M.
# Effective bandwidth ratio at M=1: ~2.0-2.5x
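The GEMV behavior follows from a roofline argument: at M=1, arithmetic intensity is about 1 FLOP per byte of weights read, far below the compute/bandwidth ratio of either GPU. A sketch of that calculation:

```python
# Roofline check: M=1 GEMV sits deep in the bandwidth-bound region on
# both GPUs, so its speedup tracks bandwidth, not tensor TFLOPS.

def gemv_intensity(n, k, dtype_bytes=2):
    flops = 2 * n * k                   # one FMA per weight element
    bytes_moved = n * k * dtype_bytes   # weight traffic dominates at M=1
    return flops / bytes_moved          # FLOP per byte

def ridge_point(tflops, bw_tbps):
    # Arithmetic intensity (FLOP/byte) at which a kernel becomes compute-bound
    return tflops / bw_tbps

ai = gemv_intensity(4096, 4096)          # 1.0 FLOP/byte
h100_ridge = ridge_point(990, 3.35)      # ~296 FLOP/byte
rtx4090_ridge = ridge_point(330, 1.008)  # ~327 FLOP/byte
# Both ridge points are near 300 FLOP/byte; GEMV at 1 FLOP/byte is
# ~300x below them, so memory bandwidth sets the speed on both cards.
```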

GEMM Throughput: H100 SXM vs RTX 4090 (cuBLAS FP16)

(TFLOPS, higher is better)
  M=4096 GEMM:    H100 305   vs  RTX 4090 120   (2.5x)
  M=1 GEMV:       H100 0.8   vs  RTX 4090 0.35  (2.3x)
  M=512 prefill:  H100 285   vs  RTX 4090 112   (2.5x)

The Interconnect Gap

This is where the datacenter GPU advantage is most dramatic:

H100 SXM interconnect:
  NVLink 4.0: 900 GB/s bidirectional (18 links x 50 GB/s each)
  NVSwitch: enables all-to-all NVLink between 8 GPUs in a node
  Inter-node: InfiniBand 400 Gb/s (50 GB/s)

RTX 4090 interconnect:
  PCIe Gen4 x16: 32 GB/s bidirectional (16 GB/s each direction)
  No NVLink (removed from consumer cards since RTX 30-series)
  No SLI/NVLink bridge for multi-GPU

Ratio: 900 / 32 = 28x bandwidth difference

Impact on tensor parallelism:
  Tensor parallel GEMM requires all-reduce after each layer.
  All-reduce of a 4096-element FP16 vector (8 KB):
  - NVLink: 8 KB / 450 GB/s = 0.018 us (effectively free)
  - PCIe: 8 KB / 16 GB/s = 0.5 us (still small)

  All-reduce of a 8192x8192 FP16 matrix (128 MB):
  - NVLink: 128 MB / 450 GB/s = 0.28 ms
  - PCIe: 128 MB / 16 GB/s = 8.0 ms (28x slower!)
# Tensor parallel efficiency comparison

def tp_efficiency(gemm_time_ms, comm_time_ms, num_gpus):
    """
    Efficiency of tensor parallelism.
    Each GPU does 1/num_gpus of the GEMM, then all-reduce.
    """
    per_gpu_gemm = gemm_time_ms / num_gpus
    total_step = per_gpu_gemm + comm_time_ms
    ideal_step = gemm_time_ms / num_gpus
    efficiency = ideal_step / total_step
    return efficiency

# Llama-70B, single layer GEMM, TP=2
# GEMM: 8192x8192 FP16, ~3.5 ms on single H100
# All-reduce: 128 MB

# H100 NVLink:
eff_h100 = tp_efficiency(3.5, 0.28, 2)
# per_gpu_gemm = 1.75, total = 1.75 + 0.28 = 2.03, eff = 86%

# RTX 4090 PCIe:
# GEMM on 4090: ~8.5 ms for same size
eff_4090 = tp_efficiency(8.5, 8.0, 2)
# per_gpu_gemm = 4.25, total = 4.25 + 8.0 = 12.25, eff = 35%

# RTX 4090 tensor parallelism is catastrophically inefficient
# because PCIe communication time exceeds compute time
📊

Multi-GPU Scaling Efficiency (Llama-70B Inference, TP=2)

Configuration    Per-GPU GEMM    Communication    Total Step    Efficiency
2x H100 NVLink 1.75 ms 0.28 ms 2.03 ms 86%
2x A100 NVLink 2.80 ms 0.42 ms 3.22 ms 87%
2x RTX 4090 PCIe 4.25 ms 8.0 ms 12.25 ms 35%
2x RTX 3090 NVLink 5.10 ms 1.2 ms 6.30 ms 81%
ℹ️ Note

The RTX 3090 was the last consumer GPU with NVLink support (NVLink 3.0, 112.5 GB/s bidirectional). Its multi-GPU scaling is dramatically better than the RTX 4090’s PCIe-only interconnect. NVIDIA removed NVLink from consumer cards specifically to segment the datacenter market.

Reliability and Uptime

ECC Memory

H100 HBM3 with ECC:
  - Single-bit errors detected and corrected automatically (SECDED)
  - Double-bit errors detected (causes GPU reset, not silent corruption)
  - Error rate: ~1 correctable error per GB per year (HBM3 at datacenter temps)
  - For 80 GB: ~80 correctable errors per year (all handled transparently)
  - Uncorrectable error rate: ~1 per 100 GPU-years

RTX 4090 GDDR6X without ECC:
  - No error detection or correction
  - Bit-flip rate: ~1 per GB per month at normal temperatures
  - For 24 GB: ~24 silent bit-flips per month
  - These corrupt model weights, KV cache, or activations silently
  - Impact: occasional garbage output, NaN propagation, wrong results

For a training run:
  - 1000 H100s for 3 months: ~20,000 correctable errors (all handled)
  - 1000 RTX 4090s for 3 months: ~72,000 silent corruptions
  - Training would produce incorrect results on consumer GPUs

For inference:
  - H100: bit-flips are invisible to the user (ECC handles them)
  - RTX 4090: occasional corrupted output, recoverable by re-running
  - At low request volume: acceptable (just retry)
  - At high request volume (serving): unacceptable SLA risk
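The fleet-scale numbers above are simple expected-value arithmetic on the quoted per-GB rates (which are order-of-magnitude illustrations, not measured vendor figures):

```python
# Expected silent bit-flips over a deployment, using the illustrative
# per-GB rates quoted above (not measured vendor figures).

def expected_flips(capacity_gb, rate_per_gb_per_month, gpus, months):
    """Expected number of bit-flips across a fleet over a time window."""
    return capacity_gb * rate_per_gb_per_month * gpus * months

# 1000 RTX 4090s (24 GB, no ECC, ~1 flip/GB/month) for 3 months:
fleet_4090 = expected_flips(24, 1.0, 1000, 3)   # 72,000 silent corruptions
# A single card over one month:
single_4090 = expected_flips(24, 1.0, 1, 1)     # ~24 silent flips
```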

Thermal Design and Sustained Performance

H100 SXM (liquid-cooled):
  TDP: 700W
  Thermal solution: cold plate with liquid cooling loop
  Sustained performance: 100% of rated TFLOPS indefinitely
  Operating temperature: 45-80C junction (controlled by datacenter cooling)
  No thermal throttling under any workload

RTX 4090 (air-cooled):
  TDP: 450W
  Thermal solution: 3-slot air cooler
  Sustained performance: 85-95% of boost clock under heavy compute
  Operating temperature: 65-85C junction (depends on case airflow)
  Thermal throttling: reduces clock speed above ~83C junction
  Power limit: may reduce clock under sustained 450W loads

Impact on sustained throughput:
  H100: rated 990 TFLOPS, achieves 950-980 TFLOPS sustained (96-99%)
  RTX 4090: rated 330 TFLOPS, achieves 280-310 TFLOPS sustained (85-94%)
  The gap increases under sustained workloads (training, serving)

Cost-Performance Analysis

Cost Per Token

# Cost-performance analysis for LLM inference

def cost_per_token_comparison():
    # Hardware costs (purchase price, not cloud)
    h100_cost = 30000  # USD
    rtx4090_cost = 1800  # USD

    # Operating costs (power + cooling, per year)
    h100_power_cost = 700 * 8760 * 0.08 / 1000  # 700W * hours/year * $/kWh
    # = $490/year
    rtx4090_power_cost = 350 * 8760 * 0.08 / 1000  # 350W average (throttled)
    # = $245/year

    # 3-year TCO
    h100_tco = h100_cost + 3 * h100_power_cost   # $31,470
    rtx4090_tco = rtx4090_cost + 3 * rtx4090_power_cost  # $2,535

    # Throughput for Llama-2-7B FP16 decode (BS=1)
    h100_tps = 95     # tokens/sec
    rtx4090_tps = 42  # tokens/sec

    # Tokens over 3 years (24/7 operation)
    h100_total_tokens = h100_tps * 86400 * 365 * 3
    rtx4090_total_tokens = rtx4090_tps * 86400 * 365 * 3

    # Cost per million tokens
    h100_cpm = h100_tco / (h100_total_tokens / 1e6)
    rtx4090_cpm = rtx4090_tco / (rtx4090_total_tokens / 1e6)

    return {
        "H100": {"tco_3yr": h100_tco, "cpm": h100_cpm, "tps": h100_tps},
        "RTX4090": {"tco_3yr": rtx4090_tco, "cpm": rtx4090_cpm, "tps": rtx4090_tps},
    }

# Results (Llama-2-7B FP16, BS=1):
# H100:    TCO=$31,470, throughput=95 tok/s, cost=$3.50/M tokens
# RTX4090: TCO=$2,535,  throughput=42 tok/s, cost=$0.64/M tokens
# RTX 4090 is 5.5x cheaper per token for this specific workload!

# But for Llama-2-7B FP16, BS=32 (batch serving):
# H100:    throughput=2,800 tok/s, cost=$0.12/M tokens
# RTX4090: effective throughput ~32 tok/s, cost=$0.84/M tokens
#          (the BS=32 KV cache does not fit in 24 GB, so the 4090
#           falls back to much smaller batches)
# H100 is 7x cheaper per token at batch serving!

# For Llama-2-70B (any precision):
# H100:    runs (1 GPU at INT4), throughput=98 tok/s
# RTX4090: CANNOT RUN (insufficient memory)
# H100 wins by default.
📊

Cost Per Million Tokens (3-Year TCO, 24/7 Operation)

Model + Config    H100 Cost/M Tokens    RTX 4090 Cost/M Tokens    Winner
Llama-7B FP16, BS=1 $3.50 $0.64 RTX 4090 (5.5x cheaper)
Llama-7B INT4, BS=1 $1.80 $0.38 RTX 4090 (4.7x cheaper)
Llama-7B FP16, BS=32 $0.12 $0.84 H100 (7x cheaper)
Llama-13B FP16, BS=1 $5.20 $1.50 RTX 4090 (3.5x cheaper)
Llama-70B INT4, BS=1 $3.20 N/A (OOM) H100 (only option)
Llama-70B INT4, BS=16 $0.20 N/A (OOM) H100 (only option)

MIG: Multi-Instance GPU

The H100 supports MIG (Multi-Instance GPU), which partitions a single GPU into up to 7 isolated instances, each with guaranteed memory and compute:

H100 MIG configurations:
  1g.10gb: 1/7 of GPU (14.3% compute, 10 GB memory) x 7 instances
  2g.20gb: 2/7 of GPU (28.6% compute, 20 GB memory) x 3 instances
  3g.40gb: 3/7 of GPU (42.9% compute, 40 GB memory) x 2 instances
  7g.80gb: Full GPU (100% compute, 80 GB memory) x 1 instance

Use case: multi-tenant serving
  - Run 7 different small models on one H100
  - Each instance is isolated (no interference, no memory leaks)
  - Each instance has its own error counters, performance counters
  - Kubernetes/Docker can assign MIG instances to containers

  Example: 7 instances of Llama-7B INT4 (each needs ~4 GB weights + 6 GB KV)
  Each instance: 10 GB memory, ~140 TFLOPS FP16
  Total: 7 models running in isolation on one $30K GPU
  Effective cost per model: $4,300

RTX 4090: No MIG support
  - Must run one model per GPU
  - No isolation between workloads
  - Cannot share a GPU safely between users
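Picking a MIG profile is mostly a memory-fit question. A sketch using the H100 profile table above (the `H100_MIG_PROFILES` dict and `smallest_profile` helper are illustrative names, not an NVIDIA API):

```python
# Which MIG profile fits a given model? Memory per instance is usually
# the binding constraint. Profile table from the H100 section above.

H100_MIG_PROFILES = {        # name: (memory GB, instances per GPU)
    "1g.10gb": (10, 7),
    "2g.20gb": (20, 3),
    "3g.40gb": (40, 2),
    "7g.80gb": (80, 1),
}

def smallest_profile(model_gb, kv_gb):
    """Smallest MIG profile that holds weights + KV cache (dict is ordered small-to-large)."""
    need = model_gb + kv_gb
    for name, (mem, count) in H100_MIG_PROFILES.items():
        if mem >= need:
            return name, count
    return None, 0

# Llama-7B INT4: ~4 GB weights + ~6 GB KV cache -> fits a 1g.10gb slice
profile, copies = smallest_profile(4, 6)       # ("1g.10gb", 7)
cost_per_replica = 30000 / copies              # ~$4,300 of hardware per isolated replica
```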

When to Use Each GPU

# Decision framework

def choose_gpu(workload):
    # Single-user inference on models <= 13B
    if (workload["model_size_B"] <= 13
        and workload["batch_size"] <= 4
        and workload["users"] == 1):
        return "RTX 4090"
        # Cheaper per token, sufficient memory, no multi-GPU needed

    # Batch serving (multiple concurrent users)
    if workload["batch_size"] >= 16 or workload["users"] >= 8:
        return "H100"
        # Higher batch throughput, more KV cache memory, MIG for isolation

    # Large models (>= 30B parameters)
    if workload["model_size_B"] >= 30:
        if workload["quantization_bits"] >= 4:
            return "H100"  # Only option with enough memory
        else:
            return "H100"  # Even at 2-bit, 70B = 18 GB, tight on 4090

    # Training (any scale)
    if workload["type"] == "training":
        if workload["multi_gpu"]:
            return "H100"  # NVLink required for efficient TP/PP/DP
        else:
            return "RTX 4090 or H100"  # Single-GPU fine-tuning: 4090 is fine

    # Development and experimentation
    if workload["type"] == "development":
        return "RTX 4090"  # Cost-effective for iteration

    # Edge / embedded inference
    if workload["power_budget_watts"] < 100:
        return "Neither"  # Use Jetson, L4, or T4

    return "H100"  # Default to datacenter for production

Where Each GPU Wins (Performance Per Dollar, Higher is Better)

(relative perf/$, higher is better)
  7B BS=1 decode:   RTX 4090 territory, ~5.5x
  13B BS=1 decode:  RTX 4090 territory, ~3.5x
  7B BS=32 serve:   H100 territory, ~7x
  70B BS=1 decode:  H100, only option
  70B BS=16 serve:  H100 dominates, ~8.5x

The RTX 4090 Cluster Fallacy

A common argument: “Buy 16 RTX 4090s for the price of one H100 and get more total FLOPS.” This ignores three critical factors:

1. Interconnect bottleneck:
   16x RTX 4090s connected via PCIe: 32 GB/s per GPU
   1x H100 SXM: 900 GB/s NVLink
   For any workload requiring multi-GPU communication (which includes
   ALL models that don't fit on one 24 GB GPU), the RTX 4090 cluster
   spends more time communicating than computing.

2. Memory fragmentation:
   16x RTX 4090: 16 x 24 GB = 384 GB total, but distributed
   1x H100:      80 GB contiguous
   A 70B FP16 model (140 GB) requires tensor parallelism.
   On RTX 4090s: need 6 GPUs (6x24=144 GB), but TP over PCIe is 35% efficient
   On H100: need 2 GPUs (2x80=160 GB), TP over NVLink is 86% efficient
   Effective throughput: 2x H100 beats 6x RTX 4090

3. Reliability and operations:
   16 consumer GPUs = 16 potential failure points without ECC
   Fan failures, thermal throttling, driver instability
   No MIG, no hardware monitoring designed for 24/7 operation
   Operational overhead far exceeds hardware savings
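Points 1 and 2 can be combined into one number: effective throughput, i.e. raw per-GPU compute discounted by tensor-parallel efficiency. Applying the 2-GPU efficiency figures from the interconnect section as an optimistic approximation (efficiency degrades further at 6 GPUs):

```python
# Effective tensor-parallel throughput = raw TFLOPS x TP efficiency.
# Efficiencies (86% NVLink, 35% PCIe) are the 2-GPU figures from the
# interconnect section, applied optimistically to the 6-GPU case.

def effective_throughput(per_gpu_tflops, num_gpus, tp_efficiency):
    return per_gpu_tflops * num_gpus * tp_efficiency

h100_pair = effective_throughput(990, 2, 0.86)    # ~1,703 effective TFLOPS
rtx4090_six = effective_throughput(330, 6, 0.35)  # ~693 effective TFLOPS
# Raw FLOPS are identical (2 x 990 = 6 x 330 = 1,980 TFLOPS), yet the
# PCIe-bound cluster delivers well under half the usable throughput.
```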
🚨 Danger

Building an “RTX 4090 cluster” for production AI serving is a false economy. The lack of NVLink makes multi-GPU workloads 2-4x slower than equivalent H100 configurations, ECC absence risks silent data corruption, and consumer GPUs are not designed for 24/7 datacenter operation. RTX 4090s are excellent for single-GPU workloads (development, personal inference, small model serving) but are not a viable H100 replacement for production multi-GPU deployments.

Summary

The H100 and RTX 4090 serve fundamentally different markets. The RTX 4090 is 3-5x more cost-effective per token for single-GPU inference on models that fit in 24 GB (up to 13B at FP16 or 30B at INT4). The H100 is necessary for: models that require more than 24 GB, batch serving with more than 4 concurrent requests, multi-GPU tensor parallelism (NVLink vs PCIe is a 28x bandwidth gap), production serving requiring ECC and reliability, and multi-tenant deployment via MIG. The choice is not which GPU is “better” but which workload you are running. For development, personal use, and small model inference, the RTX 4090 is the rational choice. For production serving, large model deployment, and training, the H100’s premium is justified by capabilities that the RTX 4090 physically cannot provide.