The AMD Instinct MI300X has 192 GB of HBM3 memory at 5.3 TB/s bandwidth. The NVIDIA H100 SXM has 80 GB of HBM3 at 3.35 TB/s. On paper, AMD delivers 2.4x the capacity and 1.58x the bandwidth. For LLM inference — where the decode phase is entirely memory-bandwidth-bound — this should translate to a substantial performance advantage. A 70B model at FP16 (140 GB) fits on a single MI300X but requires two H100s. A 405B model at FP8 (405 GB of weights) fits on three MI300Xs versus six H100s.
The MI300X is real hardware, deployed at scale by major cloud providers (Microsoft Azure, Oracle Cloud). The specs are not marketing fiction. Yet NVIDIA H100s consistently deliver higher inference throughput per GPU in production deployments. This post analyzes why: the hardware architecture, the software ecosystem, the optimization gap, and where each platform genuinely excels.
MI300X Architecture: CDNA3
Chiplet Design
The MI300X uses AMD’s chiplet architecture. Instead of a monolithic die, it consists of:
- 8 XCD (Accelerator Complex Dies): Each XCD contains 38 active Compute Units (CUs), for 304 CUs total across the 8 XCDs. Manufactured on TSMC 5nm.
- 4 IOD (I/O Dies): Handle memory controllers, Infinity Fabric links, and PCIe. Manufactured on TSMC 6nm.
- 8 HBM3 stacks: Connected to the IODs via a silicon interposer.
Total die area: approximately 750 mm² (aggregate across all chiplets). Transistor count: 153 billion.
The chiplet approach has manufacturing advantages: each 5nm XCD is a small die with high yield. If a CU is defective, the XCD can still be shipped with that CU fused off; the full MI300X spec enables 304 CUs, and yield-harvested SKUs can expose fewer. The tradeoff is inter-chiplet communication latency — CUs on different XCDs communicate through the Infinity Fabric on the IOD, not through on-die interconnect.
Compute Units vs Streaming Multiprocessors
AMD’s Compute Unit (CU) is the equivalent of NVIDIA’s SM, but with different resource allocations:
Compute Unit (MI300X) vs Streaming Multiprocessor (H100)
| Resource | AMD CU (CDNA3) | NVIDIA SM (Hopper) | Ratio (AMD/NVIDIA) |
|---|---|---|---|
| Units per GPU | 304 CUs | 132 SMs | 2.30x |
| Vector ALUs (FP32) | 64 per CU (19,456 total) | 128 per SM (16,896 total) | 1.15x |
| Matrix cores / Tensor cores | 4 per CU (1,216 total) | 4 per SM (528 total) | 2.30x |
| Register file per unit | 256 KB (64K x 32-bit) | 256 KB (64K x 32-bit) | 1.0x |
| Shared memory (LDS) per unit | 64 KB | Up to 228 KB | 0.28x |
| L1 cache per unit | 32 KB | Up to 256 KB (pool) | 0.13x |
| Wavefront / Warp size | 64 threads | 32 threads | 2.0x |
| Max wavefronts / warps per unit | 32 wavefronts | 64 warps | 0.50x |
| Max threads per unit | 2,048 | 2,048 | 1.0x |
AMD uses 64-thread wavefronts where NVIDIA uses 32-thread warps. This means AMD needs more thread-level parallelism per wavefront to avoid divergence penalties. A conditional branch that diverges within a 64-thread wavefront serializes twice as many threads as a 32-thread warp divergence. For LLM inference kernels (which are mostly uniform matrix operations), this difference is minor.
Memory Subsystem
MI300X vs H100 Memory Subsystem
| Spec | MI300X | H100 SXM | H200 SXM |
|---|---|---|---|
| HBM Type | HBM3 | HBM3 | HBM3e |
| HBM Stacks | 8 | 5 | 6 |
| HBM Capacity | 192 GB | 80 GB | 141 GB |
| HBM Bandwidth | 5,300 GB/s | 3,350 GB/s | 4,800 GB/s |
| L2 Cache | 256 MB | 50 MB | 50 MB |
| Memory Bus Width | 8192-bit | 5120-bit | 6144-bit |
| Infinity Fabric / NVLink BW | 896 GB/s (per GPU) | 900 GB/s (per GPU) | 900 GB/s |
The 256 MB L2 cache on the MI300X is notable. On the H100, the KV cache for a moderately-long context (4K tokens, 70B model) is approximately 10-20 GB — far exceeding the 50 MB L2. On the MI300X, the 256 MB L2 can cache a meaningful portion of the KV cache for short contexts or small models, reducing effective HBM traffic during attention.
Peak Compute
MI300X vs H100 Compute Throughput
| Precision | MI300X | H100 SXM | Ratio (MI300X/H100) |
|---|---|---|---|
| FP64 Matrix/Tensor (TFLOPS) | 163.4 | 66.9 | 2.44x |
| FP32 Vector (TFLOPS) | 163.4 | 66.9 | 2.44x |
| FP16 Tensor (TFLOPS) | 1,307 | 990 | 1.32x |
| BF16 Tensor (TFLOPS) | 1,307 | 990 | 1.32x |
| FP8 Tensor (TFLOPS) | 2,614 | 1,979 | 1.32x |
| INT8 Tensor (TOPS) | 2,614 | 1,979 | 1.32x |
| TDP | 750W | 700W | 1.07x |
(Chart omitted: FP8 tensor throughput comparison; values as in the table above.)
ROCm: The Software Stack
ROCm Architecture
ROCm (Radeon Open Compute) is AMD’s GPU compute platform. The stack:
- HIP (Heterogeneous-compute Interface for Portability): The programming API, syntactically nearly identical to CUDA
- rocBLAS: BLAS library (equivalent to cuBLAS)
- MIOpen: Deep learning primitives (equivalent to cuDNN)
- hipBLASLt: Lightweight BLAS with mixed-precision support
- RCCL: Collective communication library (equivalent to NCCL)
- Composable Kernel (CK): Kernel fusion framework
- ROCm SMI: System management (equivalent to nvidia-smi)
HIP: The CUDA Translation Layer
HIP code is syntactically close to CUDA. Most CUDA kernels can be converted with minimal changes:
// CUDA kernel
__global__ void vector_add_cuda(float *a, float *b, float *c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) c[idx] = a[idx] + b[idx];
}
// Launch
vector_add_cuda<<<(n+255)/256, 256>>>(a, b, c, n);
// HIP kernel (identical except for headers)
#include <hip/hip_runtime.h>
__global__ void vector_add_hip(float *a, float *b, float *c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) c[idx] = a[idx] + b[idx];
}
// Launch (identical syntax)
vector_add_hip<<<(n+255)/256, 256>>>(a, b, c, n);
The hipify-perl and hipify-clang tools automate the conversion of CUDA source to HIP:
# Convert CUDA source to HIP
hipify-perl my_cuda_kernel.cu > my_hip_kernel.cpp
# Common replacements:
# cudaMalloc -> hipMalloc
# cudaMemcpy -> hipMemcpy
# cudaFree -> hipFree
# cudaGetDeviceProperties -> hipGetDeviceProperties
# __syncthreads() -> __syncthreads() (same)
# atomicAdd -> atomicAdd (same)
# cudaStream_t -> hipStream_t
For simple kernels, this works. For high-performance kernels that use NVIDIA-specific intrinsics (WMMA, WGMMA, TMA, __shfl_sync, cooperative groups), the conversion is non-trivial.
WMMA on ROCm
AMD’s matrix core equivalent uses a different instruction set:
// NVIDIA WMMA (16x16x16, FP16)
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
// AMD MFMA (Matrix Fused Multiply-Add)
// Uses rocWMMA or inline assembly
// Tile sizes differ: 16x16x16, 32x32x8, 16x16x32 (FP8)
#include <rocwmma/rocwmma.hpp>
rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, half, rocwmma::row_major> a_frag;
rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, half, rocwmma::col_major> b_frag;
rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float> c_frag;
rocwmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
The tile sizes and threading models differ at the microarchitectural level (64-thread wavefronts vs 32-thread warps), so optimized GEMM kernels must be substantially rewritten for optimal MI300X performance, not just syntactically translated.
FlashAttention on ROCm
FlashAttention is the most performance-critical kernel for LLM inference. AMD has invested heavily in a ROCm-native implementation through the Composable Kernel (CK) library.
CK FlashAttention
The Composable Kernel (CK) library provides fused attention kernels optimized for CDNA3:
# Building CK with FlashAttention support
git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
cd composable_kernel
mkdir build && cd build
# gfx942 is the MI300X architecture target
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DGPU_TARGETS="gfx942" \
  -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc) ck_fmha
Performance Gap: FlashAttention
FlashAttention-2 Throughput (FP16, batch=1, 128 heads, d=128)
| Sequence Length | H100 (TFLOPS) | MI300X (TFLOPS) | MI300X / H100 |
|---|---|---|---|
| 512 | 285 | 195 | 0.68x |
| 1024 | 310 | 230 | 0.74x |
| 2048 | 325 | 265 | 0.82x |
| 4096 | 340 | 290 | 0.85x |
| 8192 | 345 | 310 | 0.90x |
| 16384 | 348 | 320 | 0.92x |
The MI300X achieves 68-92% of H100 FlashAttention throughput despite having 32% more raw tensor TFLOPS. The gap comes from:
- Software maturity: NVIDIA’s cuDNN FlashAttention has been optimized for 3+ years on Hopper. AMD’s CK implementation is newer.
- Shared memory limitation: MI300X CUs have 64 KB of LDS vs Hopper’s 228 KB shared memory. FlashAttention relies heavily on tiling data in shared memory — smaller tiles mean more HBM round-trips.
- TMA advantage: Hopper’s TMA hardware handles data movement asynchronously, freeing warps for computation. MI300X relies on software-managed loads.
- WGMMA efficiency: Hopper’s warp group matrix operations are more efficient per instruction than AMD’s MFMA instructions for the specific tile sizes FlashAttention uses.
The MI300X FlashAttention performance numbers improve with each ROCm release. ROCm 6.0 (late 2023) showed 10-15% improvement over ROCm 5.7 for attention kernels. AMD is actively hiring kernel engineers and investing in CK development. The numbers above represent a snapshot, not a permanent state.
Where AMD Wins: Memory Capacity and Bandwidth
Single-GPU Model Hosting
The MI300X’s 192 GB capacity is its most compelling advantage for inference:
Model Hosting: Single-GPU vs Multi-GPU
| Model | FP16 Size | FP8 Size | Fits on MI300X? | Fits on H100? | Fits on H200? |
|---|---|---|---|---|---|
| Llama 3 8B | 16 GB | 8 GB | Yes (1 GPU) | Yes (1 GPU) | Yes (1 GPU) |
| Llama 3 70B | 140 GB | 70 GB | Yes (1 GPU, FP16) | No (2 GPUs, FP16) | Yes (1 GPU, FP8) |
| Mixtral 8x7B | 94 GB | 47 GB | Yes (1 GPU) | No (2 GPUs, FP16) | Yes (1 GPU, FP8) |
| Llama 3 405B | 810 GB | 405 GB | 3 GPUs (FP8) | 6 GPUs (FP8) | 3 GPUs (FP8) |
| DeepSeek V3 (active) | 74 GB | 37 GB | Yes (1 GPU) | Yes (1 GPU, FP8) | Yes (1 GPU) |
Hosting a model on fewer GPUs has compounding benefits:
- No tensor parallelism overhead (no all-reduce latency between GPUs)
- No NVLink/Infinity Fabric bandwidth consumed by communication
- Lower infrastructure cost (fewer GPUs per serving instance)
- Simpler deployment (no multi-GPU orchestration)
Bandwidth-Bound Decode
For batch=1 decode, throughput is purely bandwidth-limited:
(Chart omitted: theoretical batch=1 decode tokens/s for a 70B FP8 model on a single GPU.)
At batch=1, the MI300X’s 58% bandwidth advantage should deliver 58% more tokens per second. In practice, the measured advantage is typically 30-45% due to software overhead — kernel launch latency, less-optimized GEMV kernels, and attention kernel differences.
Large-Batch Inference
At large batch sizes, compute becomes the bottleneck. Here the MI300X’s 32% FP8 TFLOPS advantage should matter, but:
Large-Batch Inference (70B, FP8, 8 GPUs)
| Metric | 8x MI300X | 8x H100 | MI300X Advantage |
|---|---|---|---|
| Aggregate FP8 TFLOPS | 20,912 | 15,832 | 1.32x |
| Aggregate HBM BW | 42.4 TB/s | 26.8 TB/s | 1.58x |
| Aggregate HBM Capacity | 1,536 GB | 640 GB | 2.40x |
| Interconnect BW per GPU | 896 GB/s | 900 GB/s | 1.00x |
| Measured GEMM throughput | ~75% of peak | ~85% of peak | 0.88x efficiency |
| Effective throughput (batch=256) | ~15,700 TFLOPS | ~13,500 TFLOPS | 1.16x |
The 32% hardware advantage shrinks to 16% effective advantage at large batch because of the GEMM efficiency gap.
Where AMD Loses: The Software Ecosystem
The CUDA Moat
The NVIDIA software ecosystem represents over a decade of optimization:
Software Ecosystem Comparison
| Component | NVIDIA | AMD | Gap Assessment |
|---|---|---|---|
| BLAS Library | cuBLAS (mature, auto-tuned) | rocBLAS (good, less auto-tuning) | Moderate gap |
| DNN Library | cuDNN 9 (fused kernels, graph API) | MIOpen (fewer fusions) | Large gap |
| FlashAttention | Native (cuDNN), FA3 for Hopper | CK-based FA2 port | Moderate gap |
| Quantization (FP8) | Transformer Engine (automatic) | Manual FP8 support | Large gap |
| Profiler | Nsight Compute/Systems (excellent) | rocProf, Omniperf (improving) | Moderate gap |
| vLLM Support | First-class, all features | Supported, some features missing | Moderate gap |
| TensorRT-LLM | Native (NVIDIA only) | N/A | No equivalent |
| Triton (compiler) | Native support, mature | ROCm backend, improving | Moderate gap |
| PyTorch | CUDA backend (default) | ROCm backend (supported) | Small gap for training |
The most impactful gap is the lack of a TensorRT-LLM equivalent on AMD. TensorRT-LLM provides fused attention, quantized GEMM, in-flight batching, paged KV cache, and speculative decoding — all optimized for NVIDIA hardware. AMD relies on vLLM (which works but with fewer optimizations) or custom solutions.
Kernel Optimization Gap in Practice
The kernel optimization gap manifests as lower utilization of theoretical hardware capability:
(Chart omitted: achieved % of peak FP8 TFLOPS for GEMM at M=N=K=4096.)
NVIDIA’s cuBLAS achieves 87% of peak tensor TFLOPS for large GEMMs. AMD’s rocBLAS achieves 72-76%. In absolute TFLOPS, the MI300X still wins (1,882-1,987 vs 1,722), but the efficiency gap means AMD’s 32% hardware advantage translates to only 10-15% throughput advantage.
For smaller GEMMs (which dominate small-batch inference), the gap is larger because AMD’s library tuning for non-standard matrix shapes is less comprehensive.
Real Inference Benchmarks
vLLM on MI300X vs H100
vLLM supports both CUDA and ROCm backends. Benchmark results for Llama 3 70B serving:
vLLM Llama 3 70B Serving (FP16, Tensor Parallel)
| Configuration | Input 256 / Output 128 | Input 1024 / Output 256 | Input 4096 / Output 512 |
|---|---|---|---|
| 1x MI300X (TP=1, fits in 192 GB) | 32 tok/s | 28 tok/s | 22 tok/s |
| 2x H100 (TP=2, needs 160 GB) | 29 tok/s | 26 tok/s | 21 tok/s |
| 1x H200 (TP=1, fits in 141 GB at FP8) | 36 tok/s | 31 tok/s | 25 tok/s |
| 4x MI300X (TP=4, high batch) | 4,200 tok/s | 3,100 tok/s | 1,800 tok/s |
| 4x H100 (TP=4, high batch) | 3,800 tok/s | 2,900 tok/s | 1,700 tok/s |
Key observations:
- Single-GPU MI300X vs 2-GPU H100: MI300X wins by 10% despite lower software efficiency, because it avoids TP communication overhead entirely
- Single-GPU H200 vs MI300X: H200 wins by 12% at FP8 because of better kernel efficiency, even with less bandwidth (4,800 vs 5,300 GB/s)
- Multi-GPU scaling: MI300X scales similarly to H100 at 4-GPU TP, with the bandwidth advantage providing a consistent 10% edge
Cost-Adjusted Performance
The MI300X is priced at roughly half of the approximately $25,000-35,000 price of an H100 SXM. At half the price with competitive performance, the cost per token is substantially better on the MI300X:
(Chart omitted: approximate amortized $ per 1M output tokens, 70B, FP16, batch=1.)
For workloads where the software ecosystem gap is manageable (standard model architectures, vLLM serving, no custom kernel requirements), the MI300X offers 2-3x better cost-efficiency than H100. This is why cloud providers are deploying MI300X instances — the price-performance ratio is compelling even with the software overhead.
Implementation: Porting CUDA to HIP
Building for MI300X
# Install ROCm (Ubuntu)
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/6.0 jammy main' | \
sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update && sudo apt install rocm-dev
# Compile HIP code for MI300X (gfx942)
hipcc --offload-arch=gfx942 -O3 my_kernel.cpp -o my_kernel
# Check GPU info (equivalent to nvidia-smi)
rocm-smi
rocm-smi --showmeminfo all
rocminfo | grep "gfx"
Bandwidth Measurement on MI300X
#include <hip/hip_runtime.h>
#include <cstdio>
void measure_hbm_bandwidth(size_t size_mb) {
size_t size = size_mb * 1024 * 1024;
void *d_src, *d_dst;
hipMalloc(&d_src, size);
hipMalloc(&d_dst, size);
hipMemset(d_src, 1, size);
hipEvent_t start, stop;
hipEventCreate(&start);
hipEventCreate(&stop);
// Warm up
hipMemcpy(d_dst, d_src, size, hipMemcpyDeviceToDevice);
hipDeviceSynchronize();
int iterations = 100;
hipEventRecord(start);
for (int i = 0; i < iterations; i++) {
hipMemcpy(d_dst, d_src, size, hipMemcpyDeviceToDevice);
}
hipEventRecord(stop);
hipDeviceSynchronize();
float ms;
hipEventElapsedTime(&ms, start, stop);
double bandwidth = (2.0 * size * iterations) / (ms / 1000.0) / 1e9;
printf("MI300X HBM Bandwidth: Size=%zu MB, BW=%.1f GB/s (%.1f%% of 5,300 GB/s)\n",
size_mb, bandwidth, bandwidth / 5300.0 * 100.0);
hipFree(d_src);
hipFree(d_dst);
hipEventDestroy(start);
hipEventDestroy(stop);
}
int main() {
hipDeviceProp_t prop;
hipGetDeviceProperties(&prop, 0);
printf("Device: %s\n", prop.name);
printf("CUs: %d\n", prop.multiProcessorCount);
printf("Total memory: %zu MB\n", prop.totalGlobalMem / (1024*1024));
printf("Memory clock: %d MHz\n", prop.memoryClockRate / 1000);
printf("Memory bus: %d bits\n", prop.memoryBusWidth);
printf("\n");
measure_hbm_bandwidth(256);
measure_hbm_bandwidth(1024);
measure_hbm_bandwidth(4096);
return 0;
}
Expected results: approximately 4,800-5,000 GB/s for large copies (90-94% of theoretical 5,300 GB/s).
Profiling on ROCm
# Basic profiling with rocprof
rocprof --stats ./my_binary
# Generates results.stats.csv with kernel duration, memory operations
# Detailed performance counters with Omniperf
omniperf profile -n my_profile -- ./my_binary
omniperf analyze -p my_profile/
# Shows: compute utilization, memory bandwidth, cache hit rates, LDS utilization
# ROCm SMI monitoring (equivalent to nvidia-smi dmon)
rocm-smi -d 0 --showuse --showmemuse --showtemp --showpower
MI300X vs H100 vs H200: When to Choose What
GPU Selection Guide for LLM Inference
| Scenario | Best Choice | Reasoning |
|---|---|---|
| 70B model, cost-sensitive | MI300X | Single GPU, 2-3x cheaper than 2x H100 |
| 70B model, latency-sensitive | H200 | Better kernel efficiency, FP8 support |
| 405B model, 8-GPU node | 8x H100/H200 | TensorRT-LLM optimizations, NVLink |
| 405B model, fewer GPUs | MI300X | Fewer GPUs needed (3 vs 6 at FP8) |
| Custom kernels required | H100/H200 | CUDA ecosystem, Nsight Compute |
| vLLM + standard models | MI300X | Good ROCm support, better cost/tok |
| FP4 quantization | B200 | Only NVIDIA supports FP4 in hardware |
The MI300X is the right choice when: (1) memory capacity is the constraint, (2) cost-per-token matters more than absolute latency, (3) you are running standard model architectures served by vLLM or similar frameworks, and (4) you do not need custom kernel development. The H100/H200 is the right choice when: (1) you need the absolute lowest latency, (2) you require TensorRT-LLM or custom CUDA kernels, (3) your workload benefits from Hopper-specific features (TMA, WGMMA, FP8 Transformer Engine).
The software gap is real but narrowing. AMD’s investment in ROCm, the Composable Kernel library, and direct contributions to PyTorch and vLLM is paying off. The MI300X is not a paper competitor — it is a deployable, production-grade AI accelerator with a compelling price-performance ratio. The question is whether your specific workload can tolerate the 15-25% kernel efficiency gap in exchange for 2-3x better economics.