Part 5 of 30 in the series: GPU Hardware & AI Accelerators

The AMD Instinct MI300X has 192 GB of HBM3 memory at 5.3 TB/s bandwidth. The NVIDIA H100 SXM has 80 GB of HBM3 at 3.35 TB/s. On paper, AMD delivers 2.4x the capacity and 1.58x the bandwidth. For LLM inference — where the decode phase is entirely memory-bandwidth-bound — this should translate to a substantial performance advantage. A 70B model at FP16 (140 GB) fits on a single MI300X but requires two H100s. A 405B model at FP8 (405 GB) fits on three MI300Xs versus six H100s.

The MI300X is real hardware, deployed at scale by major cloud providers (Microsoft Azure, Oracle Cloud). The specs are not marketing fiction. Yet NVIDIA H100s consistently deliver higher inference throughput per GPU in production deployments. This post analyzes why: the hardware architecture, the software ecosystem, the optimization gap, and where each platform genuinely excels.

MI300X Architecture: CDNA3

Chiplet Design

The MI300X uses AMD’s chiplet architecture. Instead of a monolithic die, it consists of:

  • 8 XCD (Accelerator Complex Dies): Each XCD contains 40 Compute Units (CUs), of which 38 are enabled, for 304 active CUs across the 8 XCDs. Manufactured on TSMC 5nm.
  • 4 IOD (I/O Dies): Handle memory controllers, Infinity Fabric links, and PCIe. Manufactured on TSMC 6nm.
  • 8 HBM3 stacks: Connected to the IODs via a silicon interposer.

Total die area: approximately 750 mm² (aggregate across all chiplets). Transistor count: 153 billion.

The chiplet approach has manufacturing advantages: each 5nm XCD is a small die with high yield. If a CU is defective, the XCD can still be used with that CU disabled (the full MI300X spec enables 38 of each XCD's 40 physical CUs, so every die ships with defect margin built in). The tradeoff is inter-chiplet communication latency — CUs on different XCDs communicate through the Infinity Fabric on the IOD, not through on-die interconnect.

Compute Units vs Streaming Multiprocessors

AMD’s Compute Unit (CU) is the equivalent of NVIDIA’s SM, but with different resource allocations:

Compute Unit (MI300X) vs Streaming Multiprocessor (H100)

| Resource | AMD CU (CDNA3) | NVIDIA SM (Hopper) | Ratio (AMD/NVIDIA) |
|---|---|---|---|
| Units per GPU | 304 CUs | 132 SMs | 2.30x |
| Vector ALUs (FP32) | 64 per CU (19,456 total) | 128 per SM (16,896 total) | 1.15x |
| Matrix cores / Tensor cores | 4 per CU (1,216 total) | 4 per SM (528 total) | 2.30x |
| Register file per unit | 256 KB (64K x 32-bit) | 256 KB (64K x 32-bit) | 1.0x |
| Shared memory (LDS) per unit | 64 KB | Up to 228 KB | 0.28x |
| L1 cache per unit | 32 KB | Up to 256 KB (pool) | 0.13x |
| Wavefront / Warp size | 64 threads | 32 threads | 2.0x |
| Max wavefronts / warps per unit | 32 wavefronts | 64 warps | 0.50x |
| Max threads per unit | 2,048 | 2,048 | 1.0x |

Note: AMD CUs have less local memory (LDS + L1) per unit but more units overall. The wavefront size of 64 means each wavefront has 2x the threads of an NVIDIA warp.
ℹ️ Wavefronts vs Warps

AMD uses 64-thread wavefronts where NVIDIA uses 32-thread warps. This means AMD needs more thread-level parallelism per wavefront to avoid divergence penalties. A conditional branch that diverges within a 64-thread wavefront serializes twice as many threads as a 32-thread warp divergence. For LLM inference kernels (which are mostly uniform matrix operations), this difference is minor.

Memory Subsystem

MI300X vs H100 Memory Subsystem

| Spec | MI300X | H100 SXM | H200 SXM |
|---|---|---|---|
| HBM Type | HBM3 | HBM3 | HBM3e |
| HBM Stacks | 8 | 5 | 6 |
| HBM Capacity | 192 GB | 80 GB | 141 GB |
| HBM Bandwidth | 5,300 GB/s | 3,350 GB/s | 4,800 GB/s |
| L2 Cache | 256 MB | 50 MB | 50 MB |
| Memory Bus Width | 8192-bit | 5120-bit | 6144-bit |
| Infinity Fabric / NVLink BW | 896 GB/s per GPU | 900 GB/s per GPU | 900 GB/s per GPU |

Note: MI300X L2 cache is 5.12x larger than H100's. This matters for KV cache access patterns during attention.

The 256 MB L2 cache on the MI300X is notable. On the H100, the KV cache for a moderately-long context (4K tokens, 70B model) is approximately 10-20 GB — far exceeding the 50 MB L2. On the MI300X, the 256 MB L2 can cache a meaningful portion of the KV cache for short contexts or small models, reducing effective HBM traffic during attention.

Peak Compute

MI300X vs H100 Compute Throughput

| Precision | MI300X | H100 SXM | Ratio (MI300X/H100) |
|---|---|---|---|
| FP64 (TFLOPS) | 163.4 | 66.9 | 2.44x |
| FP32 (TFLOPS) | 163.4 | 66.9 | 2.44x |
| FP16 Tensor (TFLOPS) | 1,307 | 990 | 1.32x |
| BF16 Tensor (TFLOPS) | 1,307 | 990 | 1.32x |
| FP8 Tensor (TFLOPS) | 2,614 | 1,979 | 1.32x |
| INT8 Tensor (TOPS) | 2,614 | 1,979 | 1.32x |
| TDP | 750 W | 700 W | 1.07x |

Note: MI300X has higher FP64 throughput (HPC heritage). FP8/FP16 tensor TFLOPS are 32% higher. TDP is only 7% higher.

FP8 Tensor TFLOPS comparison (chart):

  • H100 SXM (3,350 GB/s HBM3, 80 GB): 1,979 TFLOPS
  • MI300X (5,300 GB/s HBM3, 192 GB): 2,614 TFLOPS (+32.1% vs H100)
  • H200 SXM (4,800 GB/s HBM3e, 141 GB): 1,979 TFLOPS

ROCm: The Software Stack

ROCm Architecture

ROCm (Radeon Open Compute) is AMD’s GPU compute platform. The stack:

  • HIP (Heterogeneous-compute Interface for Portability): The programming API, syntactically nearly identical to CUDA
  • rocBLAS: BLAS library (equivalent to cuBLAS)
  • MIOpen: Deep learning primitives (equivalent to cuDNN)
  • hipBLASLt: Lightweight BLAS with mixed-precision support
  • RCCL: Collective communication library (equivalent to NCCL)
  • Composable Kernel (CK): Kernel fusion framework
  • ROCm SMI: System management (equivalent to nvidia-smi)

HIP: The CUDA Translation Layer

HIP code is syntactically close to CUDA. Most CUDA kernels can be converted with minimal changes:

// CUDA kernel
__global__ void vector_add_cuda(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

// Launch
vector_add_cuda<<<(n+255)/256, 256>>>(a, b, c, n);

// HIP kernel (identical except for headers)
#include <hip/hip_runtime.h>

__global__ void vector_add_hip(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

// Launch (identical syntax)
vector_add_hip<<<(n+255)/256, 256>>>(a, b, c, n);

The hipify-perl and hipify-clang tools automate the conversion of CUDA source to HIP:

# Convert CUDA source to HIP
hipify-perl my_cuda_kernel.cu > my_hip_kernel.cpp

# Common replacements:
# cudaMalloc       -> hipMalloc
# cudaMemcpy       -> hipMemcpy
# cudaFree         -> hipFree
# cudaGetDeviceProperties -> hipGetDeviceProperties
# __syncthreads()  -> __syncthreads()  (same)
# atomicAdd        -> atomicAdd        (same)
# cudaStream_t     -> hipStream_t

For simple kernels, this works. For high-performance kernels that use NVIDIA-specific intrinsics (WMMA, WGMMA, TMA, __shfl_sync, cooperative groups), the conversion is non-trivial.

WMMA on ROCm

AMD’s matrix core equivalent uses a different instruction set:

// NVIDIA WMMA (16x16x16, FP16); requires <mma.h> and namespace nvcuda
#include <mma.h>
using namespace nvcuda;

wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

// AMD MFMA (Matrix Fused Multiply-Add)
// Uses rocWMMA or inline assembly
// Tile sizes differ: 16x16x16, 32x32x8, 16x16x32 (FP8)
#include <rocwmma/rocwmma.hpp>

rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, half, rocwmma::row_major> a_frag;
rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, half, rocwmma::col_major> b_frag;
rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float> c_frag;
rocwmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

The tile sizes and threading models differ at the microarchitectural level (64-thread wavefronts vs 32-thread warps), so optimized GEMM kernels must be substantially rewritten for optimal MI300X performance, not just syntactically translated.

FlashAttention on ROCm

FlashAttention is the most performance-critical kernel for LLM inference. AMD has invested heavily in a ROCm-native implementation through the Composable Kernel (CK) library.

CK FlashAttention

The Composable Kernel (CK) library provides fused attention kernels optimized for CDNA3:

# Building CK with FlashAttention support
git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
cd composable_kernel
mkdir build && cd build
# gfx942 = MI300X target
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
      -DGPU_TARGETS="gfx942" \
      -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc) ck_fmha

Performance Gap: FlashAttention

FlashAttention-2 Throughput (FP16, batch=1, 128 heads, d=128)

| Sequence Length | H100 (TFLOPS) | MI300X (TFLOPS) | MI300X / H100 |
|---|---|---|---|
| 512 | 285 | 195 | 0.68x |
| 1024 | 310 | 230 | 0.74x |
| 2048 | 325 | 265 | 0.82x |
| 4096 | 340 | 290 | 0.85x |
| 8192 | 345 | 310 | 0.90x |
| 16384 | 348 | 320 | 0.92x |

Note: Numbers are approximate and version-dependent. The gap narrows at longer sequences because the computation becomes more regular and the MI300X's higher memory bandwidth helps. Results improve with each ROCm release.

The MI300X achieves 68-92% of H100 FlashAttention throughput despite having 32% more raw tensor TFLOPS. The gap comes from:

  1. Software maturity: NVIDIA’s cuDNN FlashAttention has been optimized for 3+ years on Hopper. AMD’s CK implementation is newer.
  2. Shared memory limitation: MI300X CUs have 64 KB of LDS vs Hopper’s 228 KB shared memory. FlashAttention relies heavily on tiling data in shared memory — smaller tiles mean more HBM round-trips.
  3. TMA advantage: Hopper’s TMA hardware handles data movement asynchronously, freeing warps for computation. MI300X relies on software-managed loads.
  4. WGMMA efficiency: Hopper’s warp group matrix operations are more efficient per instruction than AMD’s MFMA instructions for the specific tile sizes FlashAttention uses.
⚠️ The Software Gap Is Closing

The MI300X FlashAttention performance numbers improve with each ROCm release. ROCm 6.0 (late 2023) showed 10-15% improvement over ROCm 5.7 for attention kernels. AMD is actively hiring kernel engineers and investing in CK development. The numbers above represent a snapshot, not a permanent state.

Where AMD Wins: Memory Capacity and Bandwidth

Single-GPU Model Hosting

The MI300X’s 192 GB capacity is its most compelling advantage for inference:

Model Hosting: Single-GPU vs Multi-GPU

| Model | FP16 Size | FP8 Size | Fits on MI300X? | Fits on H100? | Fits on H200? |
|---|---|---|---|---|---|
| Llama 3 8B | 16 GB | 8 GB | Yes (1 GPU) | Yes (1 GPU) | Yes (1 GPU) |
| Llama 3 70B | 140 GB | 70 GB | Yes (1 GPU, FP16) | No (2 GPUs, FP16) | Yes (1 GPU, FP8) |
| Mixtral 8x7B | 94 GB | 47 GB | Yes (1 GPU) | No (2 GPUs, FP16) | Yes (1 GPU, FP8) |
| Llama 3 405B | 810 GB | 405 GB | 3 GPUs (FP8) | 6 GPUs (FP8) | 3 GPUs (FP8) |
| DeepSeek V3 (active) | 74 GB | 37 GB | Yes (1 GPU) | Yes (1 GPU, FP8) | Yes (1 GPU) |

Note: Active parameters for DeepSeek V3 (MoE). Full model is 671B but activates 37B per token. KV cache memory not included.

Hosting a model on fewer GPUs has compounding benefits:

  • No tensor parallelism overhead (no all-reduce latency between GPUs)
  • No NVLink/Infinity Fabric bandwidth consumed by communication
  • Lower infrastructure cost (fewer GPUs per serving instance)
  • Simpler deployment (no multi-GPU orchestration)

Bandwidth-Bound Decode

For batch=1 decode, throughput is purely bandwidth-limited:

Tokens/s = Memory Bandwidth / (2 × Model Size in bytes)

Theoretical Batch=1 Decode (70B, FP8, single GPU, 70 GB weights):

  • H100 (3,350 GB/s): 24 tokens/s
  • MI300X (5,300 GB/s): 38 tokens/s (+58.3% vs H100)
  • H200 (4,800 GB/s): 34 tokens/s (+41.7% vs H100)

At batch=1, the MI300X’s 58% bandwidth advantage should deliver 58% more tokens per second. In practice, the measured advantage is typically 30-45% due to software overhead — kernel launch latency, less-optimized GEMV kernels, and attention kernel differences.

Large-Batch Inference

At large batch sizes, compute becomes the bottleneck. Here the MI300X’s 32% FP8 TFLOPS advantage should matter, but:

Large-Batch Inference (70B, FP8, 8 GPUs)

| Metric | 8x MI300X | 8x H100 | MI300X Advantage |
|---|---|---|---|
| Aggregate FP8 TFLOPS | 20,912 | 15,832 | 1.32x |
| Aggregate HBM BW | 42.4 TB/s | 26.8 TB/s | 1.58x |
| Aggregate HBM Capacity | 1,536 GB | 640 GB | 2.40x |
| Interconnect BW per GPU | 896 GB/s | 900 GB/s | 1.00x |
| Measured GEMM throughput | ~75% of peak | ~85% of peak | 0.88x efficiency |
| Effective throughput (batch=256) | ~15,700 TFLOPS | ~13,500 TFLOPS | 1.16x |

Note: GEMM efficiency on MI300X is lower due to kernel optimization gap. NVIDIA's cuBLAS achieves higher percentage of peak TFLOPS.

The 32% hardware advantage shrinks to 16% effective advantage at large batch because of the GEMM efficiency gap.

Where AMD Loses: The Software Ecosystem

The CUDA Moat

The NVIDIA software ecosystem represents over a decade of optimization:

Software Ecosystem Comparison

| Component | NVIDIA | AMD | Gap Assessment |
|---|---|---|---|
| BLAS Library | cuBLAS (mature, auto-tuned) | rocBLAS (good, less auto-tuning) | Moderate gap |
| DNN Library | cuDNN 9 (fused kernels, graph API) | MIOpen (fewer fusions) | Large gap |
| FlashAttention | Native (cuDNN), FA3 for Hopper | CK-based FA2 port | Moderate gap |
| Quantization (FP8) | Transformer Engine (automatic) | Manual FP8 support | Large gap |
| Profiler | Nsight Compute/Systems (excellent) | rocprof, Omniperf (improving) | Moderate gap |
| vLLM Support | First-class, all features | Supported, some features missing | Moderate gap |
| TensorRT-LLM | Native (NVIDIA only) | N/A | No equivalent |
| Triton (compiler) | Native support, mature | ROCm backend, improving | Moderate gap |
| PyTorch | CUDA backend (default) | ROCm backend (supported) | Small gap for training |

The most impactful gap is the lack of a TensorRT-LLM equivalent on AMD. TensorRT-LLM provides fused attention, quantized GEMM, in-flight batching, paged KV cache, and speculative decoding — all optimized for NVIDIA hardware. AMD relies on vLLM (which works but with fewer optimizations) or custom solutions.

Kernel Optimization Gap in Practice

The kernel optimization gap manifests as lower utilization of theoretical hardware capability:

Achieved % of Peak FP8 TFLOPS (GEMM, M=N=K=4096)

| Configuration | Achieved TFLOPS | Peak TFLOPS | % of Peak |
|---|---|---|---|
| H100 (cuBLAS) | 1,722 | 1,979 | 87% |
| MI300X (rocBLAS) | 1,882 | 2,614 | 72% |
| MI300X (hipBLASLt) | 1,987 | 2,614 | 76% |

NVIDIA’s cuBLAS achieves 87% of peak tensor TFLOPS for large GEMMs. AMD’s rocBLAS achieves 72-76%. In absolute TFLOPS, the MI300X still wins (1,882-1,987 vs 1,722), but the efficiency gap means AMD’s 32% hardware advantage translates to only 10-15% throughput advantage.

For smaller GEMMs (which dominate small-batch inference), the gap is larger because AMD’s library tuning for non-standard matrix shapes is less comprehensive.

Real Inference Benchmarks

vLLM on MI300X vs H100

vLLM supports both CUDA and ROCm backends. Benchmark results for Llama 3 70B serving:

vLLM Llama 3 70B Serving (FP16, Tensor Parallel)

| Configuration | Input 256 / Output 128 | Input 1024 / Output 256 | Input 4096 / Output 512 |
|---|---|---|---|
| 1x MI300X (TP=1, fits in 192 GB) | 32 tok/s | 28 tok/s | 22 tok/s |
| 2x H100 (TP=2, needs 160 GB) | 29 tok/s | 26 tok/s | 21 tok/s |
| 1x H200 (TP=1, fits in 141 GB at FP8) | 36 tok/s | 31 tok/s | 25 tok/s |
| 4x MI300X (TP=4, high batch) | 4,200 tok/s | 3,100 tok/s | 1,800 tok/s |
| 4x H100 (TP=4, high batch) | 3,800 tok/s | 2,900 tok/s | 1,700 tok/s |

Note: The first three rows are single-user latency (batch=1); the high-batch rows are aggregate throughput with continuous batching. Results are approximate and depend heavily on vLLM version and ROCm version.

Key observations:

  • Single-GPU MI300X vs 2-GPU H100: MI300X wins by 10% despite lower software efficiency, because it avoids TP communication overhead entirely
  • Single-GPU H200 vs MI300X: H200 wins by 12% at FP8 because of better kernel efficiency, even with less bandwidth (4,800 vs 5,300 GB/s)
  • Multi-GPU scaling: MI300X scales similarly to H100 at 4-GPU TP, with the bandwidth advantage providing a consistent 10% edge

Cost-Adjusted Performance

The MI300X is priced at approximately $10,000-15,000 per unit (varies by volume), compared to $25,000-35,000 for an H100 SXM. At roughly half the price with competitive performance, the cost per token is substantially better on the MI300X:

Approximate Cost per Million Output Tokens (70B, FP16, batch=1, amortized hardware cost):

  • 1x MI300X ($12K, 32 tok/s, single GPU): $0.10 per 1M tokens
  • 2x H100 ($60K, 29 tok/s, TP=2): $0.29 per 1M tokens (+190% vs MI300X)
  • 1x H200 ($35K, 36 tok/s, single GPU): $0.17 per 1M tokens (+70% vs MI300X)
The Economics Argument

For workloads where the software ecosystem gap is manageable (standard model architectures, vLLM serving, no custom kernel requirements), the MI300X offers 2-3x better cost-efficiency than H100. This is why cloud providers are deploying MI300X instances — the price-performance ratio is compelling even with the software overhead.

Implementation: Porting CUDA to HIP

Building for MI300X

# Install ROCm (Ubuntu)
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/6.0 jammy main' | \
  sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update && sudo apt install rocm-dev

# Compile HIP code for MI300X (gfx942)
hipcc --offload-arch=gfx942 -O3 my_kernel.cpp -o my_kernel

# Check GPU info (equivalent to nvidia-smi)
rocm-smi
rocm-smi --showmeminfo all
rocminfo | grep "gfx"

Bandwidth Measurement on MI300X

#include <hip/hip_runtime.h>
#include <cstdio>

void measure_hbm_bandwidth(size_t size_mb) {
    size_t size = size_mb * 1024 * 1024;
    void *d_src, *d_dst;
    hipMalloc(&d_src, size);
    hipMalloc(&d_dst, size);
    hipMemset(d_src, 1, size);

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    // Warm up
    hipMemcpy(d_dst, d_src, size, hipMemcpyDeviceToDevice);
    hipDeviceSynchronize();

    int iterations = 100;
    hipEventRecord(start);
    for (int i = 0; i < iterations; i++) {
        hipMemcpy(d_dst, d_src, size, hipMemcpyDeviceToDevice);
    }
    hipEventRecord(stop);
    hipDeviceSynchronize();

    float ms;
    hipEventElapsedTime(&ms, start, stop);
    double bandwidth = (2.0 * size * iterations) / (ms / 1000.0) / 1e9;

    printf("MI300X HBM Bandwidth: Size=%zu MB, BW=%.1f GB/s (%.1f%% of 5,300 GB/s)\n",
           size_mb, bandwidth, bandwidth / 5300.0 * 100.0);

    hipFree(d_src);
    hipFree(d_dst);
    hipEventDestroy(start);
    hipEventDestroy(stop);
}

int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    printf("Device: %s\n", prop.name);
    printf("CUs: %d\n", prop.multiProcessorCount);
    printf("Total memory: %zu MB\n", prop.totalGlobalMem / (1024*1024));
    printf("Memory clock: %d MHz\n", prop.memoryClockRate / 1000);
    printf("Memory bus: %d bits\n", prop.memoryBusWidth);
    printf("\n");

    measure_hbm_bandwidth(256);
    measure_hbm_bandwidth(1024);
    measure_hbm_bandwidth(4096);
    return 0;
}

Expected results: approximately 4,800-5,000 GB/s for large copies (90-94% of theoretical 5,300 GB/s).

Profiling on ROCm

# Basic profiling with rocprof
rocprof --stats ./my_binary
# Generates results.stats.csv with kernel duration, memory operations

# Detailed performance counters with Omniperf
omniperf profile -n my_profile -- ./my_binary
omniperf analyze -p my_profile/
# Shows: compute utilization, memory bandwidth, cache hit rates, LDS utilization

# ROCm SMI monitoring (equivalent to nvidia-smi dmon)
rocm-smi -d 0 --showuse --showmemuse --showtemp --showpower

MI300X vs H100 vs H200: When to Choose What

GPU Selection Guide for LLM Inference

| Scenario | Best Choice | Reasoning |
|---|---|---|
| 70B model, cost-sensitive | MI300X | Single GPU, 2-3x cheaper than 2x H100 |
| 70B model, latency-sensitive | H200 | Better kernel efficiency, FP8 support |
| 405B model, 8-GPU node | 8x H100/H200 | TensorRT-LLM optimizations, NVLink |
| 405B model, fewer GPUs | MI300X | Fewer GPUs needed (3 vs 6 at FP8) |
| Custom kernels required | H100/H200 | CUDA ecosystem, Nsight Compute |
| vLLM + standard models | MI300X | Good ROCm support, better cost/token |
| FP4 quantization | B200 | Only NVIDIA supports FP4 in hardware |
The MI300X is the right choice when: (1) memory capacity is the constraint, (2) cost-per-token matters more than absolute latency, (3) you are running standard model architectures served by vLLM or similar frameworks, and (4) you do not need custom kernel development. The H100/H200 is the right choice when: (1) you need the absolute lowest latency, (2) you require TensorRT-LLM or custom CUDA kernels, (3) your workload benefits from Hopper-specific features (TMA, WGMMA, FP8 Transformer Engine).

The software gap is real but narrowing. AMD’s investment in ROCm, the Composable Kernel library, and direct contributions to PyTorch and vLLM is paying off. The MI300X is not a paper competitor — it is a deployable, production-grade AI accelerator with a compelling price-performance ratio. The question is whether your specific workload can tolerate the 15-25% kernel efficiency gap in exchange for 2-3x better economics.