CUDA’s dominance in GPU computing is not primarily a hardware story. NVIDIA’s GPUs are excellent, but AMD’s MI300X has competitive raw specifications. The dominance is a software story: CUDA provides a programming model with 17 years of tooling, libraries (cuBLAS, cuDNN, cuSPARSE, NCCL, TensorRT), profiling tools (Nsight Compute, Nsight Systems), and an ecosystem of frameworks (PyTorch, TensorFlow, JAX) that are deeply integrated with CUDA-specific features.
Every alternative GPU programming model faces the same challenge: matching not just the hardware performance but the entire software stack. This post compares the major alternatives on their programming model, current maturity for AI workloads, performance relative to CUDA, and practical usability.
CUDA: The Reference Model
Execution Model
CUDA’s execution model is built on three hierarchical abstractions: grids of thread blocks, thread blocks of warps, and warps of 32 threads that execute in lockstep (SIMT):
// CUDA kernel: explicit thread hierarchy
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
// Global thread ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
// Launch configuration
int threads_per_block = 256;
int blocks = (n + threads_per_block - 1) / threads_per_block;
vector_add<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);
// Memory model:
// - Global memory (HBM): visible to all threads, high latency (~500 cycles)
// - Shared memory: per-block, low latency (~30 cycles), programmer-managed
// - Registers: per-thread, fastest
// - Constant memory: cached, broadcast to warp
// - Texture memory: cached, optimized for 2D spatial locality
// Synchronization:
// - __syncthreads(): barrier within a thread block
// - __syncwarp(): barrier within a warp (Volta+)
// - atomicAdd/CAS: device-wide atomic operations
// - cooperative_groups: flexible synchronization
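The index arithmetic above is easy to sanity-check on the host. A minimal Python sketch (my own illustration, not part of the CUDA API) simulating `idx = blockIdx.x * blockDim.x + threadIdx.x` shows that the ceil-division launch plus the `if (idx < n)` guard covers every element exactly once:

```python
def covered_indices(n: int, threads_per_block: int) -> list[int]:
    """Simulate CUDA's global-index computation for every launched thread."""
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceil division
    indices = []
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            idx = block_idx * threads_per_block + thread_idx
            if idx < n:  # same guard as the kernel body
                indices.append(idx)
    return indices

# n = 1000 with 256-thread blocks launches 4 blocks (1024 threads);
# the guard masks off the 24-thread overhang in the last block.
assert covered_indices(1000, 256) == list(range(1000))
```

The same ceil-division pattern appears in every launch configuration in this post, regardless of the programming model.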
CUDA’s Software Ecosystem Advantage
NVIDIA CUDA ecosystem components (non-exhaustive):
Math Libraries:
cuBLAS - BLAS (GEMM, GEMV) with tensor core support
cuBLASLt - Lightweight BLAS with epilogue fusion
cuSPARSE - Sparse matrix operations
cuSPARSELt - 2:4 structured sparsity
cuSOLVER - Dense/sparse linear solvers
cuFFT - Fast Fourier Transform
cuRAND - Random number generation
Deep Learning:
cuDNN - Convolutions, normalization, attention, RNNs
TensorRT - Inference optimization and deployment
TensorRT-LLM - LLM-specific inference engine
cuTENSOR - Tensor contractions
Communication:
NCCL - Multi-GPU/multi-node collective communication
NVSHMEM - Symmetric heap memory for GPU-to-GPU access
NVLink/NVSwitch - Hardware interconnect
Profiling:
Nsight Compute - Kernel-level performance analysis
Nsight Systems - System-level timeline profiling
CUPTI - Profiling API
Compiler/Runtime:
nvcc - CUDA compiler
PTX - Virtual ISA (portable across GPU generations)
CUDA Runtime - Memory management, streams, events
CUDA Driver - Low-level GPU control
Specialized:
CUTLASS - Template library for GEMM/convolution kernels
CUB - Block-level and device-level primitives
Thrust - STL-like parallel algorithms
FlashAttention - Fused attention (third-party, CUDA-specific)
The ecosystem is the moat. Each library in CUDA’s stack has been optimized by hundreds of engineers over years. A competitor must either replicate this entire stack or provide automated translation. Neither is straightforward.
AMD ROCm and HIP
HIP: The CUDA Compatibility Layer
AMD’s HIP (Heterogeneous-compute Interface for Portability) provides a CUDA-like programming model that can compile for both AMD and NVIDIA GPUs:
// HIP kernel: nearly identical syntax to CUDA
#include <hip/hip_runtime.h>
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
int main() {
// API is nearly 1:1 with CUDA (prefix changed from cuda to hip)
float *d_a, *d_b, *d_c;
hipMalloc(&d_a, n * sizeof(float));
hipMalloc(&d_b, n * sizeof(float));
hipMalloc(&d_c, n * sizeof(float));
hipMemcpy(d_a, h_a, n * sizeof(float), hipMemcpyHostToDevice);
hipMemcpy(d_b, h_b, n * sizeof(float), hipMemcpyHostToDevice);
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
hipMemcpy(h_c, d_c, n * sizeof(float), hipMemcpyDeviceToHost);
hipFree(d_a);
hipFree(d_b);
hipFree(d_c);
return 0;
}
// Compilation:
// AMD GPU: hipcc vector_add.cpp -o vector_add
// NVIDIA GPU: HIP_PLATFORM=nvidia hipcc vector_add.cpp -o vector_add
Key Differences from CUDA
HIP/ROCm vs CUDA differences:
1. Warp size:
CUDA: warp = 32 threads (NVIDIA standard since G80)
ROCm: wavefront = 64 threads on CDNA (MI series); RDNA defaults to 32
Impact: warp-level primitives (__shfl, __ballot) operate on different widths
HIP provides: warpSize variable (64 on AMD, 32 on NVIDIA)
2. Shared memory:
CUDA: configurable L1/shared memory split (up to 228 KB on H100)
ROCm: LDS (Local Data Share) is separate from L1 (64 KB per CU on MI300X)
Impact: algorithms tuned for large shared memory may need adjustment
3. Memory model:
CUDA: relaxed memory model with explicit fences
ROCm: similar model, but different cache hierarchy
MI300X: 256 MB Infinity Cache (much larger than NVIDIA L2)
4. Tensor cores vs Matrix cores:
CUDA: wmma/mma PTX instructions, well-documented
ROCm: MFMA (Matrix Fused Multiply-Add) instructions
HIP provides: rocWMMA library (limited compared to CUTLASS)
5. Library availability:
cuBLAS -> rocBLAS (good parity)
cuDNN -> MIOpen (partial parity, some ops missing)
cuSPARSE -> rocSPARSE (reasonable parity)
NCCL -> RCCL (reasonable parity)
TensorRT -> N/A (no equivalent)
Nsight Compute -> rocprof (less polished)
CUTLASS -> composable_kernel (AMD's equivalent, less mature)
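To make the warp-size difference concrete, here is a hedged Python model (my own sketch, not HIP code) of why a shuffle-based reduction hard-coded for 32 lanes silently breaks on a 64-wide wavefront: a butterfly reduction needs log2(warpSize) steps, so code that loops exactly 5 times drops half the partial sums on AMD CDNA:

```python
import math

def reduction_steps(warp_size: int) -> int:
    """Steps needed for a full butterfly (shuffle-xor) reduction."""
    assert warp_size & (warp_size - 1) == 0, "warp size must be a power of two"
    return int(math.log2(warp_size))

def simulate_butterfly_sum(values: list[float]) -> float:
    """Simulate a __shfl_xor-style reduction across one warp/wavefront."""
    width = len(values)
    vals = list(values)
    for step in range(reduction_steps(width)):
        offset = 1 << step
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
    return vals[0]  # after log2(width) steps, every lane holds the full sum

assert reduction_steps(32) == 5   # NVIDIA warp
assert reduction_steps(64) == 6   # AMD CDNA wavefront: one extra step required
assert simulate_butterfly_sum([1.0] * 64) == 64.0
```

This is why portable HIP code loops over `warpSize` rather than the literal 32.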
# hipify-perl: automated CUDA-to-HIP source translation
# Converts cuda* API calls to hip* equivalents
# Before (CUDA):
# cudaMalloc(&d_ptr, size);
# cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice);
# kernel<<<grid, block>>>(d_ptr);
# cudaDeviceSynchronize();
# After (HIP, via hipify):
# hipMalloc(&d_ptr, size);
# hipMemcpy(d_ptr, h_ptr, size, hipMemcpyHostToDevice);
# kernel<<<grid, block>>>(d_ptr);
# hipDeviceSynchronize();
# hipify handles ~95% of straightforward CUDA code automatically.
# The remaining 5% requires manual work:
# - Inline PTX assembly -> must be rewritten for AMDGPU ISA
# - CUDA-specific intrinsics (__ldg, __stcs) -> different AMD equivalents
# - Warp-size-dependent code -> needs warpSize abstraction
# - CUTLASS templates -> must use composable_kernel or write custom
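The translation hipify performs is mostly mechanical renaming, which is why it automates so well. A toy Python sketch of the idea (the real hipify-perl is a large script with hundreds of rules and special cases):

```python
import re

# Ordered rename rules: specific enum names first, then the generic
# cudaXxx -> hipXxx rename for API calls.
RENAMES = [
    (re.compile(r"\bcudaMemcpyHostToDevice\b"), "hipMemcpyHostToDevice"),
    (re.compile(r"\bcudaMemcpyDeviceToHost\b"), "hipMemcpyDeviceToHost"),
    (re.compile(r"\bcuda([A-Z]\w*)"), r"hip\1"),  # cudaMalloc -> hipMalloc, etc.
]

def toy_hipify(source: str) -> str:
    """Apply the rename rules to a CUDA source string."""
    for pattern, replacement in RENAMES:
        source = pattern.sub(replacement, source)
    return source

cuda_src = "cudaMalloc(&d_ptr, size);\ncudaDeviceSynchronize();"
hip_src = toy_hipify(cuda_src)
assert "hipMalloc" in hip_src and "cuda" not in hip_src
```

The hard 5% is exactly what regex renaming cannot touch: inline PTX, warp-width assumptions, and template libraries.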
ROCm Library Maturity vs CUDA
| Category | CUDA Library | ROCm Equivalent | Parity Level |
|---|---|---|---|
| BLAS | cuBLAS | rocBLAS | 95% (excellent) |
| DNN | cuDNN | MIOpen | 75% (gaps in attention, newer ops) |
| Sparse | cuSPARSE | rocSPARSE | 85% (good) |
| Collective Comms | NCCL | RCCL | 85% (good, some topology gaps) |
| Inference Engine | TensorRT / TensorRT-LLM | N/A | 0% (no equivalent) |
| GEMM Templates | CUTLASS | composable_kernel | 50% (early stage) |
| Profiler | Nsight Compute | rocprof / omniperf | 60% |
| Framework Support | PyTorch (native) | PyTorch (ROCm build) | 85% |
Apple Metal Compute
Metal Performance Shaders and MPS
Apple’s Metal API provides GPU compute access on Apple Silicon (M1-M4 chips) with unified memory architecture:
// Metal compute shader (MSL - Metal Shading Language)
// File: vector_add.metal
#include <metal_stdlib>
using namespace metal;
kernel void vector_add(
device const float* a [[buffer(0)]],
device const float* b [[buffer(1)]],
device float* c [[buffer(2)]],
uint idx [[thread_position_in_grid]]
) {
c[idx] = a[idx] + b[idx];
}
# PyTorch MPS backend (Apple Silicon)
import torch

# Use MPS device
device = torch.device("mps")

# Tensors on GPU
a = torch.randn(1000000, device=device)
b = torch.randn(1000000, device=device)
c = a + b  # Computed on Apple GPU

# LLM inference on MPS:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
).to("mps")

# Works, but slower than CUDA due to:
# - No tensor cores (Apple GPU uses a different SIMD architecture)
# - Limited memory bandwidth (M2 Ultra: 800 GB/s vs H100: 3,350 GB/s)
# - MPS backend missing many fused operators
# - No FlashAttention for MPS (separate Metal implementation needed)
Metal Limitations for AI
Apple Metal for AI workloads:
Advantages:
- Unified memory: no CPU-GPU copies (M2 Ultra: up to 192 GB shared)
- Power efficiency: M2 Ultra ~100W vs H100 700W
- macOS integration: runs natively on developer machines
- llama.cpp Metal backend: reasonable LLM inference performance
Limitations:
- No dedicated tensor cores (uses generic SIMD units)
- FP16 compute throughput: M2 Ultra ~27 TFLOPS vs H100 990 TFLOPS (37x gap)
- Memory bandwidth: M2 Ultra ~800 GB/s vs H100 3,350 GB/s (4x gap)
- No multi-GPU support (each Apple Silicon is one chip)
- No datacenter deployment (no rack-mountable Apple GPU servers)
- Limited training support (no distributed training frameworks)
- PyTorch MPS backend has operator coverage gaps
- No equivalent to cuBLAS, cuDNN, NCCL, TensorRT
Practical use case:
- Local inference on MacBooks/Mac Studios with llama.cpp
- Development and testing before deploying to CUDA GPUs
- NOT suitable for production serving or training
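A back-of-the-envelope roofline (my own sketch, using the bandwidth figures quoted above) explains why these machines remain useful for local inference despite the compute gap: single-stream decode is memory-bandwidth-bound, so the throughput ceiling is bandwidth divided by the bytes streamed per token, roughly the full weight footprint:

```python
def decode_tokens_per_sec_upper_bound(bandwidth_gb_s: float,
                                      params_billions: float,
                                      bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling for batch-1 decode: every token reads all weights."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weight_gb

# 7B model, FP16 (2 bytes/param) on M2 Ultra (~800 GB/s): ~57 tok/s ceiling
m2_fp16 = decode_tokens_per_sec_upper_bound(800, 7, 2.0)

# Same model at INT4 (~0.5 bytes/param): ceiling rises to ~229 tok/s,
# which is why 4-bit quantization is the standard for local inference
m2_int4 = decode_tokens_per_sec_upper_bound(800, 7, 0.5)
```

By this estimate the 37x compute gap barely matters for batch-1 decode; the 4x bandwidth gap is the binding constraint.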
Vulkan Compute
Cross-Platform GPU Access
Vulkan compute shaders provide GPU access on virtually every platform (Windows, Linux, macOS via MoltenVK, Android, iOS):
// Vulkan compute shader (GLSL)
#version 450
layout(set = 0, binding = 0) buffer InputA { float a[]; };
layout(set = 0, binding = 1) buffer InputB { float b[]; };
layout(set = 0, binding = 2) buffer Output { float c[]; };
layout(local_size_x = 256) in;
void main() {
uint idx = gl_GlobalInvocationID.x;
c[idx] = a[idx] + b[idx];
}
// Vulkan compute dispatch (C++ host code, simplified)
// The full Vulkan setup requires ~500 lines of boilerplate:
// 1. Create instance, physical device, logical device
// 2. Create compute pipeline with shader module
// 3. Allocate buffers and descriptor sets
// 4. Record command buffer with dispatch command
// 5. Submit to compute queue and wait
// Key dispatch code:
vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdBindDescriptorSets(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE,
pipelineLayout, 0, 1, &descriptorSet, 0, nullptr);
vkCmdDispatch(cmdBuffer, (n + 255) / 256, 1, 1);
vkQueueSubmit(computeQueue, 1, &submitInfo, fence);
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
Vulkan for AI: llama.cpp’s Vulkan Backend
llama.cpp uses Vulkan compute for cross-platform inference, making LLMs accessible on AMD, Intel, and even mobile GPUs:
llama.cpp Vulkan backend:
- Matrix multiply via Vulkan compute shaders
- No portable tensor core access (VK_KHR_cooperative_matrix exists, but driver support and tooling are uneven)
- Uses FP16 compute when supported (VK_KHR_shader_float16)
- Subgroup operations for warp-level primitives
- Works on: NVIDIA, AMD, Intel, Qualcomm Adreno, Apple (MoltenVK)
Performance vs CUDA (Llama-7B, INT4 quantized, decode BS=1):
- CUDA (RTX 4090): ~130 tok/s
- Vulkan (RTX 4090): ~85 tok/s (65% of CUDA)
- Vulkan (RX 7900 XTX): ~70 tok/s
- Vulkan (Arc A770): ~35 tok/s
- Metal (M2 Ultra): ~45 tok/s
The 35% performance gap vs CUDA comes from:
1. No tensor core access (cuBLAS INT4 kernels use tensor cores)
2. Generic compute shaders vs hand-optimized CUDA kernels
3. Vulkan dispatch overhead (more verbose command submission)
4. No shared memory bank conflict optimization (no hardware visibility)
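The relative figures follow directly from the throughputs listed above; note that only the same-GPU pair (CUDA vs Vulkan on the RTX 4090) isolates the programming-model cost, since the other rows change the hardware as well:

```python
# Decode throughputs (tok/s) quoted above, normalized against the
# CUDA baseline on the RTX 4090.
baseline = 130.0  # CUDA, RTX 4090

measurements = {
    "Vulkan (RTX 4090)":   85.0,
    "Vulkan (RX 7900 XTX)": 70.0,
    "Vulkan (Arc A770)":    35.0,
    "Metal (M2 Ultra)":     45.0,
}

relative = {name: round(100 * toks / baseline)
            for name, toks in measurements.items()}

# Same-GPU comparison: Vulkan reaches ~65% of CUDA on identical hardware.
assert relative["Vulkan (RTX 4090)"] == 65
```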
[Figure: LLM inference throughput by programming model, tokens/sec (Llama-7B INT4, decode BS=1); values as listed above]
SYCL and oneAPI
Intel’s Cross-Platform Abstraction
SYCL (a Khronos Group standard, with Intel's DPC++ as the leading implementation) provides a single-source C++ programming model that targets multiple backends (CUDA, ROCm, Level Zero/Intel GPUs, OpenCL):
// SYCL kernel: single-source C++
#include <sycl/sycl.hpp>
int main() {
sycl::queue q{sycl::gpu_selector_v};
float* a = sycl::malloc_device<float>(n, q);
float* b = sycl::malloc_device<float>(n, q);
float* c = sycl::malloc_device<float>(n, q);
q.memcpy(a, h_a, n * sizeof(float));
q.memcpy(b, h_b, n * sizeof(float));
q.wait();
q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
c[idx] = a[idx] + b[idx];
}).wait();
q.memcpy(h_c, c, n * sizeof(float));
q.wait();
sycl::free(a, q);
sycl::free(b, q);
sycl::free(c, q);
return 0;
}
// Compile for different targets:
// Intel GPU:  icpx -fsycl -fsycl-targets=spir64_gen vector_add.cpp
// NVIDIA GPU: icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda vector_add.cpp
// AMD GPU:    icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa vector_add.cpp
// SYCL advantages:
// - Single source code for all GPU vendors
// - C++ standard-like interface (lambdas, ranges)
// - Automatic memory management options (USM, buffers)
//
// SYCL disadvantages:
// - Performance typically 70-90% of native (CUDA or ROCm)
// - Limited library ecosystem (oneMKL wraps cuBLAS/rocBLAS)
// - Vendor-specific optimizations not accessible
// - Debugging and profiling tools less mature
Performance Comparison Matrix
GPU Programming Model Comparison for AI Workloads
| Feature | CUDA | ROCm/HIP | Metal | Vulkan | SYCL |
|---|---|---|---|---|---|
| Vendor | NVIDIA only | AMD (+ NVIDIA via HIP) | Apple only | All vendors | All (via backends) |
| Tensor core access | Full (wmma, mma, wgmma) | Full (MFMA) | None | None | Via backend libs |
| GEMM library | cuBLAS (world-class) | rocBLAS (good) | MPSGraph (basic) | None (manual) | oneMKL (wraps native) |
| LLM framework support | All (vLLM, TRT-LLM, etc) | vLLM, some others | llama.cpp, MLX | llama.cpp | Limited |
| Profiling | Nsight (excellent) | rocprof (adequate) | Metal GPU Profiler | RenderDoc (limited) | Intel VTune |
| Typical perf vs CUDA | 100% (reference) | 85-95% | 10-15% | 55-70% | 70-90% |
| Production readiness | Mature | Improving rapidly | Development only | Hobbyist/edge | Early |
Practical Decision Framework
# Which programming model to use for AI workloads
def choose_gpu_platform(requirements):
    """Decision tree for GPU programming model selection."""
    if requirements["workload"] == "training":
        if requirements.get("scale") == "large" or requirements.get("multi_gpu"):
            return "CUDA"  # Only viable option for large-scale training
        if requirements.get("hardware") == "AMD MI300X":
            return "ROCm"  # AMD now viable for single-node training
        return "CUDA"
    if requirements["workload"] == "inference_serving":
        if requirements.get("priority") == "maximum_throughput":
            return "CUDA"  # TensorRT-LLM, vLLM with CUDA
        if requirements.get("hardware") == "AMD":
            return "ROCm"  # vLLM has ROCm support
        if requirements.get("priority") == "portability":
            return "llama.cpp (CUDA/Metal/Vulkan auto-detected)"
        return "CUDA"
    if requirements["workload"] == "local_inference":
        if requirements.get("hardware") == "Apple Silicon":
            return "Metal (via llama.cpp or MLX)"
        if requirements.get("hardware") == "AMD consumer":
            return "Vulkan or ROCm"
        if requirements.get("hardware") == "NVIDIA consumer":
            return "CUDA"
        return "Vulkan (universal fallback)"
    if requirements["workload"] == "research_prototyping":
        return "CUDA (PyTorch)"  # Largest ecosystem, most examples
    return "CUDA"  # Default when no rule matches
For production AI inference, the practical choice today is CUDA for NVIDIA GPUs and ROCm for AMD datacenter GPUs (MI250X, MI300X). Everything else is either not production-ready (Metal, SYCL) or incurs a significant performance penalty (Vulkan). For local/edge inference on diverse hardware, llama.cpp’s multi-backend approach (CUDA + Metal + Vulkan) provides the broadest coverage with reasonable performance.
The Future: Will CUDA’s Dominance Persist?
Factors maintaining CUDA dominance:
1. Ecosystem lock-in: 17 years of libraries, tutorials, Stack Overflow answers
2. PyTorch is CUDA-first: most operators tested primarily on CUDA
3. TensorRT-LLM: no equivalent on other platforms
4. NCCL: best multi-GPU communication library
5. Developer familiarity: most GPU programmers learned CUDA first
Factors eroding CUDA dominance:
1. AMD MI300X is hardware-competitive (192 GB HBM3, 5.3 TB/s)
2. ROCm/HIP can auto-translate most CUDA code
3. Cloud providers negotiating AMD GPU contracts (cost pressure)
4. Triton (OpenAI) compiles to both CUDA PTX and AMD AMDGPU ISA
5. llama.cpp proves Vulkan/Metal viable for inference
6. MLIR/StableHLO provide hardware-agnostic intermediate representations
Timeline estimate:
- 2024-2025: CUDA >90% of AI compute, AMD gaining in inference
- 2026-2028: CUDA ~80%, AMD ~15%, others ~5%
- Long term: hardware-agnostic compilers (Triton, XLA) may abstract away
the programming model, making the GPU ISA less relevant
Summary
CUDA dominates GPU computing for AI through a combination of hardware excellence and an unmatched software ecosystem. ROCm/HIP is the closest competitor, providing 85-95% of CUDA performance with good library coverage and a source-level compatibility layer. Metal provides functional local inference on Apple Silicon but lacks the throughput for production serving. Vulkan offers universal portability at 55-70% of native performance. SYCL provides a clean programming abstraction but lacks ecosystem depth. For production AI workloads today, CUDA remains the default choice, with ROCm as a viable alternative for AMD datacenter GPUs. The long-term trend points toward hardware-agnostic compiler frameworks (Triton, XLA, MLIR) that abstract the programming model, potentially reducing CUDA’s ecosystem advantage.