Part of series: GPU Hardware & AI Accelerators (21 of 30)
GPU Programming Models: CUDA vs ROCm vs Metal vs Vulkan Compute — Portability and Performance

CUDA’s dominance in GPU computing is not primarily a hardware story. NVIDIA’s GPUs are excellent, but AMD’s MI300X has competitive raw specifications. The dominance is a software story: CUDA provides a programming model with 17 years of tooling, libraries (cuBLAS, cuDNN, cuSPARSE, NCCL, TensorRT), profiling tools (Nsight Compute, Nsight Systems), and an ecosystem of frameworks (PyTorch, TensorFlow, JAX) that are deeply integrated with CUDA-specific features.

Every alternative GPU programming model faces the same challenge: matching not just the hardware performance but the entire software stack. This post compares the major alternatives on their programming model, current maturity for AI workloads, performance relative to CUDA, and practical usability.

CUDA: The Reference Model

Execution Model

CUDA’s execution model is built on a hierarchy of abstractions: grids of thread blocks, thread blocks of threads, and the hardware’s grouping of threads into warps of 32 that execute in lockstep (SIMT):

// CUDA kernel: explicit thread hierarchy
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Global thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Launch configuration
int threads_per_block = 256;
int blocks = (n + threads_per_block - 1) / threads_per_block;
vector_add<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);

// Memory model:
// - Global memory (HBM): visible to all threads, high latency (~500 cycles)
// - Shared memory: per-block, low latency (~30 cycles), programmer-managed
// - Registers: per-thread, fastest
// - Constant memory: cached, broadcast to warp
// - Texture memory: cached, optimized for 2D spatial locality

// Synchronization:
// - __syncthreads(): barrier within a thread block
// - __syncwarp(): barrier within a warp (Volta+)
// - atomicAdd/CAS: device-wide atomic operations
// - cooperative_groups: flexible synchronization
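The launch configuration above uses ceiling division so the grid covers all n elements. A quick Python model of that arithmetic (illustrative only, assuming the standard 32-thread warp):

```python
def launch_config(n: int, threads_per_block: int = 256, warp_size: int = 32):
    """1D launch sizing: ceiling division so the grid covers all n elements."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    warps_per_block = threads_per_block // warp_size  # what the warp schedulers issue
    return blocks, warps_per_block

# 1,000,000 elements -> 3907 blocks of 256 threads (8 warps each);
# the `if (idx < n)` guard in the kernel masks off the overshoot in the last block.
blocks, warps = launch_config(1_000_000)
```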

CUDA’s Software Ecosystem Advantage

NVIDIA CUDA ecosystem components (non-exhaustive):

Math Libraries:
  cuBLAS        - BLAS (GEMM, GEMV) with tensor core support
  cuBLASLt      - Lightweight BLAS with epilogue fusion
  cuSPARSE      - Sparse matrix operations
  cuSPARSELt    - 2:4 structured sparsity
  cuSOLVER      - Dense/sparse linear solvers
  cuFFT         - Fast Fourier Transform
  cuRAND        - Random number generation

Deep Learning:
  cuDNN         - Convolutions, normalization, attention, RNNs
  TensorRT      - Inference optimization and deployment
  TensorRT-LLM  - LLM-specific inference engine
  cuTENSOR      - Tensor contractions

Communication:
  NCCL          - Multi-GPU/multi-node collective communication
  NVSHMEM       - Symmetric heap memory for GPU-to-GPU access
  NVLink/NVSwitch - Hardware interconnect

Profiling:
  Nsight Compute - Kernel-level performance analysis
  Nsight Systems - System-level timeline profiling
  CUPTI          - Profiling API

Compiler/Runtime:
  nvcc          - CUDA compiler
  PTX           - Virtual ISA (portable across GPU generations)
  CUDA Runtime  - Memory management, streams, events
  CUDA Driver   - Low-level GPU control

Specialized:
  CUTLASS       - Template library for GEMM/convolution kernels
  CUB           - Block-level and device-level primitives
  Thrust        - STL-like parallel algorithms
  FlashAttention - Fused attention (third-party, CUDA-specific)

ℹ️ Note: The ecosystem is the moat. Each library in CUDA’s stack has been optimized by hundreds of engineers over years. A competitor must either replicate this entire stack or provide automated translation. Neither is straightforward.

AMD ROCm and HIP

HIP: The CUDA Compatibility Layer

AMD’s HIP (Heterogeneous-compute Interface for Portability) provides a CUDA-like programming model that can compile for both AMD and NVIDIA GPUs:

// HIP kernel: nearly identical syntax to CUDA
#include <hip/hip_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    // API is nearly 1:1 with CUDA (prefix changed from cuda to hip)
    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, n * sizeof(float));
    hipMalloc(&d_b, n * sizeof(float));
    hipMalloc(&d_c, n * sizeof(float));

    hipMemcpy(d_a, h_a, n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b, n * sizeof(float), hipMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    hipMemcpy(h_c, d_c, n * sizeof(float), hipMemcpyDeviceToHost);

    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
    return 0;
}

// Compilation:
// AMD GPU: hipcc vector_add.cpp -o vector_add
// NVIDIA GPU: HIP_PLATFORM=nvidia hipcc vector_add.cpp -o vector_add (requires the CUDA toolkit)

Key Differences from CUDA

HIP/ROCm vs CUDA differences:

1. Warp size:
   CUDA: warp = 32 threads (NVIDIA standard since G80)
   ROCm: wavefront = 64 threads on CDNA (MI-series); RDNA can also run wave32
   Impact: warp-level primitives (__shfl, __ballot) operate on different widths
   HIP provides: warpSize variable (64 on AMD, 32 on NVIDIA)

2. Shared memory:
   CUDA: configurable L1/shared memory split (up to 228 KB on H100)
   ROCm: LDS (Local Data Share) is separate from L1 (64 KB per CU on MI300X)
   Impact: algorithms tuned for large shared memory may need adjustment

3. Memory model:
   CUDA: relaxed memory model with explicit fences
   ROCm: similar model, but different cache hierarchy
   MI300X: 256 MB Infinity Cache (much larger than NVIDIA L2)

4. Tensor cores vs Matrix cores:
   CUDA: wmma/mma PTX instructions, well-documented
   ROCm: MFMA (Matrix Fused Multiply-Add) instructions
   HIP provides: rocWMMA library (limited compared to CUTLASS)

5. Library availability:
   cuBLAS -> rocBLAS (good parity)
   cuDNN -> MIOpen (partial parity, some ops missing)
   cuSPARSE -> rocSPARSE (reasonable parity)
   NCCL -> RCCL (reasonable parity)
   TensorRT -> N/A (no equivalent)
   Nsight Compute -> rocprof (less polished)
   CUTLASS -> composable_kernel (AMD's equivalent, less mature)
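To make difference 1 concrete, here is a small Python model (not GPU code) of a shuffle-based tree reduction over one warp’s lanes. Hardcoding the width to 32, as much CUDA code does, silently drops half of a 64-lane AMD wavefront:

```python
import math

def shuffle_reduce_steps(warp_size: int) -> int:
    """Rounds of __shfl_down needed to reduce one warp/wavefront."""
    return int(math.log2(warp_size))

def reduce_warp(values, hardcoded_width=None):
    """Model of a shuffle-based tree reduction; the result lands in lane 0."""
    width = hardcoded_width or len(values)   # len(values) stands in for warpSize
    vals = list(values)
    offset = width // 2
    while offset > 0:
        # Each lane adds the value `offset` lanes above it, as __shfl_down would.
        for lane in range(len(vals)):
            if lane + offset < len(vals):
                vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]

# NVIDIA warp: 5 rounds; AMD CDNA wavefront: 6 rounds
assert shuffle_reduce_steps(32) == 5 and shuffle_reduce_steps(64) == 6

assert reduce_warp([1] * 64) == 64                      # width taken from warpSize
assert reduce_warp([1] * 64, hardcoded_width=32) == 32  # silent wrong answer on AMD
```

This is why HIP exposes warpSize as a variable rather than a compile-time constant of 32.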
# hipify-perl: automated CUDA-to-HIP source translation
# Converts cuda* API calls to hip* equivalents

# Before (CUDA):
# cudaMalloc(&d_ptr, size);
# cudaMemcpy(d_ptr, h_ptr, size, cudaMemcpyHostToDevice);
# kernel<<<grid, block>>>(d_ptr);
# cudaDeviceSynchronize();

# After (HIP, via hipify):
# hipMalloc(&d_ptr, size);
# hipMemcpy(d_ptr, h_ptr, size, hipMemcpyHostToDevice);
# kernel<<<grid, block>>>(d_ptr);
# hipDeviceSynchronize();

# hipify handles ~95% of straightforward CUDA code automatically.
# The remaining 5% requires manual work:
# - Inline PTX assembly -> must be rewritten for AMDGPU ISA
# - CUDA-specific intrinsics (__ldg, __stcs) -> different AMD equivalents
# - Warp-size-dependent code -> needs warpSize abstraction
# - CUTLASS templates -> must use composable_kernel or write custom
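For intuition, the mechanical renaming pass can be mimicked with a single regex in Python. This is a toy sketch only; the real hipify-perl also rewrites headers, cuBLAS/cuDNN calls, and driver-API names:

```python
import re

# Toy model of hipify's renaming pass: cudaFoo -> hipFoo.
CUDA_API = re.compile(r"\bcuda([A-Z]\w*)")

def hipify_line(line: str) -> str:
    """Rename runtime-API identifiers; everything else passes through."""
    return CUDA_API.sub(r"hip\1", line)

print(hipify_line("cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);"))
# -> hipMemcpy(d, h, n, hipMemcpyHostToDevice);

# Kernel launch syntax is identical in HIP, so it needs no translation:
print(hipify_line("kernel<<<grid, block>>>(d_ptr);"))
# -> kernel<<<grid, block>>>(d_ptr);
```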

ROCm Library Maturity vs CUDA

Category | CUDA Library | ROCm Equivalent | Parity Level
BLAS | cuBLAS | rocBLAS | 95% (excellent)
DNN | cuDNN | MIOpen | 75% (gaps in attention, newer ops)
Sparse | cuSPARSE | rocSPARSE | 85% (good)
Collective Comms | NCCL | RCCL | 85% (good, some topology gaps)
Inference Engine | TensorRT / TensorRT-LLM | N/A | 0% (no equivalent)
GEMM Templates | CUTLASS | composable_kernel | 50% (early stage)
Profiler | Nsight Compute | rocprof / omniperf | 60%
Framework Support | PyTorch (native) | PyTorch (ROCm build) | 85%

Apple Metal Compute

Metal Performance Shaders and MPS

Apple’s Metal API provides GPU compute access on Apple Silicon (M1-M4 chips) with unified memory architecture:

// Metal compute shader (MSL - Metal Shading Language)
// File: vector_add.metal
#include <metal_stdlib>
using namespace metal;

kernel void vector_add(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float* c [[buffer(2)]],
    uint idx [[thread_position_in_grid]]
) {
    c[idx] = a[idx] + b[idx];
}
# PyTorch MPS backend (Apple Silicon)
import torch

# Use MPS device
device = torch.device("mps")

# Tensors on GPU
a = torch.randn(1000000, device=device)
b = torch.randn(1000000, device=device)
c = a + b  # Computed on Apple GPU

# LLM inference on MPS:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
).to("mps")
# Works, but slower than CUDA due to:
# - No tensor cores (Apple GPUs use generic SIMD units, not dedicated matrix hardware)
# - Limited memory bandwidth (M2 Ultra: 800 GB/s vs H100: 3,350 GB/s)
# - MPS backend missing many fused operators
# - No FlashAttention for MPS (a separate Metal implementation is needed)

Metal Limitations for AI

Apple Metal for AI workloads:

Advantages:
  - Unified memory: no CPU-GPU copies (M2 Ultra: up to 192 GB shared)
  - Power efficiency: M2 Ultra ~100W vs H100 700W
  - macOS integration: runs natively on developer machines
  - llama.cpp Metal backend: reasonable LLM inference performance

Limitations:
  - No dedicated tensor cores (uses generic SIMD units)
  - FP16 compute throughput: M2 Ultra ~27 TFLOPS vs H100 990 TFLOPS (37x gap)
  - Memory bandwidth: M2 Ultra ~800 GB/s vs H100 3,350 GB/s (4x gap)
  - No multi-GPU support (each Apple Silicon is one chip)
  - No datacenter deployment (no rack-mountable Apple GPU servers)
  - Limited training support (no distributed training frameworks)
  - PyTorch MPS backend has operator coverage gaps
  - No equivalent to cuBLAS, cuDNN, NCCL, TensorRT

Practical use case:
  - Local inference on MacBooks/Mac Studios with llama.cpp
  - Development and testing before deploying to CUDA GPUs
  - NOT suitable for production serving or training
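The bandwidth figures above bound decode throughput directly: batch-1 LLM decode is memory-bound, so tokens/sec cannot exceed bandwidth divided by the bytes streamed per token, roughly the model’s weight footprint. A back-of-envelope sketch, assuming FP16 weights for a 7B model (~14 GB) and ignoring KV-cache traffic; real throughput lands below these ceilings:

```python
def decode_ceiling_tok_s(mem_bandwidth_gb_s: float, weight_gb: float) -> float:
    """Bandwidth-roofline bound: each decoded token streams all weights once."""
    return mem_bandwidth_gb_s / weight_gb

WEIGHTS_FP16_7B_GB = 14  # ~2 bytes/param * 7B params

m2_ultra_bound = decode_ceiling_tok_s(800, WEIGHTS_FP16_7B_GB)   # ~57 tok/s
h100_bound = decode_ceiling_tok_s(3350, WEIGHTS_FP16_7B_GB)      # ~239 tok/s
```

The same arithmetic explains why quantization (fewer bytes per parameter) raises the ceiling on every backend.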

Vulkan Compute

Cross-Platform GPU Access

Vulkan compute shaders provide GPU access on virtually every platform (Windows, Linux, macOS via MoltenVK, Android, iOS):

// Vulkan compute shader (GLSL)
#version 450

layout(set = 0, binding = 0) buffer InputA { float a[]; };
layout(set = 0, binding = 1) buffer InputB { float b[]; };
layout(set = 0, binding = 2) buffer Output { float c[]; };

layout(local_size_x = 256) in;

void main() {
    uint idx = gl_GlobalInvocationID.x;
    c[idx] = a[idx] + b[idx];
}
// Vulkan compute dispatch (C++ host code, simplified)
// The full Vulkan setup requires ~500 lines of boilerplate:
// 1. Create instance, physical device, logical device
// 2. Create compute pipeline with shader module
// 3. Allocate buffers and descriptor sets
// 4. Record command buffer with dispatch command
// 5. Submit to compute queue and wait

// Key dispatch code:
vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdBindDescriptorSets(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE,
                         pipelineLayout, 0, 1, &descriptorSet, 0, nullptr);
vkCmdDispatch(cmdBuffer, (n + 255) / 256, 1, 1);
vkQueueSubmit(computeQueue, 1, &submitInfo, fence);
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

Vulkan for AI: llama.cpp’s Vulkan Backend

llama.cpp uses Vulkan compute for cross-platform inference, making LLMs accessible on AMD, Intel, and even mobile GPUs:

llama.cpp Vulkan backend:
  - Matrix multiply via Vulkan compute shaders
  - Minimal tensor core use (VK_KHR_cooperative_matrix exists, but support and tuning are uneven)
  - Uses FP16 compute when supported (VK_KHR_shader_float16_int8)
  - Subgroup operations for warp-level primitives
  - Works on: NVIDIA, AMD, Intel, Qualcomm Adreno, Apple (MoltenVK)

Performance vs CUDA (Llama-7B, INT4 quantized, decode BS=1):
  - CUDA (RTX 4090):    ~130 tok/s
  - Vulkan (RTX 4090):  ~85 tok/s   (65% of CUDA)
  - Vulkan (RX 7900 XTX): ~70 tok/s
  - Vulkan (Arc A770):  ~35 tok/s
  - Metal (M2 Ultra):   ~45 tok/s

The ~35% performance gap vs CUDA comes from:
  1. Little tensor core use (cuBLAS INT4 paths are built on tensor cores)
  2. Generic compute shaders vs hand-optimized CUDA kernels
  3. Vulkan dispatch overhead (more verbose command submission)
  4. Less control over shared memory layout (bank conflicts are harder to avoid)
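The relative percentages quoted in this section are simple ratios of the measured decode rates:

```python
def relative_perf(tok_s: float, baseline_tok_s: float) -> int:
    """Throughput as a rounded percentage of a baseline backend on the same GPU."""
    return round(100 * tok_s / baseline_tok_s)

vulkan_vs_cuda_4090 = relative_perf(85, 130)   # -> 65 (% of CUDA)
vulkan_vs_rocm_7900 = relative_perf(70, 92)    # -> 76 (% of ROCm)
```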

LLM Inference Throughput by Programming Model (Llama-7B INT4, decode BS=1):

Backend (GPU) | tokens/sec | Notes
CUDA (RTX 4090) | 130 | cuBLAS + tensor cores
Vulkan (RTX 4090) | 85 | 65% of CUDA on the same GPU
ROCm (RX 7900 XTX) | 92 | rocBLAS + matrix cores
Vulkan (RX 7900 XTX) | 70 | 76% of ROCm on the same GPU
Metal (M2 Ultra) | 45 | Unified memory helps
Vulkan (Arc A770) | 35 | Intel XMX limited

SYCL and oneAPI

Intel’s Cross-Platform Abstraction

SYCL, a Khronos Group standard whose most active implementation is Intel’s DPC++ (part of oneAPI), provides a single-source C++ programming model that targets multiple backends (CUDA, ROCm, Level Zero/Intel GPUs, OpenCL):

// SYCL kernel: single-source C++
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    float* a = sycl::malloc_device<float>(n, q);
    float* b = sycl::malloc_device<float>(n, q);
    float* c = sycl::malloc_device<float>(n, q);

    q.memcpy(a, h_a, n * sizeof(float));
    q.memcpy(b, h_b, n * sizeof(float));
    q.wait();

    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
        c[idx] = a[idx] + b[idx];
    }).wait();

    q.memcpy(h_c, c, n * sizeof(float));
    q.wait();

    sycl::free(a, q);
    sycl::free(b, q);
    sycl::free(c, q);

    return 0;
}

// Compile for different targets:
// Intel GPU:  icpx -fsycl -fsycl-targets=spir64_gen vector_add.cpp
// NVIDIA GPU: icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda vector_add.cpp
// AMD GPU:    icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa vector_add.cpp

// SYCL advantages:
// - Single source code for all GPU vendors
// - C++ standard-like interface (lambdas, ranges)
// - Automatic memory management options (USM, buffers)
//
// SYCL disadvantages:
// - Performance typically 70-90% of native (CUDA or ROCm)
// - Limited library ecosystem (oneMKL wraps cuBLAS/rocBLAS)
// - Vendor-specific optimizations not accessible
// - Debugging and profiling tools less mature

Performance Comparison Matrix


GPU Programming Model Comparison for AI Workloads

Feature | CUDA | ROCm/HIP | Metal | Vulkan | SYCL
Vendor | NVIDIA only | AMD (+ NVIDIA via HIP) | Apple only | All vendors | All (via backends)
Tensor core access | Full (wmma, mma, wgmma) | Full (MFMA) | None | Minimal (cooperative matrix ext.) | Via backend libs
GEMM library | cuBLAS (world-class) | rocBLAS (good) | MPSGraph (basic) | None (manual) | oneMKL (wraps native)
LLM framework support | All (vLLM, TRT-LLM, etc.) | vLLM, some others | llama.cpp, MLX | llama.cpp | Limited
Profiling | Nsight (excellent) | rocprof (adequate) | Metal GPU Profiler | RenderDoc (limited) | Intel VTune
Typical perf vs CUDA | 100% (reference) | 85-95% | 10-15% | 55-70% | 70-90%
Production readiness | Mature | Improving rapidly | Development only | Hobbyist/edge | Early

Practical Decision Framework

# Which programming model to use for AI workloads

def choose_gpu_platform(requirements):
    """Decision tree for GPU programming model selection."""
    workload = requirements.get("workload")
    hardware = requirements.get("hardware", "")
    priority = requirements.get("priority", "")

    if workload == "training":
        if requirements.get("scale") == "large" or requirements.get("multi_gpu"):
            return "CUDA"  # Only viable option for large-scale training
        if hardware == "AMD MI300X":
            return "ROCm"  # AMD now viable for single-node training
        return "CUDA"

    if workload == "inference_serving":
        if priority == "maximum_throughput":
            return "CUDA"  # TensorRT-LLM, vLLM with CUDA
        if hardware == "AMD":
            return "ROCm"  # vLLM has ROCm support
        if priority == "portability":
            return "llama.cpp (CUDA/Metal/Vulkan auto-detected)"
        return "CUDA"

    if workload == "local_inference":
        if hardware == "Apple Silicon":
            return "Metal (via llama.cpp or MLX)"
        if hardware == "AMD consumer":
            return "Vulkan or ROCm"
        if hardware == "NVIDIA consumer":
            return "CUDA"
        return "Vulkan (universal fallback)"

    # research_prototyping and anything unlisted: largest ecosystem, most examples
    return "CUDA (PyTorch)"

# e.g. choose_gpu_platform({"workload": "local_inference",
#                           "hardware": "Apple Silicon"})
# -> "Metal (via llama.cpp or MLX)"

💡 Tip: For production AI inference, the practical choice today is CUDA for NVIDIA GPUs and ROCm for AMD datacenter GPUs (MI250X, MI300X). Everything else is either not production-ready (Metal, SYCL) or incurs a significant performance penalty (Vulkan). For local/edge inference on diverse hardware, llama.cpp’s multi-backend approach (CUDA + Metal + Vulkan) provides the broadest coverage with reasonable performance.

The Future: Will CUDA’s Dominance Persist?

Factors maintaining CUDA dominance:
  1. Ecosystem lock-in: 17 years of libraries, tutorials, Stack Overflow answers
  2. PyTorch is CUDA-first: most operators tested primarily on CUDA
  3. TensorRT-LLM: no equivalent on other platforms
  4. NCCL: best multi-GPU communication library
  5. Developer familiarity: most GPU programmers learned CUDA first

Factors eroding CUDA dominance:
  1. AMD MI300X is hardware-competitive (192 GB HBM3, 5.3 TB/s)
  2. ROCm/HIP can auto-translate most CUDA code
  3. Cloud providers negotiating AMD GPU contracts (cost pressure)
  4. Triton (OpenAI) compiles to both CUDA PTX and AMD AMDGPU ISA
  5. llama.cpp proves Vulkan/Metal viable for inference
  6. MLIR/StableHLO provide hardware-agnostic intermediate representations

Timeline estimate:
  - 2024-2025: CUDA >90% of AI compute, AMD gaining in inference
  - 2026-2028: CUDA ~80%, AMD ~15%, others ~5%
  - Long term: hardware-agnostic compilers (Triton, XLA) may abstract away
    the programming model, making the GPU ISA less relevant

Summary

CUDA dominates GPU computing for AI through a combination of hardware excellence and an unmatched software ecosystem. ROCm/HIP is the closest competitor, providing 85-95% of CUDA performance with good library coverage and a source-level compatibility layer. Metal provides functional local inference on Apple Silicon but lacks the throughput for production serving. Vulkan offers universal portability at 55-70% of native performance. SYCL provides a clean programming abstraction but lacks ecosystem depth. For production AI workloads today, CUDA remains the default choice, with ROCm as a viable alternative for AMD datacenter GPUs. The long-term trend points toward hardware-agnostic compiler frameworks (Triton, XLA, MLIR) that abstract the programming model, potentially reducing CUDA’s ecosystem advantage.