Part of the series: GPU Hardware & AI Accelerators

Introduction

A scalar loop summing 1,000 floats executes 1,000 add instructions at perhaps 2-4 cycles each—2,000-4,000 cycles total. An AVX2 loop processes 8 floats per instruction, dropping to 125 instructions—250-500 cycles with the same latency. An AVX-512 loop processes 16 floats per instruction—63 instructions, 126-252 cycles. A GPU warp processes 32 elements per instruction with 32-thread parallelism, achieving effective throughput of 1,000+ elements in under 50 cycles when memory-bound. The performance gap between scalar code and vector code is not 10-20%—it is 10-80x depending on the workload. Every performance-sensitive codebase in production—from JPEG decoders to neural network inference to physics simulations—uses vector processing. The only question is which variant: CPU SIMD (SSE, AVX, NEON) or GPU SIMT (warps, wavefronts).

This post covers the full vector processing landscape: CPU SIMD from SSE through AVX-512 and ARM NEON/SVE, GPU SIMT execution with warps, when compilers auto-vectorize successfully (and when they fail), when to drop to intrinsics, and the critical threshold where CPU SIMD stops being sufficient and you need a GPU.

The Concept: Why Vector Processing Exists

A scalar processor executes one operation per instruction: one add, one multiply, one load. A vector processor executes the same operation on multiple data elements simultaneously. If you need to add two arrays of 1,000 floats, a scalar processor issues 1,000 add instructions. An AVX2 processor issues 125 instructions (8 floats per instruction). An AVX-512 processor issues 63 instructions (16 floats per instruction). A GPU warp processes 32 elements per instruction.

The hardware cost of processing N elements in parallel is far less than N times the cost of processing one element. The control logic (instruction fetch, decode, scheduling) is shared across all lanes. Only the execution units and register file scale with width. This is why vector processing provides such high throughput per watt and per transistor.

📊

Vector Width Across Architectures

Architecture | Register Width | Floats/Op | Int32s/Op | Int8s/Op | Era
--- | --- | --- | --- | --- | ---
SSE/SSE2 | 128 bits | 4 | 4 | 16 | 1999-2001
AVX/AVX2 | 256 bits | 8 | 8 | 32 | 2011-2013
AVX-512 | 512 bits | 16 | 16 | 64 | 2017+
ARM NEON | 128 bits | 4 | 4 | 16 | 2011+ (ARMv7/v8)
ARM SVE/SVE2 | 128-2048 bits | Scalable | Scalable | Scalable | 2020+
RISC-V V ext. | Variable | Scalable | Scalable | Scalable | 2021+
GPU warp (NVIDIA) | 32 threads | 32 | 32 | 128 (INT8 tensor) | 2006+
GPU wavefront (AMD) | 64 threads | 64 | 64 | 256 | 2011+

CPU SIMD: SSE, AVX2, AVX-512, ARM NEON

SSE/SSE2 (128-bit)

SSE (Streaming SIMD Extensions) introduced 128-bit XMM registers that hold 4 single-precision floats or 2 doubles. SSE2 added integer operations and double-precision support. Every x86-64 processor supports SSE2 — it is the baseline for modern x86.

#include <xmmintrin.h>  // SSE
#include <emmintrin.h>  // SSE2

// SSE: 4-wide float add
void add_arrays_sse(float* out, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_store_ps(&out[i], vc);
    }
}

SSE is still relevant for code that must run on the widest range of x86 hardware. For portable libraries and system-level code, SSE2 is often the target.
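The loop above assumes n is a multiple of 4 and that the pointers are 16-byte aligned. A common production pattern handles the remainder with a scalar tail and uses unaligned loads so callers need not align buffers. A sketch (the function name add_arrays_sse_tail is illustrative; the x86 guard lets the same file compile elsewhere):

```c
#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE is baseline on x86-64
#endif

// 4-wide add with a scalar tail for the last n % 4 elements.
// _mm_loadu_ps/_mm_storeu_ps are the unaligned variants.
void add_arrays_sse_tail(float* out, const float* a, const float* b, int n) {
    int i = 0;
#if defined(__x86_64__) || defined(_M_X64)
    for (; i + 4 <= n; i += 4) {
        __m128 vc = _mm_add_ps(_mm_loadu_ps(&a[i]), _mm_loadu_ps(&b[i]));
        _mm_storeu_ps(&out[i], vc);
    }
#endif
    for (; i < n; i++)  // scalar tail (or full fallback off x86)
        out[i] = a[i] + b[i];
}
```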

AVX2 (256-bit)

AVX2 doubled the register width to 256 bits (8 floats) and added FMA (fused multiply-add) instructions. FMA computes a × b + c in a single instruction with a single rounding, which doubles peak floating-point throughput compared to separate multiply and add.

#include <immintrin.h>

// AVX2 dot product with FMA
float dot_product_avx2(const float* a, const float* b, int n) {
    __m256 sum = _mm256_setzero_ps();

    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        __m256 vb = _mm256_load_ps(&b[i]);
        sum = _mm256_fmadd_ps(va, vb, sum);  // sum += a * b (fused)
    }

    // Horizontal reduction: sum all 8 lanes
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 lo = _mm256_castps256_ps128(sum);
    __m128 s = _mm_add_ps(hi, lo);           // 4 elements
    s = _mm_hadd_ps(s, s);                    // 2 elements
    s = _mm_hadd_ps(s, s);                    // 1 element
    return _mm_cvtss_f32(s);
}

AVX2 is the sweet spot for most x86 SIMD code today. It is supported by all processors from Intel Haswell (2013) and AMD Zen 2 (2019) onward, which covers essentially all server and desktop hardware in production.
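Because AVX2 still cannot be assumed on every x86 CPU, libraries typically check at runtime and dispatch to the widest kernel available. A minimal sketch using the GCC/Clang builtin __builtin_cpu_supports (pick_isa is an illustrative name):

```c
// Pick the widest SIMD tier the running CPU supports.
// __builtin_cpu_supports is a GCC/Clang builtin (x86 only).
const char* pick_isa(void) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f")) return "avx512";
    if (__builtin_cpu_supports("avx2"))    return "avx2";
    return "sse2";  // guaranteed baseline on x86-64
#else
    return "scalar";  // non-x86 target or unknown compiler
#endif
}
```

A real library would resolve a function pointer per kernel once at startup (or use GCC's ifunc mechanism) rather than re-checking on every call.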

FMA Is Critical for Numerical Workloads

FMA doubles the peak FLOP rate compared to separate multiply and add. A Zen 4 core can execute two 256-bit FMA instructions per cycle, yielding 2 × 8 × 2 = 32 FLOPs per cycle per core (where the factor of 2 in FMA counts both the multiply and the add). Without FMA, peak throughput halves. Always use FMA when accumulating products (dot products, matrix multiply, convolutions).
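The "single rounding" property is observable, not just a throughput detail. The sketch below constructs inputs where multiply-then-add loses the product's low bits but a fused operation keeps them (the single rounding is emulated here by computing in double; fmaf(a, a, c) behaves the same way):

```c
// a = 1 + 2^-13 is exact in float; a*a = 1 + 2^-12 + 2^-26.
// Rounding a*a to float drops the 2^-26 term, so (a*a) + c == 0.
// A fused multiply-add rounds only once and returns 2^-26.
float fused_vs_separate(float* separate) {
    float a = 1.0f + 0x1p-13f;
    float c = -(1.0f + 0x1p-12f);
    volatile float prod = a * a;         // intermediate rounding: 2^-26 lost
    *separate = prod + c;                // == 0.0f
    return (float)((double)a * a + c);   // single rounding: == 2^-26
}
```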

AVX-512 (512-bit)

AVX-512 doubles the width again to 512 bits (16 floats) and adds powerful features: predicated execution with mask registers, gather/scatter instructions, and conflict detection. However, AVX-512 support is fragmented:

  • Intel Skylake-SP (2017): First server support
  • Intel Ice Lake (2019): Desktop support
  • AMD Zen 4 (2022): First AMD support (256-bit execution internally — two 256-bit uops per 512-bit instruction)
  • Intel Alder Lake/Raptor Lake: Disabled on efficiency cores
#include <immintrin.h>

// AVX-512 masked conditional processing
void threshold_avx512(float* out, const float* in, float thresh, int n) {
    __m512 vthresh = _mm512_set1_ps(thresh);

    for (int i = 0; i < n; i += 16) {
        __m512 v = _mm512_load_ps(&in[i]);

        // Create mask: which elements exceed threshold?
        __mmask16 mask = _mm512_cmp_ps_mask(v, vthresh, _CMP_GT_OQ);

        // Store only elements that pass the threshold
        // Others remain unchanged in output
        _mm512_mask_storeu_ps(&out[i], mask, v);
    }
}

// AVX-512 gather: load from non-contiguous addresses
void gather_example(float* out, const float* table, const int* indices, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512i vidx = _mm512_load_si512(&indices[i]);
        __m512 gathered = _mm512_i32gather_ps(vidx, table, 4);  // scale=4 (float)
        _mm512_store_ps(&out[i], gathered);
    }
}
📊

AVX-512 vs AVX2 Performance

Operation | AVX2 (ns/1M elems) | AVX-512 (ns/1M elems) | Speedup | Notes
--- | --- | --- | --- | ---
Array add | 125 | 68 | 1.84x | Near-ideal 2x
Dot product | 210 | 115 | 1.83x | FMA bandwidth limited
Masked store | 380 (emulated) | 140 | 2.71x | Native masking wins big
Gather (random) | 850 | 520 | 1.63x | Memory-bound, less benefit
GEMM (256x256) | 4200 | 2400 | 1.75x | Compute-bound

Note: Measured on Intel Sapphire Rapids. AMD Zen 4 AVX-512 is internally 256-bit, so speedup is lower (~1.1-1.3x over AVX2).
⚠️ AVX-512 Frequency Throttling

On older Intel CPUs (Skylake, Cascade Lake), executing AVX-512 instructions causes the CPU to reduce its clock frequency by 100-200 MHz to stay within power limits. This means AVX-512 code must provide at least 10-15% more throughput per cycle to break even after the frequency penalty. On newer Intel CPUs (Sapphire Rapids, Granite Rapids) and AMD Zen 4+, the throttling is reduced or eliminated.

ARM NEON (128-bit)

ARM NEON provides 128-bit SIMD on all ARMv8-A processors (Cortex-A53+, Apple M-series, Graviton). NEON is mandatory in ARMv8, so unlike x86 where you must check for SSE/AVX support, you can always assume NEON is available on 64-bit ARM.

#include <arm_neon.h>

// NEON FIR filter
void fir_filter_neon(const float* input, float* output,
                     const float* coeffs, int n, int taps) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);

        for (int t = 0; t < taps; t++) {
            float32x4_t in_vec = vld1q_f32(&input[i + t]);
            float32x4_t co_val = vdupq_n_f32(coeffs[t]);  // Broadcast
            acc = vfmaq_f32(acc, in_vec, co_val);           // FMA
        }

        vst1q_f32(&output[i], acc);
    }
}

// NEON dot product
float dot_product_neon(const float* a, const float* b, int n) {
    float32x4_t sum = vdupq_n_f32(0.0f);

    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        sum = vfmaq_f32(sum, va, vb);
    }

    // Horizontal sum
    return vaddvq_f32(sum);  // ARMv8 has vaddvq for horizontal reduction
}

NEON is particularly important for mobile and edge computing. Apple’s M-series chips, AWS Graviton processors, and Qualcomm Snapdragon all use NEON for compute-intensive workloads. The instruction set is clean and orthogonal, making it arguably more pleasant to write intrinsics for than x86 SSE/AVX.
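Because NEON availability is knowable at compile time, a common pattern is one function with a NEON body and a scalar fallback, so the same file builds on x86 and ARM. A sketch (dot_product_portable is an illustrative name; the __aarch64__ guard is used because vaddvq_f32 is AArch64-only):

```c
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Dot product: NEON 4-wide on AArch64, plain scalar elsewhere.
// The scalar loop also handles the n % 4 tail on ARM.
float dot_product_portable(const float* a, const float* b, int n) {
    float sum = 0.0f;
    int i = 0;
#if defined(__aarch64__)
    float32x4_t vsum = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        vsum = vfmaq_f32(vsum, vld1q_f32(&a[i]), vld1q_f32(&b[i]));
    sum = vaddvq_f32(vsum);  // horizontal reduction
#endif
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```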

ARM SVE/SVE2 (Scalable)

SVE (Scalable Vector Extension) breaks from the fixed-width SIMD tradition. Instead of specifying a register width, SVE code operates on vectors of implementation-defined length. The same binary works on hardware with 128-bit to 2048-bit vector registers.

#include <arm_sve.h>

// SVE vector add -- works at any vector length
void add_arrays_sve(float* out, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += svcntw()) {  // svcntw: floats per vector
        svbool_t pg = svwhilelt_b32(i, n);     // Predicate: active lanes
        svfloat32_t va = svld1(pg, &a[i]);
        svfloat32_t vb = svld1(pg, &b[i]);
        svfloat32_t vc = svadd_f32_m(pg, va, vb);
        svst1(pg, &out[i], vc);
    }
}

SVE’s key innovation is the predicate-first programming model. Every operation takes a predicate mask that controls which lanes are active. This cleanly handles loop tails (when n is not a multiple of the vector length) without separate scalar cleanup code. AWS Graviton3 implements SVE with 256-bit vectors; Fujitsu’s A64FX (used in the Fugaku supercomputer) implements it with 512-bit vectors.

📊

SIMD Instruction Set Comparison Summary

ISA | Width | FMA? | Masking | Gather/Scatter | Best For
--- | --- | --- | --- | --- | ---
SSE2 | 128b | No | Emulated | No | Baseline compatibility
AVX2+FMA | 256b | Yes | Emulated | Gather only | General-purpose SIMD
AVX-512 | 512b | Yes | Native (k-masks) | Yes | HPC, AI inference
NEON | 128b | Yes (v8) | Limited | No | ARM mobile/server
SVE/SVE2 | Scalable | Yes | Native (predicates) | Yes | Portable ARM HPC

GPU SIMT: Warps and Wavefronts

GPUs take vector processing to an extreme. Instead of wide registers operated on by a single thread, GPUs use many threads that execute in lockstep groups called warps (NVIDIA, 32 threads) or wavefronts (AMD, 32 or 64 threads).

How SIMT Works

In SIMT (Single Instruction, Multiple Threads), each thread has its own registers and program counter (conceptually), but threads in a warp execute the same instruction at the same time. When all threads in a warp take the same branch, execution is efficient. When threads diverge (different threads take different branches), the warp executes both paths serially, masking out inactive threads.

// GPU kernel: each thread processes one element
__global__ void vector_add(float* out, const float* a, const float* b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = a[idx] + b[idx];
    }
}
// Launched with: vector_add<<<(n+255)/256, 256>>>(out, a, b, n);
// Each warp (32 threads) processes 32 elements simultaneously

The key difference from CPU SIMD: in CPU SIMD, the programmer explicitly loads data into vector registers and calls vector operations. In GPU SIMT, the programmer writes scalar code for one element, and the hardware automatically executes it across 32 threads. The parallelism is implicit in the programming model.

CPU SIMD (explicit vectorization):
  Thread 0: load 8 floats -> multiply 8 floats -> store 8 floats

GPU SIMT (implicit vectorization):
  Threads 0-31: each loads 1 float -> each multiplies 1 float -> each stores 1 float
  (executed simultaneously by warp scheduler)

Warp Divergence: The GPU’s Branch Problem

CPU SIMD handles branches by computing both paths and blending with masks (AVX-512 does this natively). GPU SIMT handles branches by executing both paths serially with thread masking:

__global__ void conditional_kernel(float* out, const float* in, float thresh, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (in[idx] > thresh) {        // Warp divergence if some threads go each way
            out[idx] = in[idx] * 2.0f;  // Path A: active threads execute
        } else {
            out[idx] = in[idx] * 0.5f;  // Path B: active threads execute
        }
    }
}
// Worst case: 50% of warp takes each branch = 50% utilization
// Best case: all threads take same branch = 100% utilization
📊

Warp Divergence Impact

Divergence Pattern | Active Threads (avg) | Throughput vs No Divergence
--- | --- | ---
No divergence (all same path) | 32/32 | 100%
2-way, 50/50 split | 16/32 | ~50%
4-way divergence | 8/32 | ~25%
Fully random (per-thread) | 1/32 | ~3% (catastrophic)

Note: Divergence only matters within a warp. Different warps can take different paths with no penalty.
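The utilization numbers above follow from a simple model: the warp makes one serialized pass per distinct path its threads take, so utilization is 1 / (number of distinct paths). A sketch of that model in plain C (warp_utilization is an illustrative helper, not a CUDA API):

```c
#define WARP_SIZE 32

// Model: utilization = 1 / (distinct branch paths taken within the warp).
// path[t] is an ID for the branch path thread t takes.
double warp_utilization(const int path[WARP_SIZE]) {
    int seen[WARP_SIZE];
    int distinct = 0;
    for (int t = 0; t < WARP_SIZE; t++) {
        int is_new = 1;
        for (int j = 0; j < distinct; j++)
            if (seen[j] == path[t]) { is_new = 0; break; }
        if (is_new) seen[distinct++] = path[t];
    }
    return 1.0 / (double)distinct;
}
```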
ℹ️ SIMD vs SIMT: Same Idea, Different Abstraction

CPU SIMD and GPU SIMT are both implementations of data parallelism. The difference is where the parallelism is expressed:

  • CPU SIMD: Parallelism is explicit in the instruction (one instruction operates on N data elements in a vector register).
  • GPU SIMT: Parallelism is implicit in the thread model (N threads execute the same scalar instruction simultaneously).

Both have the same fundamental limitation: branchy, irregular code performs poorly because some lanes/threads are idle during divergent execution.

Auto-Vectorization by Compilers

Modern compilers (GCC, Clang, MSVC, ICC) can automatically convert scalar loops to SIMD instructions. This works well for simple patterns and poorly for complex ones.

What Auto-Vectorizes Well

// Perfect auto-vectorization candidate: independent, contiguous, uniform
void scale_array(float* __restrict__ out, const float* __restrict__ in,
                 float factor, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * factor;
    }
}
// gcc -O3 -mavx2: generates vmulps ymm (8-wide multiply)
// clang -O3 -mavx2: same, plus may unroll 2-4x

The __restrict__ qualifier tells the compiler that out and in do not overlap, enabling vectorization. Without it, the compiler must assume potential aliasing and may not vectorize.

What Does Not Auto-Vectorize

// Loop-carried dependency: cannot vectorize
void prefix_sum(float* out, const float* in, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; i++) {
        out[i] = out[i-1] + in[i];  // Depends on previous iteration
    }
}

// Conditional with side effects: difficult to vectorize
void conditional_accumulate(float* sum, const float* data,
                            const int* flags, int n) {
    for (int i = 0; i < n; i++) {
        if (flags[i]) {
            *sum += data[i];  // Reduction with conditional
        }
    }
}

// Indirect access (gather): vectorizes poorly before AVX2/AVX-512
void gather_lookup(float* out, const float* table,
                   const int* indices, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = table[indices[i]];  // Random access pattern
    }
}
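The prefix-sum dependency can be broken by restructuring the algorithm: scan each block independently (that part vectorizes), then propagate a per-block running carry. A plain-C sketch of the two-pass idea, with a block size of 8 standing in for one AVX2 vector:

```c
// Two-pass blocked scan: the local scans are independent across blocks
// (vectorizable/parallelizable); only the carry chain is sequential.
void prefix_sum_blocked(float* out, const float* in, int n) {
    const int B = 8;  // one SIMD vector's worth of elements
    float carry = 0.0f;
    for (int s = 0; s < n; s += B) {
        int e = (s + B < n) ? s + B : n;
        float acc = 0.0f;
        for (int i = s; i < e; i++) {   // pass 1: local inclusive scan
            acc += in[i];
            out[i] = acc;
        }
        for (int i = s; i < e; i++)     // pass 2: add carry from prior blocks
            out[i] += carry;
        carry = out[e - 1];             // running total through this block
    }
}
```

In a real SIMD version, pass 1 becomes log2(B) shift-and-add steps on a vector register; the same structure is how GPU scan kernels work at warp and block granularity.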
📊

Auto-Vectorization Success Rate by Pattern

Pattern | GCC Auto-vec? | Clang Auto-vec? | Manual Intrinsics Needed?
--- | --- | --- | ---
Simple array operation | Yes | Yes | No (auto-vec is optimal)
Reduction (sum, min, max) | Yes | Yes | Rarely
Conditional assignment | Sometimes | Yes (if-conversion) | Sometimes
Sliding window (FIR) | Partial | Partial | Yes (2x+ gain)
Prefix sum | No | No | Yes (parallel prefix)
AoS to SoA conversion | No | Sometimes | Yes
Bit manipulation | Sometimes | Sometimes | Often

Note: Test with -fopt-info-vec (GCC) or -Rpass=loop-vectorize (Clang) to see what the compiler vectorizes.

Verifying Auto-Vectorization

Always verify that the compiler actually vectorized your loop. Do not assume.

# GCC: show which loops were vectorized
gcc -O3 -mavx2 -fopt-info-vec-optimized mycode.c
# Output: mycode.c:12:5: optimized: loop vectorized using 32 byte vectors

# GCC: show which loops FAILED to vectorize and why
gcc -O3 -mavx2 -fopt-info-vec-missed mycode.c
# Output: mycode.c:20:5: missed: couldn't vectorize loop
#         mycode.c:20:5: missed: not vectorized: unsupported data-ref access

# Clang: similar but different flags
clang -O3 -mavx2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize mycode.c
💡 The 3 Requirements for Vectorization

SIMD (auto or manual) works when data is: (1) contiguous in memory — arrays, not linked lists; (2) independent — no loop-carried dependencies between iterations; (3) uniform — same operation on every element, minimal per-element branching. If any of these are missing, vectorization is limited or impossible.

Intrinsics for Manual Vectorization

When auto-vectorization fails or produces suboptimal code, intrinsics give you direct control over SIMD instructions while remaining in C/C++ (no assembly required).

Real-World Example: Vectorized Softmax

Softmax is ubiquitous in neural network inference. Here is a comparison of scalar, auto-vectorized, and hand-written AVX2 implementations:

// Scalar softmax (baseline)
void softmax_scalar(float* out, const float* in, int n) {
    // Find max (for numerical stability)
    float max_val = in[0];
    for (int i = 1; i < n; i++) {
        if (in[i] > max_val) max_val = in[i];
    }

    // Compute exp(x - max) and sum
    float sum = 0;
    for (int i = 0; i < n; i++) {
        out[i] = expf(in[i] - max_val);
        sum += out[i];
    }

    // Normalize
    float inv_sum = 1.0f / sum;
    for (int i = 0; i < n; i++) {
        out[i] *= inv_sum;
    }
}

// AVX2 softmax with fast exp approximation
#include <immintrin.h>

// Fast exp approximation (relative error < 0.1%)
static inline __m256 fast_exp_avx2(__m256 x) {
    // Clamp to avoid overflow/underflow
    x = _mm256_max_ps(x, _mm256_set1_ps(-88.0f));
    x = _mm256_min_ps(x, _mm256_set1_ps(88.0f));

    // exp(x) = 2^(x * log2(e))
    __m256 log2e = _mm256_set1_ps(1.44269504f);
    __m256 t = _mm256_mul_ps(x, log2e);

    // Split into integer and fractional parts
    __m256 ti = _mm256_round_ps(t, _MM_FROUND_TO_NEAREST_INT);
    __m256 tf = _mm256_sub_ps(t, ti);

    // 2^fraction using polynomial approximation
    __m256 p = _mm256_fmadd_ps(tf, _mm256_set1_ps(0.240226507f), _mm256_set1_ps(0.693147182f));
    p = _mm256_fmadd_ps(p, tf, _mm256_set1_ps(1.0f));

    // 2^integer using bit manipulation
    __m256i ii = _mm256_cvtps_epi32(ti);
    ii = _mm256_add_epi32(ii, _mm256_set1_epi32(127));
    ii = _mm256_slli_epi32(ii, 23);
    __m256 pow2i = _mm256_castsi256_ps(ii);

    return _mm256_mul_ps(p, pow2i);
}

void softmax_avx2(float* out, const float* in, int n) {
    // Pass 1: find max (8-wide)
    __m256 vmax = _mm256_load_ps(in);
    for (int i = 8; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&in[i]);
        vmax = _mm256_max_ps(vmax, v);
    }
    // Horizontal max reduction
    __m128 hi = _mm256_extractf128_ps(vmax, 1);
    __m128 lo = _mm256_castps256_ps128(vmax);
    __m128 m = _mm_max_ps(hi, lo);
    m = _mm_max_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1,0,3,2)));
    m = _mm_max_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(0,1,0,1)));
    __m256 max_val = _mm256_broadcastss_ps(m);

    // Pass 2: exp(x - max) and sum (8-wide)
    __m256 vsum = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&in[i]);
        __m256 shifted = _mm256_sub_ps(v, max_val);
        __m256 e = fast_exp_avx2(shifted);
        _mm256_store_ps(&out[i], e);
        vsum = _mm256_add_ps(vsum, e);
    }
    // Horizontal sum (same extract + hadd pattern as dot_product_avx2)
    __m128 shi = _mm256_extractf128_ps(vsum, 1);
    __m128 slo = _mm256_castps256_ps128(vsum);
    __m128 ss = _mm_add_ps(shi, slo);
    ss = _mm_hadd_ps(ss, ss);
    ss = _mm_hadd_ps(ss, ss);
    float sum = _mm_cvtss_f32(ss);
    __m256 inv = _mm256_set1_ps(1.0f / sum);

    // Pass 3: normalize (8-wide)
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&out[i]);
        _mm256_store_ps(&out[i], _mm256_mul_ps(v, inv));
    }
}
📊

Softmax Performance (n=4096 floats, Zen 4)

Implementation | Time (us) | Speedup | Accuracy
--- | --- | --- | ---
Scalar (gcc -O3) | 18.2 | 1.0x | Exact (libm expf)
Auto-vectorized (gcc -O3 -mavx2) | 6.8 | 2.7x | Exact
AVX2 intrinsics + fast exp | 2.1 | 8.7x | < 0.1% relative error
AVX-512 intrinsics + fast exp | 1.2 | 15.2x | < 0.1% relative error

Note: Auto-vectorization cannot use the fast exp approximation, limiting its speedup. The intrinsics version uses a polynomial approximation that is much faster than libm.

The 3.2x gap between auto-vectorization and manual intrinsics comes from the fast exp approximation. The compiler cannot replace expf() with a less-accurate-but-faster version — that is a semantic change that only the programmer can make. This is a common pattern: the biggest wins from intrinsics often come not from better SIMD usage, but from algorithmic changes that are only possible with manual control.
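The fast-exp trick is easier to study in scalar form. Below is a scalar transcription of fast_exp_avx2 with the same constants, useful for validating accuracy against expf before trusting the SIMD version (rounding is done with a cast here, which is adequate away from exact ties; no libm dependency):

```c
#include <stdint.h>
#include <string.h>

// Scalar twin of fast_exp_avx2: exp(x) = 2^(x*log2(e)), with 2^frac from a
// degree-2 polynomial and 2^int built by stuffing the float exponent field.
static float fast_exp_scalar(float x) {
    if (x < -88.0f) x = -88.0f;          // clamp as in the SIMD version
    if (x >  88.0f) x =  88.0f;
    float t  = x * 1.44269504f;          // x * log2(e)
    float ti = (float)(int)(t + (t >= 0 ? 0.5f : -0.5f));  // round to nearest
    float tf = t - ti;                   // fraction in [-0.5, 0.5]
    float p  = 0.240226507f * tf + 0.693147182f;
    p = p * tf + 1.0f;                   // polynomial approx of 2^tf
    uint32_t bits = (uint32_t)((int32_t)ti + 127) << 23;   // 2^ti
    float pow2i;
    memcpy(&pow2i, &bits, sizeof pow2i); // bit-cast int pattern to float
    return p * pow2i;
}
```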

Mapping the Same Algorithm: CPU SIMD vs GPU

Let us trace how a simple algorithm — element-wise ReLU (max(0, x)) — maps to different hardware:

Scalar CPU

void relu_scalar(float* out, const float* in, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] > 0 ? in[i] : 0;
    }
}
// 1 element per iteration, ~n cycles

AVX2 CPU (8-wide)

void relu_avx2(float* out, const float* in, int n) {
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&in[i]);
        __m256 result = _mm256_max_ps(v, zero);
        _mm256_store_ps(&out[i], result);
    }
}
// 8 elements per iteration, ~n/8 cycles (memory-bound for large n)

AVX-512 CPU (16-wide)

void relu_avx512(float* out, const float* in, int n) {
    __m512 zero = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        __m512 v = _mm512_load_ps(&in[i]);
        __m512 result = _mm512_max_ps(v, zero);
        _mm512_store_ps(&out[i], result);
    }
}
// 16 elements per iteration, ~n/16 cycles (almost certainly memory-bound)

CUDA GPU (32-wide warps)

__global__ void relu_gpu(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = fmaxf(in[idx], 0.0f);
    }
}
// 32 elements per warp per cycle, thousands of warps in flight
// n/32 warp-instructions, but many warps execute concurrently
📊

ReLU Performance Across Architectures (n=100M floats)

Implementation | Time (ms) | Bandwidth (GB/s) | % of Peak BW
--- | --- | --- | ---
Scalar (1 core) | 120 | 3.3 | 5%
AVX2 (1 core) | 16 | 25 | 38%
AVX2 (16 cores) | 1.8 | 222 | 76% (DDR5)
AVX-512 (1 core) | 10 | 40 | 61%
AVX-512 (16 cores) | 1.2 | 333 | ~95% (DDR5 peak)
A100 GPU | 0.14 | 2857 | 85% (HBM2e peak)
H100 GPU | 0.10 | 4000 | ~80% (HBM3 peak)

Note: ReLU is purely memory-bandwidth-bound. The GPU wins because HBM bandwidth is 10-15x higher than DDR5.

The GPU is 10x faster than a 16-core AVX-512 CPU for this memory-bound operation. The speedup comes entirely from bandwidth: the H100 has over 3 TB/s of HBM3 bandwidth versus ~300 GB/s of DDR5 bandwidth. For compute-bound operations where arithmetic intensity is high, the gap can be even larger (100-1000x for dense matrix multiply).

When CPU SIMD Is Enough vs When to Go to GPU

This is the critical decision for system architects. The answer depends on three factors: arithmetic intensity, data size, and latency requirements.

Arithmetic Intensity

Arithmetic intensity measures how much computation you do per byte of data moved. Low-intensity operations (element-wise, reductions) are memory-bandwidth-bound. High-intensity operations (matrix multiply, convolutions) are compute-bound.

CPU vs GPU Advantage by Arithmetic Intensity

Workload (arithmetic intensity) | GPU speedup over 16-core CPU
--- | ---
Element-wise (0.25 FLOP/byte, bandwidth-limited) | ~10x
Reduction (0.5 FLOP/byte) | ~8x
SpMV (1 FLOP/byte) | ~5x
Stencil (4 FLOP/byte) | ~15x
Convolution (10 FLOP/byte) | ~40x
GEMM (100+ FLOP/byte, compute-limited) | ~100x
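The roofline model makes this relationship quantitative: attainable throughput is the minimum of peak compute and intensity times memory bandwidth. A one-function sketch (the bandwidth and peak numbers in the usage note are illustrative assumptions, not measurements):

```c
// Roofline: attainable GFLOPS = min(peak_gflops, intensity * bandwidth).
double attainable_gflops(double intensity_flop_per_byte,
                         double bw_gbs, double peak_gflops) {
    double bw_bound = intensity_flop_per_byte * bw_gbs;  // memory roof
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

For element-wise code at 0.25 FLOP/byte, even 3000 GB/s of HBM caps a GPU at 750 GFLOPS, a tiny fraction of its compute peak; the GPU's advantage there is purely its bandwidth ratio over the CPU.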

Data Size

For small data (kilobytes to low megabytes), the overhead of copying data to the GPU and launching a kernel exceeds the compute savings. Kernel launch latency is typically 5-20 microseconds, and each CPU-GPU transfer pays a few microseconds of fixed latency plus the bandwidth cost (a PCIe Gen4 x16 link sustains roughly 25 GB/s, about 40 nanoseconds per kilobyte).
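That overhead can be folded into a back-of-envelope crossover estimate for a bandwidth-bound operation whose data is already resident on the GPU: the GPU wins once the fixed launch latency is amortized. A sketch with assumed round numbers (10 us launch, 25 GB/s single-core CPU throughput, 3000 GB/s HBM):

```c
// Solve launch + bytes/gpu_bw = bytes/cpu_bw for bytes:
// crossover = launch / (1/cpu_bw - 1/gpu_bw). Valid when gpu_bw > cpu_bw.
double crossover_bytes(double launch_s, double cpu_gbs, double gpu_gbs) {
    double inv = 1.0 / (cpu_gbs * 1e9) - 1.0 / (gpu_gbs * 1e9);
    return launch_s / inv;
}
```

With the numbers above, the crossover lands around 250 KB, the same order of magnitude as the vector-add crossover cited in this section. Note that if 1/cpu_bw < 1/gpu_path_bw (e.g., the data must cross PCIe every call), the formula goes negative: the GPU never wins on a single bandwidth-bound pass.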

📊

CPU SIMD vs GPU Crossover Points

Operation | CPU SIMD Faster When | GPU Faster When | Crossover
--- | --- | --- | ---
Vector add | n < 100K | n > 100K | ~100K floats (400 KB)
Dot product | n < 500K | n > 500K | ~500K floats (2 MB)
Sort | n < 1M | n > 1M | ~1M elements (4 MB)
Matrix multiply | n < 256 | n > 256 | ~256x256 (256 KB)
Convolution | small images only | large images, almost always | ~64x64 image
Batch inference | batch=1 (latency-bound) | batch > 4-8 | Depends on model

Note: Crossover assumes data is already on GPU for GPU case. If transfer is needed, crossover shifts to larger sizes.

Latency Requirements

For real-time applications (audio processing, game physics, interactive UIs), the round-trip latency of CPU-to-GPU data transfer can be prohibitive. CPU SIMD operates on data that is already in CPU cache with sub-microsecond latency.

💡 Decision Framework

Use CPU SIMD when:

  • Data is small (under 1 MB)
  • Latency matters more than throughput (real-time audio, UI)
  • Data is already on CPU and results are needed on CPU
  • The operation is simple enough for auto-vectorization
  • You need deterministic, predictable timing

Use GPU when:

  • Data is large (over 10 MB)
  • The operation has high arithmetic intensity (matrix multiply, convolution)
  • You can batch many operations to amortize launch overhead
  • The data can stay on GPU across multiple operations (pipeline)
  • Throughput matters more than latency

Real Benchmarks: End-to-End Comparison

Let us compare a realistic workload — batched matrix multiply for neural network inference — across implementations:

📊

Batched GEMM Performance (M=N=K=1024, batch=64)

Implementation | Time (ms) | GFLOPS | Notes
--- | --- | --- | ---
Scalar C (1 core) | 8200 | 10 | Baseline
MKL AVX2 (1 core) | 82 | 1000 | 100x scalar
MKL AVX2 (16 cores) | 6.2 | 13200 | 1323x scalar
MKL AVX-512 (16 cores) | 4.1 | 19900 | 2000x scalar
cuBLAS (A100) | 0.52 | 157000 | 15700x scalar
cuBLAS (H100, FP32) | 0.31 | 264000 | 26400x scalar
cuBLAS (H100, TF32) | 0.08 | 1022000 | 102200x scalar

Note: H100 TF32 uses tensor cores with reduced precision (19-bit mantissa). FP32 numbers are for CUDA cores only.

The H100 GPU with TF32 tensor cores is roughly 50x faster than a 16-core AVX-512 CPU for large matrix multiply (about 13x in full FP32). This is why GPU-based training and inference dominates: the workloads are fundamentally matrix-multiply-heavy, and GPUs have massive advantages for this specific operation.

However, for single-sample inference with small models, the picture is different:

📊

Single-Sample Inference Latency (Small Model, batch=1)

Implementation | Latency (us) | Throughput (samples/s) | Notes
--- | --- | --- | ---
CPU AVX2 (1 core) | 45 | 22,222 | No transfer overhead
CPU AVX-512 (1 core) | 28 | 35,714 | No transfer overhead
GPU (data on GPU) | 12 | 83,333 | Kernel launch overhead ~5us
GPU (data on CPU, needs transfer) | 65 | 15,385 | Transfer dominates

Note: For batch=1 with data on CPU, the CPU wins because GPU transfer overhead exceeds the compute savings.

Memory Alignment for SIMD

Aligned memory access is critical for SIMD performance. Misaligned loads are slower on most architectures and may fault on some ARM implementations.

// Correct alignment for each ISA
// Correct alignment for each ISA
// (Note: C11 aligned_alloc requires the size to be a multiple of the alignment.)
float* sse_buf    = (float*)aligned_alloc(16, n * sizeof(float));   // 16B for SSE
float* avx_buf    = (float*)aligned_alloc(32, n * sizeof(float));   // 32B for AVX/AVX2
float* avx512_buf = (float*)aligned_alloc(64, n * sizeof(float));   // 64B for AVX-512

// C++ aligned new (C++17)
auto* buf = new (std::align_val_t{64}) float[n];

// Check alignment at runtime
assert(((uintptr_t)buf & 63) == 0);  // Must be 64-byte aligned for AVX-512
📊

Alignment Impact on Throughput

Alignment | SSE Load (GB/s) | AVX2 Load (GB/s) | AVX-512 Load (GB/s)
--- | --- | --- | ---
Optimal (register-width aligned) | 42 | 42 | 42
Half-aligned (e.g., 16B for AVX2) | 42 | 39 | 37
Unaligned (arbitrary) | 41 | 37 | 34
Cache-line split (worst case) | 38 | 32 | 28

Note: Modern x86 (Skylake+) has reduced the penalty for unaligned loads, but alignment still matters for peak throughput and portability.
⚠️ Alignment on ARM

On ARMv7, unaligned NEON loads may fault or silently produce incorrect results depending on the CPU configuration. ARMv8 (AArch64) handles unaligned access correctly but with a potential performance penalty. Always align to 16 bytes for NEON code, especially if targeting older ARM hardware.

Data Layout: AoS vs SoA

The choice between Array of Structures (AoS) and Structure of Arrays (SoA) dramatically affects SIMD efficiency. SIMD works best on contiguous, homogeneous data. AoS interleaves different fields, preventing efficient vectorization.

// Array of Structures (AoS) -- poor for SIMD
struct Particle_AoS {
    float x, y, z;    // Position
    float vx, vy, vz; // Velocity
    float mass;
};
Particle_AoS particles[N];
// To SIMD-process all x coordinates: load x, skip y,z,vx,vy,vz,mass, load next x...
// Stride = 28 bytes between x values. Terrible for SIMD.

// Structure of Arrays (SoA) -- excellent for SIMD
struct Particles_SoA {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
};
Particles_SoA particles;
// To SIMD-process all x coordinates: load 8 consecutive x values. Perfect.
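The conversion itself is a one-time transpose, cheap relative to repeated SIMD passes over the data. A sketch with the two layouts restated so it is self-contained (aos_to_soa is a hypothetical helper; fixed N for clarity):

```c
#define N 1024

struct Particle_AoS  { float x, y, z, vx, vy, vz, mass; };
struct Particles_SoA { float x[N], y[N], z[N], vx[N], vy[N], vz[N], mass[N]; };

// One-time AoS -> SoA transpose; after this, per-field loops are
// contiguous and vectorize cleanly.
void aos_to_soa(struct Particles_SoA* dst,
                const struct Particle_AoS* src, int n) {
    for (int i = 0; i < n; i++) {
        dst->x[i]  = src[i].x;   dst->y[i]  = src[i].y;   dst->z[i] = src[i].z;
        dst->vx[i] = src[i].vx;  dst->vy[i] = src[i].vy;  dst->vz[i] = src[i].vz;
        dst->mass[i] = src[i].mass;
    }
}
```

If the data arrives as AoS (e.g., from a file format or API), it often pays to convert once on load and keep the SoA form for the lifetime of the simulation.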
📊

AoS vs SoA for Particle Update (N=1M, AVX2)

Layout | Time (ms) | Bandwidth Utilization | Speedup
--- | --- | --- | ---
AoS (scalar) | 4.8 | 15% | 1.0x
AoS (auto-vec attempt) | 3.2 | 22% | 1.5x (poor vectorization)
SoA (scalar) | 3.1 | 23% | 1.55x (better cache use)
SoA (AVX2) | 0.6 | 85% | 8.0x

Note: SoA with AVX2 achieves near-peak memory bandwidth. AoS wastes bandwidth loading unused fields.

Conclusion

Vector processing spans a vast range of hardware, from 128-bit ARM NEON on a phone to 32-wide warps on a datacenter GPU. The principles are constant across all of them:

  1. Data parallelism is the source of throughput. Whether expressed as SIMD lanes or SIMT threads, processing multiple elements per instruction is what makes modern hardware fast.

  2. Contiguous, independent, uniform data is the prerequisite. Linked lists, loop-carried dependencies, and per-element branches all kill vectorization efficiency.

  3. Start with auto-vectorization. Use -O3 with the appropriate target flags (-mavx2, -march=native) and verify with compiler diagnostics. For simple loops, auto-vectorization reaches 80-95% of manual intrinsics performance.

  4. Drop to intrinsics for complex patterns. Sliding windows, custom reductions, approximate math functions, and data layout transformations often need manual SIMD code to achieve full performance.

  5. GPU SIMT is CPU SIMD taken to the extreme. The same data-parallel thinking applies, but at 1000x the throughput for bandwidth- and compute-bound workloads. The cost is transfer latency and programming complexity.

  6. The CPU-vs-GPU decision depends on data size and arithmetic intensity. Small data with low latency requirements belongs on CPU. Large data with high arithmetic intensity belongs on GPU. The crossover region (100 KB to 10 MB, moderate intensity) requires benchmarking for your specific workload.

The practical recommendation: understand both CPU SIMD and GPU SIMT, because modern systems use both. CPU SIMD handles preprocessing, postprocessing, and latency-sensitive paths. GPU SIMT handles the heavy compute. The best systems use each where it is strongest.