Introduction
A scalar loop summing 1,000 floats executes 1,000 add instructions at perhaps 2-4 cycles each—2,000-4,000 cycles total. An AVX2 loop processes 8 floats per instruction, dropping to 125 instructions—250-500 cycles with the same latency. An AVX-512 loop processes 16 floats per instruction—63 instructions, 126-252 cycles. A GPU warp processes 32 elements per instruction with 32-thread parallelism, achieving effective throughput of 1,000+ elements in under 50 cycles when memory-bound. The performance gap between scalar code and vector code is not 10-20%—it is 10-80x depending on the workload. Every performance-sensitive codebase in production—from JPEG decoders to neural network inference to physics simulations—uses vector processing. The only question is which variant: CPU SIMD (SSE, AVX, NEON) or GPU SIMT (warps, wavefronts).
This post covers the full vector processing landscape: CPU SIMD from SSE through AVX-512 and ARM NEON/SVE, GPU SIMT execution with warps, when compilers auto-vectorize successfully (and when they fail), when to drop to intrinsics, and the critical threshold where CPU SIMD stops being sufficient and you need a GPU.
The Concept: Why Vector Processing Exists
A scalar processor executes one operation per instruction: one add, one multiply, one load. A vector processor executes the same operation on multiple data elements simultaneously. If you need to add two arrays of 1,000 floats, a scalar processor issues 1,000 add instructions. An AVX2 processor issues 125 instructions (8 floats per instruction). An AVX-512 processor issues 63 instructions (16 floats per instruction). A GPU warp processes 32 elements per instruction.
The hardware cost of processing N elements in parallel is far less than N times the cost of processing one element. The control logic (instruction fetch, decode, scheduling) is shared across all lanes. Only the execution units and register file scale with width. This is why vector processing provides such high throughput per watt and per transistor.
Vector Width Across Architectures
| Architecture | Register Width | Floats/Op | Int32s/Op | Int8s/Op | Era |
|---|---|---|---|---|---|
| SSE/SSE2 | 128 bits | 4 | 4 | 16 | 1999-2001 |
| AVX/AVX2 | 256 bits | 8 | 8 | 32 | 2011-2013 |
| AVX-512 | 512 bits | 16 | 16 | 64 | 2017+ |
| ARM NEON | 128 bits | 4 | 4 | 16 | 2011+ (ARMv7/v8) |
| ARM SVE/SVE2 | 128-2048 bits | Scalable | Scalable | Scalable | 2020+ |
| RISC-V V ext. | Variable | Scalable | Scalable | Scalable | 2021+ |
| GPU warp (NVIDIA) | 32 threads | 32 | 32 | 128 (INT8 tensor) | 2006+ |
| GPU wavefront (AMD) | 64 threads | 64 | 64 | 256 | 2011+ |
CPU SIMD: SSE, AVX2, AVX-512, ARM NEON
SSE/SSE2 (128-bit)
SSE (Streaming SIMD Extensions) introduced 128-bit XMM registers that hold 4 single-precision floats or 2 doubles. SSE2 added integer operations and double-precision support. Every x86-64 processor supports SSE2 — it is the baseline for modern x86.
#include <xmmintrin.h> // SSE
#include <emmintrin.h> // SSE2
// SSE: 4-wide float add
void add_arrays_sse(float* out, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_store_ps(&out[i], vc);
    }
}
SSE is still relevant for code that must run on the widest range of x86 hardware. For portable libraries and system-level code, SSE2 is often the target.
AVX2 (256-bit)
AVX2 doubled the register width to 256 bits (8 floats) and added FMA (fused multiply-add) instructions. FMA computes a * b + c in a single instruction with a single rounding, which doubles peak floating-point throughput compared to separate multiply and add.
#include <immintrin.h>
// AVX2 dot product with FMA
float dot_product_avx2(const float* a, const float* b, int n) {
    __m256 sum = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(&a[i]);
        __m256 vb = _mm256_load_ps(&b[i]);
        sum = _mm256_fmadd_ps(va, vb, sum); // sum += a * b (fused)
    }
    // Horizontal reduction: sum all 8 lanes
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 lo = _mm256_castps256_ps128(sum);
    __m128 s = _mm_add_ps(hi, lo); // 4 elements
    s = _mm_hadd_ps(s, s);         // 2 elements
    s = _mm_hadd_ps(s, s);         // 1 element
    return _mm_cvtss_f32(s);
}
AVX2 is the sweet spot for most x86 SIMD code today. It is supported by all processors from Intel Haswell (2013) and AMD Zen 2 (2019) onward, which covers essentially all server and desktop hardware in production.
FMA doubles the peak FLOP rate compared to separate multiply and add. A Zen 4 core can execute two 256-bit FMA instructions per cycle, yielding 32 FLOPS per cycle per core (2 instructions x 8 floats x 2 operations, where the factor of 2 in FMA counts both the multiply and the add). Without FMA, peak throughput halves. Always use FMA when accumulating products (dot products, matrix multiply, convolutions).
AVX-512 (512-bit)
AVX-512 doubles the width again to 512 bits (16 floats) and adds powerful features: predicated execution with mask registers, gather/scatter instructions, and conflict detection. However, AVX-512 support is fragmented:
- Intel Skylake-SP (2017): First server support
- Intel Ice Lake (2019): First client support (mobile; client desktop followed with Rocket Lake in 2021)
- AMD Zen 4 (2022): First AMD support (256-bit execution internally — two 256-bit uops per 512-bit instruction)
- Intel Alder Lake/Raptor Lake: Disabled on efficiency cores
#include <immintrin.h>
// AVX-512 masked conditional processing
void threshold_avx512(float* out, const float* in, float thresh, int n) {
    __m512 vthresh = _mm512_set1_ps(thresh);
    for (int i = 0; i < n; i += 16) {
        __m512 v = _mm512_load_ps(&in[i]);
        // Create mask: which elements exceed threshold?
        __mmask16 mask = _mm512_cmp_ps_mask(v, vthresh, _CMP_GT_OQ);
        // Store only elements that pass the threshold;
        // others remain unchanged in output
        _mm512_mask_storeu_ps(&out[i], mask, v);
    }
}
// AVX-512 gather: load from non-contiguous addresses
void gather_example(float* out, const float* table, const int* indices, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512i vidx = _mm512_load_si512(&indices[i]);
        __m512 gathered = _mm512_i32gather_ps(vidx, table, 4); // scale=4 (sizeof float)
        _mm512_store_ps(&out[i], gathered);
    }
}
AVX-512 vs AVX2 Performance
| Operation | AVX2 (ns/1M elems) | AVX-512 (ns/1M elems) | Speedup | Notes |
|---|---|---|---|---|
| Array add | 125 | 68 | 1.84x | Near-ideal 2x |
| Dot product | 210 | 115 | 1.83x | FMA bandwidth limited |
| Masked store | 380 (emulated) | 140 | 2.71x | Native masking wins big |
| Gather (random) | 850 | 520 | 1.63x | Memory-bound, less benefit |
| GEMM (256x256) | 4200 | 2400 | 1.75x | Compute-bound |
On older Intel CPUs (Skylake, Cascade Lake), executing AVX-512 instructions causes the CPU to reduce its clock frequency by 100-200 MHz to stay within power limits. This means AVX-512 code must provide at least 10-15% more throughput per cycle to break even after the frequency penalty. On newer Intel CPUs (Sapphire Rapids, Granite Rapids) and AMD Zen 4+, the throttling is reduced or eliminated.
ARM NEON (128-bit)
ARM NEON provides 128-bit SIMD on all ARMv8-A processors (Cortex-A53+, Apple M-series, Graviton). NEON is mandatory in ARMv8, so unlike x86 where you must check for SSE/AVX support, you can always assume NEON is available on 64-bit ARM.
#include <arm_neon.h>
// NEON FIR filter
void fir_filter_neon(const float* input, float* output,
                     const float* coeffs, int n, int taps) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int t = 0; t < taps; t++) {
            float32x4_t in_vec = vld1q_f32(&input[i + t]);
            float32x4_t co_val = vdupq_n_f32(coeffs[t]); // Broadcast
            acc = vfmaq_f32(acc, in_vec, co_val);        // FMA
        }
        vst1q_f32(&output[i], acc);
    }
}
// NEON dot product
float dot_product_neon(const float* a, const float* b, int n) {
    float32x4_t sum = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        sum = vfmaq_f32(sum, va, vb);
    }
    // Horizontal sum
    return vaddvq_f32(sum); // ARMv8 has vaddvq for horizontal reduction
}
NEON is particularly important for mobile and edge computing. Apple’s M-series chips, AWS Graviton processors, and Qualcomm Snapdragon all use NEON for compute-intensive workloads. The instruction set is clean and orthogonal, making it arguably more pleasant to write intrinsics for than x86 SSE/AVX.
ARM SVE/SVE2 (Scalable)
SVE (Scalable Vector Extension) breaks from the fixed-width SIMD tradition. Instead of specifying a register width, SVE code operates on vectors of implementation-defined length. The same binary works on hardware with 128-bit to 2048-bit vector registers.
#include <arm_sve.h>
// SVE vector add -- works at any vector length
void add_arrays_sve(float* out, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += svcntw()) { // svcntw: 32-bit elements per vector
        svbool_t pg = svwhilelt_b32(i, n);  // Predicate: active lanes
        svfloat32_t va = svld1(pg, &a[i]);
        svfloat32_t vb = svld1(pg, &b[i]);
        svfloat32_t vc = svadd_f32_m(pg, va, vb);
        svst1(pg, &out[i], vc);
    }
}
SVE’s key innovation is the predicate-first programming model. Every operation takes a predicate mask that controls which lanes are active. This cleanly handles loop tails (when n is not a multiple of the vector length) without separate scalar cleanup code. AWS Graviton3 implements SVE with 256-bit vectors; Fujitsu’s A64FX (used in the Fugaku supercomputer) implements it with 512-bit vectors.
SIMD Instruction Set Comparison Summary
| ISA | Width | FMA? | Masking | Gather/Scatter | Best For |
|---|---|---|---|---|---|
| SSE2 | 128b | No | Emulated | No | Baseline compatibility |
| AVX2+FMA | 256b | Yes | Emulated | Gather only | General-purpose SIMD |
| AVX-512 | 512b | Yes | Native (k-masks) | Yes | HPC, AI inference |
| NEON | 128b | Yes (v8) | Limited | No | ARM mobile/server |
| SVE/SVE2 | Scalable | Yes | Native (predicates) | Yes | Portable ARM HPC |
GPU SIMT: Warps and Wavefronts
GPUs take vector processing to an extreme. Instead of wide registers operated on by a single thread, GPUs use many threads that execute in lockstep groups called warps (NVIDIA, 32 threads) or wavefronts (AMD, 32 or 64 threads).
How SIMT Works
In SIMT (Single Instruction, Multiple Threads), each thread has its own registers and program counter (conceptually), but threads in a warp execute the same instruction at the same time. When all threads in a warp take the same branch, execution is efficient. When threads diverge (different threads take different branches), the warp executes both paths serially, masking out inactive threads.
// GPU kernel: each thread processes one element
__global__ void vector_add(float* out, const float* a, const float* b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = a[idx] + b[idx];
    }
}
// Launched with: vector_add<<<(n+255)/256, 256>>>(out, a, b, n);
// Each warp (32 threads) processes 32 elements simultaneously
The key difference from CPU SIMD: in CPU SIMD, the programmer explicitly loads data into vector registers and calls vector operations. In GPU SIMT, the programmer writes scalar code for one element, and the hardware automatically executes it across 32 threads. The parallelism is implicit in the programming model.
CPU SIMD (explicit vectorization):
Thread 0: load 8 floats -> multiply 8 floats -> store 8 floats
GPU SIMT (implicit vectorization):
Threads 0-31: each loads 1 float -> each multiplies 1 float -> each stores 1 float
(executed simultaneously by warp scheduler)
Warp Divergence: The GPU’s Branch Problem
CPU SIMD handles branches by computing both paths and blending with masks (AVX-512 does this natively). GPU SIMT handles branches by executing both paths serially with thread masking:
__global__ void conditional_kernel(float* out, const float* in, float thresh, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (in[idx] > thresh) {        // Warp divergence if some threads go each way
            out[idx] = in[idx] * 2.0f; // Path A: active threads execute
        } else {
            out[idx] = in[idx] * 0.5f; // Path B: active threads execute
        }
    }
}
// Worst case: 50% of warp takes each branch = 50% utilization
// Best case: all threads take same branch = 100% utilization
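For contrast, here is a sketch of how CPU SIMD would express the same conditional branchlessly with AVX2 (the function name and target attribute are illustrative, not from the original): both paths are computed for every lane, and the comparison mask selects the per-lane result. No lane diverges, but no work is saved either, mirroring the GPU's serialized paths.

```c
#include <immintrin.h>

// Branchless AVX2 sketch of the same threshold kernel (n assumed to be
// a multiple of 8 for brevity). Both paths execute for all 8 lanes; the
// comparison mask blends the per-lane results.
__attribute__((target("avx2"))) // so this compiles without -mavx2 (GCC/Clang)
void conditional_avx2(float* out, const float* in, float thresh, int n) {
    __m256 vthresh = _mm256_set1_ps(thresh);
    __m256 two  = _mm256_set1_ps(2.0f);
    __m256 half = _mm256_set1_ps(0.5f);
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(&in[i]);
        __m256 mask = _mm256_cmp_ps(v, vthresh, _CMP_GT_OQ); // all-ones where v > thresh
        __m256 a = _mm256_mul_ps(v, two);   // Path A: v * 2
        __m256 b = _mm256_mul_ps(v, half);  // Path B: v * 0.5
        _mm256_storeu_ps(&out[i], _mm256_blendv_ps(b, a, mask));
    }
}
```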
Warp Divergence Impact
| Divergence Pattern | Active Threads (avg) | Throughput vs No Divergence |
|---|---|---|
| No divergence (all same path) | 32/32 | 100% |
| 2-way, 50/50 split | 16/32 | ~50% |
| 4-way divergence | 8/32 | ~25% |
| Fully random (per-thread) | 1/32 | ~3% (catastrophic) |
CPU SIMD and GPU SIMT are both implementations of data parallelism. The difference is where the parallelism is expressed:
- CPU SIMD: Parallelism is explicit in the instruction (one instruction operates on N data elements in a vector register).
- GPU SIMT: Parallelism is implicit in the thread model (N threads execute the same scalar instruction simultaneously).
Both have the same fundamental limitation: branchy, irregular code performs poorly because some lanes/threads are idle during divergent execution.
Auto-Vectorization by Compilers
Modern compilers (GCC, Clang, MSVC, ICC) can automatically convert scalar loops to SIMD instructions. This works well for simple patterns and poorly for complex ones.
What Auto-Vectorizes Well
// Perfect auto-vectorization candidate: independent, contiguous, uniform
void scale_array(float* __restrict__ out, const float* __restrict__ in,
                 float factor, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * factor;
    }
}
// gcc -O3 -mavx2: generates vmulps ymm (8-wide multiply)
// clang -O3 -mavx2: same, plus may unroll 2-4x
The __restrict__ qualifier tells the compiler that out and in do not overlap, enabling vectorization. Without it, the compiler must assume potential aliasing and may not vectorize.
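To see concretely why aliasing blocks vectorization, consider what overlapping arrays do to a simple scaling loop (a minimal sketch; the function name is hypothetical). With `out` pointing one element past `in`, each store feeds the next iteration's load, so the loop is inherently serial:

```c
// Without __restrict__, the compiler must assume calls like
// scale2(buf + 1, buf, n) are legal. In that overlapping case the loop
// computes buf[i+1] = buf[i] * 2 sequentially, consuming each value the
// previous iteration just wrote -- an 8-wide vectorized version that
// loaded 8 stale values at once would produce different results.
void scale2(float* out, const float* in, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * 2.0f;
    }
}
```

Starting from buf = {1, 0, 0, 0, 0}, the call scale2(buf + 1, buf, 4) leaves {1, 2, 4, 8, 16}: the scalar semantics double the running value, which is exactly what a blind vectorization would break.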
What Does Not Auto-Vectorize
// Loop-carried dependency: cannot vectorize
void prefix_sum(float* out, const float* in, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; i++) {
        out[i] = out[i-1] + in[i]; // Depends on previous iteration
    }
}
// Conditional with side effects: difficult to vectorize
void conditional_accumulate(float* sum, const float* data,
                            const int* flags, int n) {
    for (int i = 0; i < n; i++) {
        if (flags[i]) {
            *sum += data[i]; // Reduction with conditional
        }
    }
}
// Indirect access (gather): vectorizes poorly before AVX2/AVX-512
void gather_lookup(float* out, const float* table,
                   const int* indices, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = table[indices[i]]; // Random access pattern
    }
}
Auto-Vectorization Success Rate by Pattern
| Pattern | GCC Auto-vec? | Clang Auto-vec? | Manual Intrinsics Needed? |
|---|---|---|---|
| Simple array operation | Yes | Yes | No (auto-vec is optimal) |
| Reduction (sum, min, max) | Yes | Yes | Rarely |
| Conditional assignment | Sometimes | Yes (if-conversion) | Sometimes |
| Sliding window (FIR) | Partial | Partial | Yes (2x+ gain) |
| Prefix sum | No | No | Yes (parallel prefix) |
| AoS to SoA conversion | No | Sometimes | Yes |
| Bit manipulation | Sometimes | Sometimes | Often |
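For the prefix-sum row above, the manual workaround is a parallel-prefix kernel. A minimal SSE2 sketch (function names are hypothetical): a Hillis-Steele scan inside each 4-lane register, with the running total carried across iterations.

```c
#include <emmintrin.h>

// Inclusive prefix sum of the 4 floats in one XMM register:
// shift lanes up by one element (4 bytes), add; then by two (8 bytes), add.
static inline __m128 prefix4(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

// n assumed to be a multiple of 4 for brevity
void prefix_sum_sse(float* out, const float* in, int n) {
    __m128 carry = _mm_setzero_ps(); // running total, same value in all 4 lanes
    for (int i = 0; i < n; i += 4) {
        __m128 x = prefix4(_mm_loadu_ps(&in[i]));
        x = _mm_add_ps(x, carry);
        _mm_storeu_ps(&out[i], x);
        carry = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 3, 3)); // broadcast last lane
    }
}
```

Note that the additions are reassociated relative to the scalar loop, so float results can differ in the last bits — one reason compilers refuse to do this transformation automatically without fast-math flags.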
Verifying Auto-Vectorization
Always verify that the compiler actually vectorized your loop. Do not assume.
# GCC: show which loops were vectorized
gcc -O3 -mavx2 -fopt-info-vec-optimized mycode.c
# Output: mycode.c:12:5: optimized: loop vectorized using 32 byte vectors
# GCC: show which loops FAILED to vectorize and why
gcc -O3 -mavx2 -fopt-info-vec-missed mycode.c
# Output: mycode.c:20:5: missed: couldn't vectorize loop
# mycode.c:20:5: missed: not vectorized: unsupported data-ref access
# Clang: similar but different flags
clang -O3 -mavx2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize mycode.c
SIMD (auto or manual) works when data is: (1) contiguous in memory — arrays, not linked lists; (2) independent — no loop-carried dependencies between iterations; (3) uniform — same operation on every element, minimal per-element branching. If any of these are missing, vectorization is limited or impossible.
Intrinsics for Manual Vectorization
When auto-vectorization fails or produces suboptimal code, intrinsics give you direct control over SIMD instructions while remaining in C/C++ (no assembly required).
Real-World Example: Vectorized Softmax
Softmax is ubiquitous in neural network inference. Here is a comparison of scalar, auto-vectorized, and hand-written AVX2 implementations:
// Scalar softmax (baseline)
#include <math.h> // for expf
void softmax_scalar(float* out, const float* in, int n) {
    // Find max (for numerical stability)
    float max_val = in[0];
    for (int i = 1; i < n; i++) {
        if (in[i] > max_val) max_val = in[i];
    }
    // Compute exp(x - max) and sum
    float sum = 0;
    for (int i = 0; i < n; i++) {
        out[i] = expf(in[i] - max_val);
        sum += out[i];
    }
    // Normalize
    float inv_sum = 1.0f / sum;
    for (int i = 0; i < n; i++) {
        out[i] *= inv_sum;
    }
}
// AVX2 softmax with fast exp approximation
#include <immintrin.h>
// Fast exp approximation (relative error < 0.1%)
static inline __m256 fast_exp_avx2(__m256 x) {
    // Clamp to avoid overflow/underflow
    x = _mm256_max_ps(x, _mm256_set1_ps(-88.0f));
    x = _mm256_min_ps(x, _mm256_set1_ps(88.0f));
    // exp(x) = 2^(x * log2(e))
    __m256 log2e = _mm256_set1_ps(1.44269504f);
    __m256 t = _mm256_mul_ps(x, log2e);
    // Split into integer and fractional parts
    __m256 ti = _mm256_round_ps(t, _MM_FROUND_TO_NEAREST_INT);
    __m256 tf = _mm256_sub_ps(t, ti);
    // 2^fraction using polynomial approximation
    __m256 p = _mm256_fmadd_ps(tf, _mm256_set1_ps(0.240226507f), _mm256_set1_ps(0.693147182f));
    p = _mm256_fmadd_ps(p, tf, _mm256_set1_ps(1.0f));
    // 2^integer via the float exponent field
    __m256i ii = _mm256_cvtps_epi32(ti);
    ii = _mm256_add_epi32(ii, _mm256_set1_epi32(127));
    ii = _mm256_slli_epi32(ii, 23);
    __m256 pow2i = _mm256_castsi256_ps(ii);
    return _mm256_mul_ps(p, pow2i);
}
// Horizontal sum of all 8 lanes (same reduction as the dot product above)
static inline float horizontal_sum_avx2(__m256 v) {
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 s = _mm_add_ps(hi, lo);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
void softmax_avx2(float* out, const float* in, int n) {
    // Pass 1: find max (8-wide)
    __m256 vmax = _mm256_load_ps(in);
    for (int i = 8; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&in[i]);
        vmax = _mm256_max_ps(vmax, v);
    }
    // Horizontal max reduction
    __m128 hi = _mm256_extractf128_ps(vmax, 1);
    __m128 lo = _mm256_castps256_ps128(vmax);
    __m128 m = _mm_max_ps(hi, lo);
    m = _mm_max_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1,0,3,2)));
    m = _mm_max_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(0,1,0,1)));
    __m256 max_val = _mm256_broadcastss_ps(m);
    // Pass 2: exp(x - max) and sum (8-wide)
    __m256 vsum = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&in[i]);
        __m256 shifted = _mm256_sub_ps(v, max_val);
        __m256 e = fast_exp_avx2(shifted);
        _mm256_store_ps(&out[i], e);
        vsum = _mm256_add_ps(vsum, e);
    }
    // Horizontal sum
    float sum = horizontal_sum_avx2(vsum);
    __m256 inv = _mm256_set1_ps(1.0f / sum);
    // Pass 3: normalize (8-wide)
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&out[i]);
        _mm256_store_ps(&out[i], _mm256_mul_ps(v, inv));
    }
}
Softmax Performance (n=4096 floats, Zen 4)
| Implementation | Time (us) | Speedup | Accuracy |
|---|---|---|---|
| Scalar (gcc -O3) | 18.2 | 1.0x | Exact (libm expf) |
| Auto-vectorized (gcc -O3 -mavx2) | 6.8 | 2.7x | Exact |
| AVX2 intrinsics + fast exp | 2.1 | 8.7x | < 0.1% relative error |
| AVX-512 intrinsics + fast exp | 1.2 | 15.2x | < 0.1% relative error |
The 3.2x gap between auto-vectorization and manual intrinsics comes from the fast exp approximation. The compiler cannot replace expf() with a less-accurate-but-faster version — that is a semantic change that only the programmer can make. This is a common pattern: the biggest wins from intrinsics often come not from better SIMD usage, but from algorithmic changes that are only possible with manual control.
Mapping the Same Algorithm: CPU SIMD vs GPU
Let us trace how a simple algorithm — element-wise ReLU (out[i] = max(in[i], 0)) — maps to different hardware:
Scalar CPU
void relu_scalar(float* out, const float* in, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] > 0 ? in[i] : 0;
    }
}
// 1 element per iteration, ~n cycles
AVX2 CPU (8-wide)
void relu_avx2(float* out, const float* in, int n) {
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(&in[i]);
        __m256 result = _mm256_max_ps(v, zero);
        _mm256_store_ps(&out[i], result);
    }
}
// 8 elements per iteration, ~n/8 cycles (memory-bound for large n)
AVX-512 CPU (16-wide)
void relu_avx512(float* out, const float* in, int n) {
    __m512 zero = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        __m512 v = _mm512_load_ps(&in[i]);
        __m512 result = _mm512_max_ps(v, zero);
        _mm512_store_ps(&out[i], result);
    }
}
// 16 elements per iteration, ~n/16 cycles (almost certainly memory-bound)
CUDA GPU (32-wide warps)
__global__ void relu_gpu(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = fmaxf(in[idx], 0.0f);
    }
}
// 32 elements per warp per cycle, thousands of warps in flight
// n/32 warp-instructions, but many warps execute concurrently
ReLU Performance Across Architectures (n=100M floats)
| Implementation | Time (ms) | Bandwidth (GB/s) | % of Peak BW |
|---|---|---|---|
| Scalar (1 core) | 120 | 3.3 | 5% |
| AVX2 (1 core) | 16 | 25 | 38% |
| AVX2 (16 cores) | 1.8 | 222 | 76% (DDR5) |
| AVX-512 (1 core) | 10 | 40 | 61% |
| AVX-512 (16 cores) | 1.2 | 333 | ~95% (DDR5 peak) |
| A100 GPU | 0.14 | 2857 | 85% (HBM2e peak) |
| H100 GPU | 0.10 | 4000 | ~80% (HBM3 peak) |
The GPU is 10x faster than a 16-core AVX-512 CPU for this memory-bound operation. The speedup comes entirely from bandwidth: the H100 has ~5 TB/s of HBM3 bandwidth versus ~300 GB/s of DDR5 bandwidth. For compute-bound operations where arithmetic intensity is high, the gap can be even larger (100-1000x for dense matrix multiply).
When CPU SIMD Is Enough vs When to Go to GPU
This is the critical decision for system architects. The answer depends on three factors: arithmetic intensity, data size, and latency requirements.
Arithmetic Intensity
Arithmetic intensity measures how much computation you do per byte of data moved. Low-intensity operations (element-wise, reductions) are memory-bandwidth-bound. High-intensity operations (matrix multiply, convolutions) are compute-bound.
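As a back-of-envelope sketch (assuming 4-byte floats and counting only compulsory memory traffic — illustrative estimates, not measurements), arithmetic intensity can be computed directly:

```c
// Arithmetic intensity = FLOPs / bytes moved.
double ai_vector_add(void)  { return 1.0 / (3 * 4.0); } // 1 add; 2 loads + 1 store
double ai_dot_product(void) { return 2.0 / (2 * 4.0); } // 1 mul + 1 add; 2 loads
// GEMM: 2n^3 FLOPs over ~3n^2 floats of compulsory traffic
double ai_gemm(double n)    { return (2.0 * n * n * n) / (3.0 * n * n * 4.0); }
```

Vector add lands at ~0.083 FLOP/byte and dot product at 0.25 — firmly bandwidth-bound — while a 1024-wide GEMM reaches ~170 FLOP/byte, deep in compute-bound territory where GPU arithmetic throughput dominates.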
CPU vs GPU Advantage by Arithmetic Intensity
(Figure: GPU speedup over a 16-core CPU as a function of arithmetic intensity)
Data Size
For small data (kilobytes to low megabytes), the overhead of copying data to the GPU and launching a kernel exceeds the compute savings. Kernel launch latency is typically 5-20 microseconds, and PCIe transfers add several microseconds of fixed latency plus roughly 40 nanoseconds per kilobyte at PCIe 4.0 x16 speeds (~25 GB/s).
CPU SIMD vs GPU Crossover Points
| Operation | CPU SIMD Faster When | GPU Faster When | Crossover |
|---|---|---|---|
| Vector add | n < 100K | n > 100K | ~100K floats (400 KB) |
| Dot product | n < 500K | n > 500K | ~500K floats (2 MB) |
| Sort | n < 1M | n > 1M | ~1M elements (4 MB) |
| Matrix multiply | n < 256 | n > 256 | ~256x256 (256 KB) |
| Convolution (large image) | Never (small images) | Almost always | ~64x64 image |
| Batch inference (batch=1) | Often (latency) | batch > 4-8 | Depends on model |
Latency Requirements
For real-time applications (audio processing, game physics, interactive UIs), the round-trip latency of CPU-to-GPU data transfer can be prohibitive. CPU SIMD operates on data that is already in CPU cache with sub-microsecond latency.
Use CPU SIMD when:
- Data is small (under 1 MB)
- Latency matters more than throughput (real-time audio, UI)
- Data is already on CPU and results are needed on CPU
- The operation is simple enough for auto-vectorization
- You need deterministic, predictable timing
Use GPU when:
- Data is large (over 10 MB)
- The operation has high arithmetic intensity (matrix multiply, convolution)
- You can batch many operations to amortize launch overhead
- The data can stay on GPU across multiple operations (pipeline)
- Throughput matters more than latency
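These rules of thumb can be sketched as a first-cut dispatch heuristic (the thresholds below are the ballpark figures from the lists above, not measured constants, and the names are hypothetical — benchmark the crossover region for your workload):

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { USE_CPU_SIMD, USE_GPU, BENCHMARK_BOTH } target_t;

// Rough heuristic: data size, arithmetic intensity (FLOPs/byte),
// latency sensitivity, and data residency, per the guidelines above.
target_t choose_target(size_t bytes, double flops_per_byte,
                       bool latency_critical, bool data_on_gpu) {
    const size_t MB = 1u << 20;
    if (data_on_gpu) return USE_GPU;                       // no transfer to pay for
    if (latency_critical || bytes < 1 * MB) return USE_CPU_SIMD;
    if (bytes > 10 * MB || flops_per_byte > 10.0) return USE_GPU;
    return BENCHMARK_BOTH;                                 // the crossover region
}
```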
Real Benchmarks: End-to-End Comparison
Let us compare a realistic workload — batched matrix multiply for neural network inference — across implementations:
Batched GEMM Performance (M=N=K=1024, batch=64)
| Implementation | Time (ms) | GFLOPS | Notes |
|---|---|---|---|
| Scalar C (1 core) | 8200 | 10 | Baseline |
| MKL AVX2 (1 core) | 82 | 1000 | 100x scalar |
| MKL AVX2 (16 cores) | 6.2 | 13200 | 1323x scalar |
| MKL AVX-512 (16 cores) | 4.1 | 19900 | 2000x scalar |
| cuBLAS (A100) | 0.52 | 157000 | 15700x scalar |
| cuBLAS (H100, FP32) | 0.31 | 264000 | 26400x scalar |
| cuBLAS (H100, TF32) | 0.08 | 1022000 | 102200x scalar |
The H100 GPU with tensor cores is roughly 50x faster than a 16-core AVX-512 CPU for large matrix multiply (0.08 ms vs 4.1 ms in the table above). This is why GPU-based training and inference dominates: the workloads are fundamentally matrix-multiply-heavy, and GPUs have massive advantages for this specific operation.
However, for single-sample inference with small models, the picture is different:
Single-Sample Inference Latency (Small Model, batch=1)
| Implementation | Latency (us) | Throughput (samples/s) | Notes |
|---|---|---|---|
| CPU AVX2 (1 core) | 45 | 22,222 | No transfer overhead |
| CPU AVX-512 (1 core) | 28 | 35,714 | No transfer overhead |
| GPU (data on GPU) | 12 | 83,333 | Kernel launch overhead ~5us |
| GPU (data on CPU, needs transfer) | 65 | 15,385 | Transfer dominates |
Memory Alignment for SIMD
Aligned memory access is critical for SIMD performance. Misaligned loads are slower on most architectures and may fault on some ARM implementations.
// Correct alignment for each ISA
float* sse_buf = (float*)aligned_alloc(16, n * sizeof(float)); // 16B for SSE
float* avx_buf = (float*)aligned_alloc(32, n * sizeof(float)); // 32B for AVX/AVX2
float* avx512_buf = (float*)aligned_alloc(64, n * sizeof(float)); // 64B for AVX-512
// C++ aligned new (C++17)
auto* buf = new (std::align_val_t{64}) float[n];
// Check alignment at runtime
assert(((uintptr_t)buf & 63) == 0); // Must be 64-byte aligned for AVX-512
Alignment Impact on Throughput
| Alignment | SSE Load (GB/s) | AVX2 Load (GB/s) | AVX-512 Load (GB/s) |
|---|---|---|---|
| Optimal (register-width aligned) | 42 | 42 | 42 |
| Half-aligned (e.g., 16B for AVX2) | 42 | 39 | 37 |
| Unaligned (arbitrary) | 41 | 37 | 34 |
| Cache-line split (worst case) | 38 | 32 | 28 |
On ARMv7, unaligned NEON loads may fault or silently produce incorrect results depending on the CPU configuration. ARMv8 (AArch64) handles unaligned access correctly but with a potential performance penalty. Always align to 16 bytes for NEON code, especially if targeting older ARM hardware.
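The examples in this post assume n is a multiple of the vector width and pointers are aligned. A sketch of handling arbitrary pointers and lengths (assuming AVX2; the function name is hypothetical and the target attribute just lets the snippet compile without -mavx2 on GCC/Clang): peel scalar iterations until the store pointer reaches a 32-byte boundary, run an aligned vector body, then finish the tail in scalar code.

```c
#include <stdint.h>
#include <immintrin.h>

__attribute__((target("avx2")))
void scale_any(float* out, const float* in, float factor, int n) {
    int i = 0;
    // Scalar prologue: advance until `out` is 32-byte aligned
    while (i < n && (((uintptr_t)&out[i]) & 31) != 0) {
        out[i] = in[i] * factor;
        i++;
    }
    // Aligned 8-wide body; `in` may still be misaligned, so load with loadu
    __m256 vf = _mm256_set1_ps(factor);
    for (; i + 8 <= n; i += 8) {
        _mm256_store_ps(&out[i], _mm256_mul_ps(_mm256_loadu_ps(&in[i]), vf));
    }
    // Scalar epilogue: remaining 0-7 elements
    for (; i < n; i++) {
        out[i] = in[i] * factor;
    }
}
```

Aligning the store side is a common choice when only one of the two streams can be aligned, since misaligned stores are often the more expensive access.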
Data Layout: AoS vs SoA
The choice between Array of Structures (AoS) and Structure of Arrays (SoA) dramatically affects SIMD efficiency. SIMD works best on contiguous, homogeneous data. AoS interleaves different fields, preventing efficient vectorization.
// Array of Structures (AoS) -- poor for SIMD
struct Particle_AoS {
    float x, y, z;    // Position
    float vx, vy, vz; // Velocity
    float mass;
};
Particle_AoS particles[N];
// To SIMD-process all x coordinates: load x, skip y,z,vx,vy,vz,mass, load next x...
// Stride = 28 bytes between x values. Terrible for SIMD.
// Structure of Arrays (SoA) -- excellent for SIMD
struct Particles_SoA {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
};
Particles_SoA particles;
// To SIMD-process all x coordinates: load 8 consecutive x values. Perfect.
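The SoA layout pays off in kernels like a position update. A hedged AVX2 sketch (the function name is hypothetical, and the target attribute just lets it compile without -mavx2/-mfma on GCC/Clang): eight particles per FMA, with every load a fully-used contiguous chunk.

```c
#include <immintrin.h>

// SoA position update: x[i] += vx[i] * dt, 8 particles per iteration.
__attribute__((target("avx2,fma")))
void update_positions_soa(float* x, const float* vx, float dt, int n) {
    __m256 vdt = _mm256_set1_ps(dt);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 px = _mm256_loadu_ps(&x[i]);
        __m256 pv = _mm256_loadu_ps(&vx[i]);
        _mm256_storeu_ps(&x[i], _mm256_fmadd_ps(pv, vdt, px)); // x += vx*dt
    }
    for (; i < n; i++) x[i] += vx[i] * dt; // scalar tail
}
```

Run once per field (y/vy, z/vz) or fuse the three axes into one loop; either way the loads are contiguous, which is exactly what the AoS layout's 28-byte stride prevents.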
AoS vs SoA for Particle Update (N=1M, AVX2)
| Layout | Time (ms) | Bandwidth Utilization | Speedup |
|---|---|---|---|
| AoS (scalar) | 4.8 | 15% | 1.0x |
| AoS (auto-vec attempt) | 3.2 | 22% | 1.5x (poor vectorization) |
| SoA (scalar) | 3.1 | 23% | 1.55x (better cache use) |
| SoA (AVX2) | 0.6 | 85% | 8.0x |
Conclusion
Vector processing spans a vast range of hardware, from 128-bit ARM NEON on a phone to 32-wide warps on a datacenter GPU. The principles are constant across all of them:
- Data parallelism is the source of throughput. Whether expressed as SIMD lanes or SIMT threads, processing multiple elements per instruction is what makes modern hardware fast.
- Contiguous, independent, uniform data is the prerequisite. Linked lists, loop-carried dependencies, and per-element branches all kill vectorization efficiency.
- Start with auto-vectorization. Use -O3 with the appropriate target flags (-mavx2, -march=native) and verify with compiler diagnostics. For simple loops, auto-vectorization reaches 80-95% of manual intrinsics performance.
- Drop to intrinsics for complex patterns. Sliding windows, custom reductions, approximate math functions, and data layout transformations often need manual SIMD code to achieve full performance.
- GPU SIMT is CPU SIMD taken to the extreme. The same data-parallel thinking applies, but at 1000x the throughput for bandwidth- and compute-bound workloads. The cost is transfer latency and programming complexity.
- The CPU-vs-GPU decision depends on data size and arithmetic intensity. Small data with low latency requirements belongs on CPU. Large data with high arithmetic intensity belongs on GPU. The crossover region (100 KB to 10 MB, moderate intensity) requires benchmarking for your specific workload.
The practical recommendation: understand both CPU SIMD and GPU SIMT, because modern systems use both. CPU SIMD handles preprocessing, postprocessing, and latency-sensitive paths. GPU SIMT handles the heavy compute. The best systems use each where it is strongest.