A memory-bound kernel loading 32-bit floats achieves 800 GB/s on an A100 with 2,039 GB/s peak bandwidth — 39% efficiency. Change float to float4 and the same kernel jumps to 1,600 GB/s — 78% efficiency. The only difference: each thread loads 16 bytes instead of 4. The load/store unit issues the same number of instructions, but each instruction moves 4x more data. This 2x throughput improvement costs no additional arithmetic and requires little more than a type change (plus adjusting the element count). Vectorized loads are among the highest-return optimizations in CUDA kernel development.
All measurements target NVIDIA Ampere (A100-80GB SXM, SM 8.0) unless stated otherwise.
How Vectorized Loads Work at the Hardware Level
The CUDA load/store unit can issue transactions of 32, 64, or 128 bits per thread. The instruction generated depends on the data type:
- float (4 bytes): LDG.32 — 32-bit load per thread
- float2 (8 bytes): LDG.64 — 64-bit load per thread
- float4 (16 bytes): LDG.128 — 128-bit load per thread
Per warp:
- LDG.32: 32 threads x 4 bytes = 128 bytes = 1 cache line request
- LDG.64: 32 threads x 8 bytes = 256 bytes = 2 cache line requests
- LDG.128: 32 threads x 16 bytes = 512 bytes = 4 cache line requests
The critical insight: all three issue a single load instruction per thread. The memory controller handles the wider transactions with the same instruction count. Since the load/store unit throughput is measured in instructions per cycle (not bytes per cycle), wider loads move more data per cycle.
#include <cuda_runtime.h>
// Scalar load: one float per thread, one LDG.32 per thread
__global__ void copy_scalar(const float* __restrict__ src,
float* __restrict__ dst, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
dst[idx] = src[idx];
}
}
// Vectorized load: four floats per thread, one LDG.128 per thread
__global__ void copy_float4(const float4* __restrict__ src,
float4* __restrict__ dst, int n4) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n4) {
dst[idx] = src[idx];
}
}
Generated PTX/SASS
Compile with nvcc -arch=sm_80 -Xptxas -v and inspect:
// Scalar version generates:
ld.global.nc.f32 %f1, [%rd2];
st.global.f32 [%rd3], %f1;
// float4 version generates:
ld.global.nc.v4.f32 {%f1, %f2, %f3, %f4}, [%rd2];
st.global.v4.f32 [%rd3], {%f1, %f2, %f3, %f4};
The v4.f32 suffix indicates a 128-bit (4x32-bit) vectorized transaction. At the SASS level, this compiles to LDG.E.128 and STG.E.128.
Vector Types in CUDA
CUDA provides built-in vector types for all standard sizes:
// 64-bit (8 bytes) vector types
float2 f2; // .x, .y
int2 i2;
double d1; // Already 64-bit
ushort4 us4; // 4 x ushort = 8 bytes
// 128-bit (16 bytes) vector types
float4 f4; // .x, .y, .z, .w
int4 i4;
double2 d2; // .x, .y
uint4 ui4;
longlong2 ll2; // .x, .y
// Construction
float4 val = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
int4 ival = make_int4(1, 2, 3, 4);
double2 dval = make_double2(1.0, 2.0);
Vector Type Sizes and Load Instructions
| Type | Size (bytes) | Load Instruction | Bytes/Warp/Instruction |
|---|---|---|---|
| float | 4 | LDG.32 | 128 |
| float2 | 8 | LDG.64 | 256 |
| float4 | 16 | LDG.128 | 512 |
| double | 8 | LDG.64 | 256 |
| double2 | 16 | LDG.128 | 512 |
| int | 4 | LDG.32 | 128 |
| int4 | 16 | LDG.128 | 512 |
Alignment Requirements
Vectorized loads require the base pointer to be aligned to the vector size. A float4 load from address 0x100 works (16-byte aligned). A float4 load from address 0x104 is only 4-byte aligned: if the compiler cannot assume 16-byte alignment it emits scalar loads instead, and if a 128-bit load actually executes at that address it faults with a misaligned-address error.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdint> // for uintptr_t used in the alignment checks below
// cudaMalloc guarantees 256-byte alignment, so float4 (16-byte) is always safe
void check_alignment() {
float* d_ptr;
cudaMalloc(&d_ptr, 1024 * sizeof(float));
printf("cudaMalloc alignment: %zu\n", (size_t)d_ptr % 256); // Always 0
// Safe to cast to float4* because 256-byte aligned
float4* d_ptr4 = reinterpret_cast<float4*>(d_ptr);
// ... use d_ptr4 ...
// DANGER: offset pointer may not be float4-aligned
float* d_offset = d_ptr + 1; // Address is d_ptr + 4 bytes
// reinterpret_cast<float4*>(d_offset) -> MISALIGNED! 4-byte aligned, need 16
// This will either fault or generate scalar loads
cudaFree(d_ptr);
}
// Safe pattern: check alignment at runtime
__global__ void safe_vectorized_kernel(const float* __restrict__ data,
float* __restrict__ output,
int n) {
// Check if base pointer is 16-byte aligned
bool aligned = ((uintptr_t)data % 16 == 0) && ((uintptr_t)output % 16 == 0);
if (aligned && (n % 4 == 0)) {
// Fast path: vectorized
int n4 = n / 4;
const float4* data4 = reinterpret_cast<const float4*>(data);
float4* out4 = reinterpret_cast<float4*>(output);
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n4) {
float4 val = data4[idx];
val.x *= 2.0f;
val.y *= 2.0f;
val.z *= 2.0f;
val.w *= 2.0f;
out4[idx] = val;
}
} else {
// Slow path: scalar
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
output[idx] = data[idx] * 2.0f;
}
}
}
Behavior on misaligned vectorized access depends on where the misalignment is caught. If the compiler cannot prove 16-byte alignment (for example, a plain four-float struct without __align__(16)), it silently emits four scalar loads, destroying the performance benefit without any error. If an actual 128-bit load executes on a misaligned address, the kernel fails with a misaligned-address error. Do not rely on either behavior — always ensure alignment.
Benchmarking Scalar vs Vectorized Copy
#include <cuda_runtime.h>
#include <cstdio>
__global__ void copy_scalar(const float* __restrict__ src,
float* __restrict__ dst, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n; i += stride) {
dst[i] = src[i];
}
}
__global__ void copy_float2(const float2* __restrict__ src,
float2* __restrict__ dst, int n2) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n2; i += stride) {
dst[i] = src[i];
}
}
__global__ void copy_float4(const float4* __restrict__ src,
float4* __restrict__ dst, int n4) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n4; i += stride) {
dst[i] = src[i];
}
}
int main() {
int n = 1 << 26; // 64M floats = 256 MB
size_t bytes = n * sizeof(float);
float *d_src, *d_dst;
cudaMalloc(&d_src, bytes);
cudaMalloc(&d_dst, bytes);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
int block = 256;
int grid = 108 * 8; // 8 blocks per SM * 108 SMs
    // Warm up once per kernel so the timed runs don't include first-launch overhead
    copy_scalar<<<grid, block>>>(d_src, d_dst, n);
    copy_float2<<<grid, block>>>(reinterpret_cast<float2*>(d_src),
                                 reinterpret_cast<float2*>(d_dst), n / 2);
    copy_float4<<<grid, block>>>(reinterpret_cast<float4*>(d_src),
                                 reinterpret_cast<float4*>(d_dst), n / 4);
    cudaDeviceSynchronize();
// Scalar
{
cudaEventRecord(start);
for (int i = 0; i < 50; i++)
copy_scalar<<<grid, block>>>(d_src, d_dst, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms; cudaEventElapsedTime(&ms, start, stop);
printf("Scalar (float): %.1f GB/s\n", 2.0f * bytes * 50 / 1e9 / (ms / 1000.0f));
}
// float2
{
cudaEventRecord(start);
for (int i = 0; i < 50; i++)
copy_float2<<<grid, block>>>(
reinterpret_cast<float2*>(d_src),
reinterpret_cast<float2*>(d_dst), n / 2);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms; cudaEventElapsedTime(&ms, start, stop);
printf("float2: %.1f GB/s\n", 2.0f * bytes * 50 / 1e9 / (ms / 1000.0f));
}
// float4
{
cudaEventRecord(start);
for (int i = 0; i < 50; i++)
copy_float4<<<grid, block>>>(
reinterpret_cast<float4*>(d_src),
reinterpret_cast<float4*>(d_dst), n / 4);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms; cudaEventElapsedTime(&ms, start, stop);
printf("float4: %.1f GB/s\n", 2.0f * bytes * 50 / 1e9 / (ms / 1000.0f));
}
cudaFree(d_src);
cudaFree(d_dst);
return 0;
}
Copy Kernel Bandwidth: Scalar vs Vectorized (A100, 256 MB)
| Access Width | Bytes/Thread/Load | Bandwidth (GB/s) | % of Peak (2039 GB/s) |
|---|---|---|---|
| float (32-bit) | 4 | 1420 | 69.6% |
| float2 (64-bit) | 8 | 1720 | 84.4% |
| float4 (128-bit) | 16 | 1870 | 91.7% |
[Figure: copy bandwidth (GB/s) by vector width — same data as the table above]
Vectorized Loads with Computation
The real benefit appears in compute kernels where load instructions are on the critical path. Reducing load instruction count frees up the load/store unit for other work:
#include <cuda_runtime.h>
#include <cmath>
// Scalar version: applies ReLU element-wise
__global__ void relu_scalar(const float* __restrict__ input,
float* __restrict__ output, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n; i += stride) {
float val = input[i];
output[i] = val > 0.0f ? val : 0.0f;
}
}
// float4 version: processes 4 elements per thread per iteration
__global__ void relu_float4(const float* __restrict__ input,
float* __restrict__ output, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
int n4 = n / 4;
const float4* in4 = reinterpret_cast<const float4*>(input);
float4* out4 = reinterpret_cast<float4*>(output);
for (int i = idx; i < n4; i += stride) {
float4 val = in4[i];
val.x = val.x > 0.0f ? val.x : 0.0f;
val.y = val.y > 0.0f ? val.y : 0.0f;
val.z = val.z > 0.0f ? val.z : 0.0f;
val.w = val.w > 0.0f ? val.w : 0.0f;
out4[i] = val;
}
// Handle remainder
int remainder_start = n4 * 4;
for (int i = remainder_start + idx; i < n; i += stride) {
float val = input[i];
output[i] = val > 0.0f ? val : 0.0f;
}
}
Fused GELU with Vectorized Access
__device__ __forceinline__ float gelu_approx(float x) {
// Approximate GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
const float kSqrt2OverPi = 0.7978845608f;
const float kCoeff = 0.044715f;
float x3 = x * x * x;
float inner = kSqrt2OverPi * (x + kCoeff * x3);
return 0.5f * x * (1.0f + tanhf(inner));
}
__global__ void gelu_float4(const float* __restrict__ input,
float* __restrict__ output, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
int n4 = n / 4;
const float4* in4 = reinterpret_cast<const float4*>(input);
float4* out4 = reinterpret_cast<float4*>(output);
for (int i = idx; i < n4; i += stride) {
float4 val = in4[i];
val.x = gelu_approx(val.x);
val.y = gelu_approx(val.y);
val.z = gelu_approx(val.z);
val.w = gelu_approx(val.w);
out4[i] = val;
}
}
GELU Kernel: Scalar vs float4 (A100, 64M elements)
| Version | Time (us) | Bandwidth (GB/s) | Speedup |
|---|---|---|---|
| Scalar | 375 | 1430 | 1.0x |
| float4 | 286 | 1880 | 1.31x |
Vectorized Loads for Half-Precision (FP16/BF16)
For half-precision data, the natural vector types are half2 (32 bits) and __nv_bfloat162 (32 bits), or — for full 128-bit transactions — eight halves packed into a float4:
#include <cuda_fp16.h>
#include <cuda_bf16.h>
// half2: 2 half-precision values in 32 bits
// Each thread loads 2 elements, warp loads 64 elements per instruction
__global__ void relu_half2(const half* __restrict__ input,
half* __restrict__ output, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int n2 = n / 2;
const half2* in2 = reinterpret_cast<const half2*>(input);
half2* out2 = reinterpret_cast<half2*>(output);
if (idx < n2) {
half2 val = in2[idx];
half2 zero = __float2half2_rn(0.0f);
// half2 comparison and max
val = __hmax2(val, zero);
out2[idx] = val;
}
}
// 128-bit load for half: pack 8 halves into float4 (16 bytes)
// Each thread loads 8 half values, warp loads 256 half values per instruction
__global__ void relu_half_128bit(const half* __restrict__ input,
half* __restrict__ output, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int n8 = n / 8;
const float4* in4 = reinterpret_cast<const float4*>(input);
float4* out4 = reinterpret_cast<float4*>(output);
if (idx < n8) {
float4 raw = in4[idx]; // 128-bit load: 8 halves
// Reinterpret as half2 pairs for vectorized operations
half2* h2 = reinterpret_cast<half2*>(&raw);
half2 zero = __float2half2_rn(0.0f);
h2[0] = __hmax2(h2[0], zero);
h2[1] = __hmax2(h2[1], zero);
h2[2] = __hmax2(h2[2], zero);
h2[3] = __hmax2(h2[3], zero);
out4[idx] = raw;
}
}
ReLU Kernel for FP16: Access Width Comparison (A100, 128M elements)
| Access Pattern | Elements/Thread | Bandwidth (GB/s) | Speedup |
|---|---|---|---|
| half (16-bit scalar) | 1 | 920 | 1.0x |
| half2 (32-bit) | 2 | 1410 | 1.53x |
| float4 (128-bit, 8 halves) | 8 | 1840 | 2.0x |
Handling Non-Multiple-of-4 Sizes
Real data is not always a multiple of 4. You need to handle the remainder without performance penalty:
// Strategy 1: Process bulk with float4, remainder with scalar
__global__ void process_with_remainder(const float* __restrict__ input,
float* __restrict__ output,
int n) {
const float4* in4 = reinterpret_cast<const float4*>(input);
float4* out4 = reinterpret_cast<float4*>(output);
int n4 = n / 4;
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
// Vectorized bulk
for (int i = idx; i < n4; i += stride) {
float4 val = in4[i];
val.x = val.x * 2.0f + 1.0f;
val.y = val.y * 2.0f + 1.0f;
val.z = val.z * 2.0f + 1.0f;
val.w = val.w * 2.0f + 1.0f;
out4[i] = val;
}
// Scalar remainder (at most 3 elements total, handled by first 3 threads)
int rem_start = n4 * 4;
if (idx < (n - rem_start)) {
int i = rem_start + idx;
output[i] = input[i] * 2.0f + 1.0f;
}
}
// Strategy 2: Pad allocation to multiple of 4, avoid remainder entirely
void allocate_padded(float** d_ptr, int n) {
int n_padded = ((n + 3) / 4) * 4;
cudaMalloc(d_ptr, n_padded * sizeof(float));
// Initialize padding to zero (or any safe value)
cudaMemset(*d_ptr + n, 0, (n_padded - n) * sizeof(float));
}
Strategy 3: Template-Based Vector Width Selection
template <int VEC_SIZE>
struct VecType;
template <>
struct VecType<1> { using type = float; };
template <>
struct VecType<2> { using type = float2; };
template <>
struct VecType<4> { using type = float4; };
template <int VEC_SIZE>
__global__ void vectorized_scale(const float* __restrict__ input,
float* __restrict__ output,
float scale, int n) {
using vec_t = typename VecType<VEC_SIZE>::type;
const vec_t* in_vec = reinterpret_cast<const vec_t*>(input);
vec_t* out_vec = reinterpret_cast<vec_t*>(output);
int n_vec = n / VEC_SIZE;
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
for (int i = idx; i < n_vec; i += stride) {
vec_t val = in_vec[i];
float* components = reinterpret_cast<float*>(&val);
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
components[j] *= scale;
}
out_vec[i] = val;
}
}
// Launch with appropriate vector size
void launch(const float* d_in, float* d_out, float scale, int n) {
int block = 256;
    if (n % 4 == 0 && (uintptr_t)d_in % 16 == 0 && (uintptr_t)d_out % 16 == 0) {
        int grid = (n / 4 + block - 1) / block;
        vectorized_scale<4><<<grid, block>>>(d_in, d_out, scale, n);
    } else if (n % 2 == 0 && (uintptr_t)d_in % 8 == 0 && (uintptr_t)d_out % 8 == 0) {
        int grid = (n / 2 + block - 1) / block;
        vectorized_scale<2><<<grid, block>>>(d_in, d_out, scale, n);
    } else {
} else {
int grid = (n + block - 1) / block;
vectorized_scale<1><<<grid, block>>>(d_in, d_out, scale, n);
}
}
Vectorized Loads in Reduction Kernels
Vectorized loads are especially effective in reductions, where each thread accumulates over many elements:
#include <cuda_runtime.h>
// Scalar reduction: each thread sums n/gridDim elements
__global__ void sum_scalar(const float* __restrict__ data,
float* __restrict__ partial_sums, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
float sum = 0.0f;
for (int i = idx; i < n; i += stride) {
sum += data[i];
}
// Warp reduction
for (int offset = 16; offset > 0; offset >>= 1) {
sum += __shfl_down_sync(0xffffffff, sum, offset);
}
if (threadIdx.x % 32 == 0) {
atomicAdd(&partial_sums[blockIdx.x], sum);
}
}
// float4 reduction: each thread sums 4 elements per load
__global__ void sum_float4(const float* __restrict__ data,
float* __restrict__ partial_sums, int n) {
const float4* data4 = reinterpret_cast<const float4*>(data);
int n4 = n / 4;
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
float sum = 0.0f;
for (int i = idx; i < n4; i += stride) {
float4 val = data4[i];
sum += val.x + val.y + val.z + val.w;
}
// Handle remainder
int rem_start = n4 * 4 + idx;
if (rem_start < n) sum += data[rem_start];
// Warp reduction
for (int offset = 16; offset > 0; offset >>= 1) {
sum += __shfl_down_sync(0xffffffff, sum, offset);
}
if (threadIdx.x % 32 == 0) {
atomicAdd(&partial_sums[blockIdx.x], sum);
}
}
Sum Reduction: Scalar vs float4 (A100, 256M floats)
| Version | Time (us) | Bandwidth (GB/s) | Speedup |
|---|---|---|---|
| Scalar | 756 | 1420 | 1.0x |
| float4 | 571 | 1880 | 1.32x |
Vectorized Loads for Strided and Non-Contiguous Access
Vectorized loads only help when the access pattern is contiguous. For strided patterns, they can hurt:
// AoS layout: fields are interleaved, stride = sizeof(Particle) / sizeof(float)
struct Particle {
float x, y, z;
float vx, vy, vz;
float mass;
float padding; // Pad to 32 bytes for alignment
};
// BAD: float4 load on AoS grabs (x,y,z,vx) — if you only need (x,y,z), waste 25%
__global__ void process_aos_float4(const Particle* particles, float* output, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n) {
// This loads the entire particle (32 bytes), but if we only need position...
Particle p = particles[idx]; // Generates two LDG.128 (32 bytes)
output[idx] = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
}
}
// GOOD: SoA layout, float4 loads on contiguous position arrays
struct ParticlesSoA {
float* x;
float* y;
float* z;
float* vx;
float* vy;
float* vz;
float* mass;
};
__global__ void process_soa_float4(const ParticlesSoA p, float* output, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int n4 = n / 4;
if (idx < n4) {
const float4* x4 = reinterpret_cast<const float4*>(p.x);
const float4* y4 = reinterpret_cast<const float4*>(p.y);
const float4* z4 = reinterpret_cast<const float4*>(p.z);
float4* out4 = reinterpret_cast<float4*>(output);
float4 xx = x4[idx];
float4 yy = y4[idx];
float4 zz = z4[idx];
float4 result;
result.x = sqrtf(xx.x * xx.x + yy.x * yy.x + zz.x * zz.x);
result.y = sqrtf(xx.y * xx.y + yy.y * yy.y + zz.y * zz.y);
result.z = sqrtf(xx.z * xx.z + yy.z * yy.z + zz.z * zz.z);
result.w = sqrtf(xx.w * xx.w + yy.w * yy.w + zz.w * zz.w);
out4[idx] = result;
}
}
The SoA layout ensures each float4 load brings in 4 consecutive values of the same field. Combined with vectorized loads, this is the optimal memory access pattern for GPU kernels. AoS with float4 loads wastes bandwidth on unused fields and prevents coalescing.
Vectorized Loads in Shared Memory Operations
Vectorized loads can also speed up shared memory writes and reads, though shared memory has much higher bandwidth than global:
#define TILE_DIM 128
__global__ void tiled_operation(const float* __restrict__ global_data,
float* __restrict__ output,
int n) {
__shared__ float tile[TILE_DIM];
int block_start = blockIdx.x * TILE_DIM;
int tid = threadIdx.x;
// Vectorized global-to-shared load
// 128 threads, each loads float4 -> 128 * 4 = 512 elements
// But tile is only 128 elements. So 32 threads each load float4:
if (tid < TILE_DIM / 4) {
const float4* g4 = reinterpret_cast<const float4*>(global_data + block_start);
float4 val = g4[tid];
// Store to shared as scalar (shared memory doesn't benefit as much from float4)
tile[tid * 4 + 0] = val.x;
tile[tid * 4 + 1] = val.y;
tile[tid * 4 + 2] = val.z;
tile[tid * 4 + 3] = val.w;
}
__syncthreads();
// Process tile
if (tid < TILE_DIM && block_start + tid < n) {
float val = tile[tid];
output[block_start + tid] = val * val;
}
}
Compiler Hints and Pragmas for Vectorization
Sometimes the compiler vectorizes loads automatically. Other times it needs help:
// __restrict__ tells the compiler src and dst don't overlap
// This enables the compiler to generate vectorized loads
__global__ void auto_vectorize(const float* __restrict__ src,
float* __restrict__ dst, int n) {
int idx = (threadIdx.x + blockIdx.x * blockDim.x) * 4;
if (idx + 3 < n) {
// Compiler may auto-vectorize this to a single float4 load
dst[idx + 0] = src[idx + 0] * 2.0f;
dst[idx + 1] = src[idx + 1] * 2.0f;
dst[idx + 2] = src[idx + 2] * 2.0f;
dst[idx + 3] = src[idx + 3] * 2.0f;
}
}
// Explicit is better: cast to float4 directly
__global__ void explicit_vectorize(const float* __restrict__ src,
float* __restrict__ dst, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int n4 = n / 4;
if (idx < n4) {
float4 val = reinterpret_cast<const float4*>(src)[idx];
val.x *= 2.0f; val.y *= 2.0f; val.z *= 2.0f; val.w *= 2.0f;
reinterpret_cast<float4*>(dst)[idx] = val;
}
}
Do not trust that the compiler will auto-vectorize. Always inspect the SASS output with cuobjdump -sass binary | grep LDG to verify whether LDG.128 instructions are generated. Explicit float4 casts are more reliable and make the code self-documenting.
Verifying Vectorization in SASS
# Compile
nvcc -arch=sm_80 -o kernel kernel.cu
# Dump SASS
cuobjdump -sass kernel | grep -E "LDG|STG"
# Expected output for float4 kernel:
# LDG.E.128.SYS R4, [R2] ; <-- 128-bit load
# STG.E.128.SYS [R6], R4 ; <-- 128-bit store
# Scalar kernel would show:
# LDG.E.SYS R4, [R2] ; <-- 32-bit load
# STG.E.SYS [R6], R4 ; <-- 32-bit store
Vectorized Loads in Production: Layer Norm Example
A complete example showing vectorized loads in a production-relevant kernel:
#include <cuda_runtime.h>
#include <cuda_fp16.h>
// Layer normalization with float4 vectorized access
// input: [batch, hidden], output: [batch, hidden]
// gamma, beta: [hidden]
__global__ void layer_norm_float4(const float* __restrict__ input,
const float* __restrict__ gamma,
const float* __restrict__ beta,
float* __restrict__ output,
int hidden_size, float eps) {
int batch_idx = blockIdx.x;
const float* row_in = input + batch_idx * hidden_size;
float* row_out = output + batch_idx * hidden_size;
int tid = threadIdx.x;
int stride = blockDim.x;
int hidden4 = hidden_size / 4;
const float4* row_in4 = reinterpret_cast<const float4*>(row_in);
const float4* gamma4 = reinterpret_cast<const float4*>(gamma);
const float4* beta4 = reinterpret_cast<const float4*>(beta);
float4* row_out4 = reinterpret_cast<float4*>(row_out);
// Pass 1: compute mean (vectorized)
float local_sum = 0.0f;
for (int i = tid; i < hidden4; i += stride) {
float4 val = row_in4[i];
local_sum += val.x + val.y + val.z + val.w;
}
// Block reduction for mean
    __shared__ float shared_sum[256]; // assumes blockDim.x is a power of two, at most 256
shared_sum[tid] = local_sum;
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) shared_sum[tid] += shared_sum[tid + s];
__syncthreads();
}
    float mean = shared_sum[0] / hidden_size;
    __syncthreads(); // shared_sum is reused for the variance; all threads must read the mean first
// Pass 2: compute variance (vectorized)
float local_var = 0.0f;
for (int i = tid; i < hidden4; i += stride) {
float4 val = row_in4[i];
float d0 = val.x - mean, d1 = val.y - mean;
float d2 = val.z - mean, d3 = val.w - mean;
local_var += d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3;
}
shared_sum[tid] = local_var;
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) shared_sum[tid] += shared_sum[tid + s];
__syncthreads();
}
float variance = shared_sum[0] / hidden_size;
float inv_std = rsqrtf(variance + eps);
// Pass 3: normalize and apply affine transform (vectorized)
for (int i = tid; i < hidden4; i += stride) {
float4 val = row_in4[i];
float4 g = gamma4[i];
float4 b = beta4[i];
float4 result;
result.x = (val.x - mean) * inv_std * g.x + b.x;
result.y = (val.y - mean) * inv_std * g.y + b.y;
result.z = (val.z - mean) * inv_std * g.z + b.z;
result.w = (val.w - mean) * inv_std * g.w + b.w;
row_out4[i] = result;
}
}
[Figure: Layer Norm bandwidth (GB/s), scalar vs float4 — A100, batch=512, hidden=4096]
Summary
Vectorized loads (float4, int4, double2) issue 128-bit memory transactions per thread instead of 32-bit. This moves 4x more data per load instruction, directly increasing effective bandwidth for memory-bound kernels. The requirements are simple: 16-byte aligned pointers and element counts divisible by 4 (with scalar fallback for remainders). Use reinterpret_cast to convert between scalar and vector pointers, verify with SASS inspection (LDG.E.128), and prefer explicit float4 casts over relying on compiler auto-vectorization. For FP16 data, pack 8 halves into a float4 for 128-bit access. Combined with SoA data layouts, vectorized loads routinely recover 20-50% of peak bandwidth that scalar access leaves on the table.