If you write CUDA kernels, memory is almost certainly your bottleneck. Not compute. Not synchronization. Memory. On a modern GPU like the H100, peak compute throughput is approximately 1,979 TFLOPS for FP16 tensor operations, but peak memory bandwidth is 3,350 GB/s. That ratio — roughly 590 FLOP per byte — means that any operation with arithmetic intensity below 590 needs data faster than HBM can deliver it. Matrix multiplications can sometimes exceed this threshold. Almost everything else — elementwise ops, reductions, normalization, attention — is memory-bound.
This post is the complete treatment of GPU memory. We will walk through every level of the hierarchy, from registers to HBM, with real measurements, code examples, and the optimization strategies that matter in practice. The goal is to give you a mental model precise enough to predict kernel performance before you write a single line of profiling code.
The 1000x Bandwidth Gap
The GPU memory hierarchy spans roughly three orders of magnitude in bandwidth and four in latency:
GPU Memory Hierarchy Comparison (H100 SXM)
| Memory Type | Size per SM | Total Size | Latency (cycles) | Bandwidth | Scope |
|---|---|---|---|---|---|
| Registers | 256 KB | ~33 MB (132 SMs) | 1 | ~100+ TB/s (aggregate) | Thread |
| Shared Memory (SRAM) | Up to 228 KB | ~29 MB (132 SMs) | ~20-30 | ~3+ TB/s (aggregate) | Block |
| L1 Cache | 256 KB (shared pool) | -- | ~30-40 | ~1.5-2 TB/s | SM |
| L2 Cache | -- | 50 MB | ~200 | ~12 TB/s | GPU |
| Global Memory (HBM3) | -- | 80 GB | ~400-600 | 3,350 GB/s | GPU + Host |
The critical insight: register access is free (1 cycle, zero overhead), shared memory is cheap (~25 cycles), and global memory is expensive (~500 cycles). A kernel that can keep its working data in registers and shared memory will be compute-bound. A kernel that constantly reads from HBM will be memory-bound. The vast majority of kernels are memory-bound.
On an A100, the compute-to-bandwidth ratio for FP16 tensor ops is roughly 153 FLOP/byte (312 TFLOPS / 2,039 GB/s). On H100, it rises to roughly 590 FLOP/byte (1,979 TFLOPS / 3,350 GB/s). Any kernel with arithmetic intensity below this threshold is memory-bound. For reference, elementwise operations have an intensity of ~1 FLOP/byte. Reductions: ~1 FLOP/byte. Layer normalization: ~5 FLOP/byte. Even matrix multiplication at small sizes can be memory-bound. Only large GEMMs consistently exceed the threshold.
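The ridge-point arithmetic is simple enough to encode directly. A host-side C++ sketch (the peak numbers plugged in below are the published specs quoted above):

```cpp
#include <cassert>

// Ridge point of the roofline: the FLOP/byte ratio at which a kernel
// transitions from memory-bound to compute-bound. Dividing peak TFLOPS
// by peak GB/s yields FLOP per byte.
double ridge_point(double peak_tflops, double peak_gbs) {
    return peak_tflops * 1e12 / (peak_gbs * 1e9);
}

// A kernel below the ridge point cannot keep the compute units fed.
bool is_memory_bound(double flops_per_byte, double ridge) {
    return flops_per_byte < ridge;
}
```

Plug in your kernel's estimated FLOP/byte to predict which side of the roofline it lands on before profiling.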
Architecture Comparison: V100 vs A100 vs H100
Before diving into each memory level, here is how the memory subsystem has evolved across three generations:
Memory Subsystem Evolution: V100, A100, H100
| Spec | V100 (Volta) | A100 (Ampere) | H100 (Hopper) |
|---|---|---|---|
| SMs | 80 | 108 | 132 |
| Registers per SM | 256 KB | 256 KB | 256 KB |
| Max shared memory per SM | 96 KB | 164 KB | 228 KB |
| L1 cache per SM | 128 KB (shared pool) | 192 KB (shared pool) | 256 KB (shared pool) |
| L2 cache total | 6 MB | 40 MB | 50 MB |
| HBM type | HBM2 | HBM2e | HBM3 |
| HBM capacity | 16-32 GB | 40-80 GB | 80 GB |
| HBM bandwidth | 900 GB/s | 2,039 GB/s | 3,350 GB/s |
| Memory bus width | 4096-bit | 5120-bit | 5120-bit |
| Async copy (cp.async) | No | Yes | Yes (enhanced TMA) |
| L2 residency control | No | Partial | Full |
Key evolutionary trends: shared memory capacity has more than doubled (96 KB to 228 KB), L2 cache has grown 8x (6 MB to 50 MB), and HBM bandwidth has nearly quadrupled (900 GB/s to 3,350 GB/s). Each generation makes the memory hierarchy deeper and wider, but the fundamental optimization principles remain the same. The biggest architectural addition is the Tensor Memory Accelerator (TMA) on Hopper, which offloads complex addressing patterns from the SM entirely.
Level 1: Registers — The Fastest Memory You Cannot See
The Basics
Every CUDA thread has access to a private set of registers. On all recent NVIDIA architectures (Volta, Ampere, Hopper), each SM has a 256 KB register file divided into 65,536 32-bit registers. These registers are allocated to threads at launch time and persist for the thread’s entire lifetime.
Registers are the fastest memory on the GPU:
- Latency: 1 cycle — the same as an arithmetic instruction
- Bandwidth: Effectively unlimited, since register reads/writes are part of the instruction pipeline
- Scope: Private to each thread (no sharing, no synchronization needed)
Any local variable in your CUDA kernel that the compiler can keep in a register is free to access. The cost is zero. This is fundamentally different from CPU programming where register allocation is mostly invisible — on GPUs, register usage directly impacts how many threads can run simultaneously.
Register Pressure and the Occupancy Tradeoff
Here is the core tension: each thread that uses more registers reduces the number of threads that can coexist on the same SM. The 65,536 registers per SM are shared among all active threads. If your kernel uses 64 registers per thread and you launch blocks of 256 threads, that is 64 × 256 = 16,384 registers per block. The SM can fit at most 65,536 / 16,384 = 4 such blocks, for 4 × 256 = 1,024 active threads. The maximum is 2,048 threads per SM on Ampere/Hopper, so you are at 50% occupancy.
Occupancy vs Registers per Thread (256 threads/block, Ampere)
| Registers per Thread | 16 | 32 | 48 | 64 | 80 | 96 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|
| Occupancy (%) | 100 | 100 | 62.5 | 50 | 37.5 | 25 | 25 | 12.5 |

(Theoretical occupancy from the register limit alone.)
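These occupancy numbers follow from integer arithmetic on the SM's register budget. A host-side sketch (theoretical occupancy from the register limit alone; real occupancy also depends on shared memory usage and the per-SM block limit):

```cpp
#include <algorithm>
#include <cassert>

// Ampere-class SM limits: 65,536 registers, 2,048 resident threads max.
int active_threads(int regs_per_thread, int threads_per_block) {
    const int regs_per_sm = 65536, max_threads = 2048;
    // How many whole blocks fit in the register file?
    int blocks = regs_per_sm / (regs_per_thread * threads_per_block);
    // Thread count is capped by the hardware resident-thread limit.
    return std::min(blocks * threads_per_block, max_threads);
}

double occupancy_pct(int regs_per_thread, int threads_per_block) {
    return 100.0 * active_threads(regs_per_thread, threads_per_block) / 2048;
}
```

The 64-register case from the paragraph above comes out to exactly 50%.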
Lower occupancy means fewer warps available for the warp scheduler to choose from, which reduces the GPU’s ability to hide memory latency. When a warp stalls waiting for a global memory load (400-600 cycles), the scheduler switches to another ready warp. With fewer warps, there may not be enough ready warps to keep the execution units busy during these stalls.
However, occupancy is not the only factor. A kernel with 50% occupancy that keeps all data in registers can outperform a kernel with 100% occupancy that repeatedly loads from global memory. This is the register pressure vs occupancy tradeoff, and there is no universal answer — it depends on your kernel’s arithmetic intensity and memory access patterns.
Aim for at least 50% occupancy (or about 32+ active warps per SM). Below that, latency hiding becomes difficult. Above 50%, further occupancy gains typically yield diminishing returns. If you can keep critical data in registers at 50% occupancy instead of loading from shared or global memory at 100% occupancy, the register-heavy approach often wins.
Register Spilling: The Silent Catastrophe
When the compiler cannot fit all local variables into the available registers, it spills them to local memory. Despite the misleading name, local memory is not fast on-chip storage — it is physically located in global memory (HBM), backed by the L1 and L2 caches. A register spill turns a 1-cycle access into a potential 400-600 cycle access if it misses both caches.
You can detect register spilling in the compilation output:
# Compile with verbose output to see register usage and spills
nvcc -Xptxas -v my_kernel.cu
# Output includes lines like:
# ptxas info: Used 64 registers, 48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
# If "spill stores" or "spill loads" are non-zero, you have a problem.
Even a few spilled registers can devastate performance. Each spill generates a store to local memory and a later load from local memory. If the spilled value is in a hot loop, this happens on every iteration. We have measured 3-5x slowdowns from register spilling in production kernels. Always compile with -Xptxas -v and check for spills.
Controlling Register Allocation with __launch_bounds__
You can tell the compiler the maximum number of threads per block and optionally the minimum number of blocks per SM. The compiler uses this information to optimize register allocation:
// Tell the compiler: this kernel will be launched with at most 256 threads/block
// and we want at least 2 blocks per SM
__global__ void __launch_bounds__(256, 2)
my_kernel(float *input, float *output, int n) {
// With 256 threads and min 2 blocks, the compiler knows it has at most
// 65536 / (256 * 2) = 128 registers per thread to work with.
// It will aggressively try to stay under this limit to avoid spilling.
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float a = input[idx];
float b = a * a + 3.14159f;
float c = sqrtf(b) * a;
output[idx] = c;
}
}
Without __launch_bounds__, the compiler assumes worst-case scenarios and may use more registers than necessary (to avoid spills) or spill more than necessary (to maintain higher occupancy). Providing explicit bounds lets the compiler make better decisions.
// Another approach: directly limit registers per thread (CUDA 12+).
// This overrides the compiler's register allocation decisions;
// the per-file nvcc flag -maxrregcount=64 has a similar effect.
__global__ void __launch_bounds__(256)
__maxnreg__(64) // Hard limit: 64 registers per thread
my_kernel_constrained(float *data, int n) {
// If the kernel needs more than 64 registers, the compiler WILL spill.
// Use this only when you know the register budget precisely.
}
Impact of __launch_bounds__ on a Real Kernel (Softmax, A100)
| Configuration | Registers/Thread | Occupancy | Spills | Throughput |
|---|---|---|---|---|
| No launch bounds | 96 | 33% | 0 bytes | 1,200 GB/s |
| __launch_bounds__(256, 2) | 64 | 50% | 0 bytes | 1,550 GB/s |
| __launch_bounds__(256, 4) | 32 | 100% | 128 bytes | 1,100 GB/s |
The sweet spot is clear: __launch_bounds__(256, 2) at 50% occupancy outperformed both the unconstrained version (too few warps for latency hiding) and the over-constrained version (spilling negated the occupancy gain).
Level 2: Shared Memory (SRAM) — The Programmer’s Scratchpad
Architecture and Sizing
Shared memory is fast, on-chip SRAM that sits physically within the SM. It is shared by all threads in a thread block, making it ideal for cooperative data access patterns. Unlike registers (private, implicit) or caches (automatic, hardware-managed), shared memory is explicitly managed by the programmer — you decide what goes in, when it goes in, and how it is laid out.
The capacity has grown significantly across generations:
- V100 (Volta): Up to 96 KB per SM
- A100 (Ampere): Up to 164 KB per SM (configurable, shared with L1)
- H100 (Hopper): Up to 228 KB per SM (configurable, shared with L1)
On Ampere and Hopper, the on-chip memory is a unified pool that can be split between shared memory and L1 cache. You can configure the split per kernel:
// Request maximum shared memory for this kernel
cudaFuncSetAttribute(
my_kernel,
cudaFuncAttributeMaxDynamicSharedMemorySize,
163 * 1024 // 163 KB max per thread block on A100 (1 KB is reserved)
);
// Or set the preferred carveout ratio
cudaFuncSetAttribute(
my_kernel,
cudaFuncAttributePreferredSharedMemoryCarveout,
cudaSharedmemCarveoutMaxShared
);
Shared Memory / L1 Cache Configurations (A100)
| Config | Shared Memory | L1 Cache | Best For |
|---|---|---|---|
| Default | 48 KB | 144 KB | Most kernels -- streaming access benefits from large L1 |
| Max SMEM | 164 KB | 28 KB | Large tiles: GEMM, FlashAttention, convolution |
| Max L1 | 28 KB | 164 KB | Streaming kernels with poor data reuse |
Why Shared Memory Matters: Data Reuse
The core value proposition is data reuse without HBM round-trips. Load data from global memory into shared memory once, then access it many times on-chip: each reuse costs ~25 cycles instead of a 400-600 cycle HBM round-trip, and consumes no HBM bandwidth.
Kernel Performance: With vs Without Shared Memory Tiling (V100)
(chart: effective bandwidth in GB/s, with vs without tiling)

The GEMM result is the most striking: tiling with shared memory gives a 15x speedup because it converts the 2N global memory accesses per output element into 2N/T (for T×T tiles), with the rest served from shared memory. Each element loaded from HBM is reused T times, amortizing the cost of the slow global load.
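The reuse arithmetic is worth making concrete. A small host-side sketch of global loads per output element for an N×N GEMM, naive vs T×T tiling (a simplification that ignores boundary tiles):

```cpp
#include <cassert>

// Naive GEMM: each output element walks a full row of A and column of B,
// so 2*N global loads per output element.
long long naive_loads_per_output(long long N) { return 2 * N; }

// Tiled GEMM: a block of T*T threads loads 2*T*T elements per tile step,
// for N/T steps, and those loads are shared by T*T output elements,
// leaving 2*N/T global loads per output element.
long long tiled_loads_per_output(long long N, long long T) { return 2 * N / T; }
```

For N = 4096 and T = 32, tiling cuts global loads per output element from 8,192 to 256, a 32x reduction in HBM traffic.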
Bank Conflicts: The Hidden Performance Killer
Shared memory is organized into 32 banks, each 4 bytes wide. In a single clock cycle, the hardware can service one request per bank. When 32 threads in a warp each access a different bank, all accesses proceed in parallel — full throughput. But when multiple threads access different addresses that map to the same bank, those accesses are serialized.
The bank for a given byte address is: bank = (byte_address / 4) mod 32.
Two threads conflict when they access different 4-byte words in the same bank. One important exception: when multiple threads read the exact same address, the hardware broadcasts the value to all requesting threads at no extra cost.
Bank Conflict Impact on Shared Memory Bandwidth (V100)
| Conflict Degree | Access Pattern | Effective Bandwidth | Penalty |
|---|---|---|---|
| None (1-way) | Stride 1 -- consecutive | 2.8 TB/s | 1x (optimal) |
| 2-way | Stride 2 | 1.4 TB/s | 2x slower |
| 4-way | Stride 4 | 0.7 TB/s | 4x slower |
| 8-way | Stride 8 | 0.35 TB/s | 8x slower |
| 32-way | Stride 32 (all same bank) | 0.09 TB/s | 32x slower -- full serialization |
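The strides in the table map directly onto the bank formula. A host-side sketch that computes the conflict degree for a warp of 32 threads accessing data[tid * stride] (4-byte elements assumed; distinct words per bank are counted, so same-word broadcasts are conflict-free):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>

// bank = (byte_address / 4) % 32: 32 banks, each 4 bytes wide.
int bank_of(unsigned byte_address) { return (byte_address / 4) % 32; }

// Worst-case serialization for 32 threads reading data[tid * stride].
int conflict_degree(int stride) {
    std::map<int, std::set<unsigned>> words_per_bank;
    for (int tid = 0; tid < 32; ++tid) {
        unsigned addr = tid * stride * 4u;
        words_per_bank[bank_of(addr)].insert(addr / 4); // distinct words only
    }
    int worst = 0;
    for (const auto& kv : words_per_bank)
        worst = std::max<int>(worst, (int)kv.second.size());
    return worst;
}
```

Stride 0 (all threads reading one word) comes out conflict-free, which is exactly the broadcast exception.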
Let us look at concrete examples:
__shared__ float data[1024];
// NO bank conflicts: stride 1 (consecutive access)
// Thread 0 -> data[0] (bank 0), Thread 1 -> data[1] (bank 1), ...
// Thread 31 -> data[31] (bank 31). All 32 banks used, zero conflicts.
float val = data[threadIdx.x];
// 2-WAY bank conflicts: stride 2
// Thread 0 -> data[0] (bank 0), Thread 1 -> data[2] (bank 2), ...
// Thread 16 -> data[32] (bank 0 again!). Banks 0,2,4,...30 each hit twice.
float val = data[threadIdx.x * 2];
// 32-WAY bank conflicts: stride 32 (catastrophic!)
// Thread 0 -> data[0] (bank 0), Thread 1 -> data[32] (bank 0), ...
// ALL threads hit bank 0. Fully serialized.
float val = data[threadIdx.x * 32];
The Padding Trick
The most common bank conflict occurs when accessing columns of a 2D shared memory array. If you have a 32x32 tile and threads access it column-wise, every thread hits the same bank:
// PROBLEM: Column access on a 32-wide array causes 32-way bank conflicts
__shared__ float tile[32][32];
// Thread k reads tile[k][col] for some fixed col.
// Address of tile[k][col] = (k * 32 + col) * 4 bytes
// Bank = (k * 32 + col) % 32 = col (for all k!)
// All 32 threads hit bank 'col'. Full serialization.
float val = tile[threadIdx.x][some_col];
// FIX: Pad each row by 1 element
__shared__ float tile[32][33]; // 33 instead of 32!
// Address of tile[k][col] = (k * 33 + col) * 4 bytes
// Bank = (k * 33 + col) % 32 = (k + col) % 32
// Different k values -> different banks. Zero conflicts!
float val = tile[threadIdx.x][some_col];
The cost of padding is 32 extra floats (128 bytes) per tile — completely negligible. The benefit can be 30%+ speedup.
For any shared memory array that will be accessed by column, declare it as float tile[N][N+1]. This eliminates all column-access bank conflicts at the cost of 4N bytes of wasted padding (one float per row). Works for any power-of-two N.
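The padding claim is easy to verify by enumerating banks. A host-side sketch counting how many distinct banks a column read touches, for row widths of 32 (unpadded) and 33 (padded):

```cpp
#include <cassert>
#include <set>

// Banks touched when 32 threads read column 'col' of a [32][W] float tile:
// thread k reads element (k, col), word index k*W + col, bank (k*W+col) % 32.
// W = 32: every thread hits the same bank. W = 33: all 32 banks are touched.
int banks_touched_column_read(int W, int col) {
    std::set<int> banks;
    for (int k = 0; k < 32; ++k)
        banks.insert((k * W + col) % 32);
    return (int)banks.size();
}
```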
Detecting Bank Conflicts
Nsight Compute reports bank conflicts directly:
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared \
./my_kernel
If l1tex__data_bank_conflicts_pipe_lsu_mem_shared is high (over 10% of shared memory accesses), you have a problem worth fixing.
Async Copy: cp.async (Ampere+)
Traditionally, loading data from global to shared memory requires two steps: load from global memory into a register, then store the register into shared memory. This occupies registers and serializes the load-store sequence.
Ampere introduced cp.async, which copies data directly from global memory to shared memory without going through registers. This has two benefits: it frees up registers, and it allows overlap of the copy with computation on previously loaded data.
// Traditional: Global -> Register -> Shared (2 steps, uses registers)
__shared__ float smem[BLOCK_SIZE];
smem[threadIdx.x] = global[idx]; // Implicit: load to register, then store to smem
__syncthreads();
// Async (Ampere+): Global -> Shared directly (bypasses registers)
#include <cuda/pipeline>
__shared__ float smem[BLOCK_SIZE];
auto pipe = cuda::make_pipeline();
pipe.producer_acquire();
cuda::memcpy_async(&smem[threadIdx.x], &global[idx], sizeof(float), pipe);
pipe.producer_commit();
// ... do other useful work while the copy is in flight ...
pipe.consumer_wait(); // Block until copy completes
__syncthreads();
// Now smem is ready to use; call pipe.consumer_release() when finished with it
On Hopper, this is taken further with the Tensor Memory Accelerator (TMA), which can load multi-dimensional tiles from global memory to shared memory in a single instruction, handling all the address calculation and boundary checking in hardware.
Async Copy Impact on Tiled Kernels (A100)
| Kernel | Without cp.async | With cp.async | Speedup |
|---|---|---|---|
| GEMM (CUTLASS) | 280 TFLOPS | 295 TFLOPS | 1.05x |
| FlashAttention-2 | 145 TFLOPS | 156 TFLOPS | 1.07x |
| Custom reduction | 1,600 GB/s | 1,750 GB/s | 1.09x |
Common Shared Memory Patterns
Tiling (general): Load a tile of data from global memory into shared memory, process it, then move to the next tile. The key is that each element loaded from global memory is accessed multiple times from shared memory. GEMM is the canonical example: each element of the A and B tiles is used T times, where T is the tile width.
Reduction: Load N elements into shared memory, reduce in parallel in log2(N) steps. Each step halves the active threads. Use sequential addressing (not interleaved) to minimize bank conflicts in later steps and avoid warp divergence.
Stencil / convolution: Load a tile plus a halo region into shared memory. Each thread computes its output from the cached neighborhood. This avoids redundant global memory loads for overlapping receptive fields.
Histogram: Use shared memory for per-block partial histograms, then reduce across blocks in a second kernel. This avoids expensive atomic contention on global memory.
Level 3: L1 and L2 Caches
L1 Cache
The L1 cache is a per-SM hardware-managed cache. On Ampere and Hopper, it shares a physical SRAM pool with shared memory (the configurable split we discussed above). Key characteristics:
- Size: Up to 128-256 KB per SM (depending on shared memory configuration)
- Line size: 128 bytes (32 floats)
- Latency: ~30-40 cycles
- Scope: Per-SM — only threads on the same SM benefit from cached data
The L1 cache is primarily useful for:
- Register spills: Local memory (spilled registers) is cached in L1
- Stack frames: Function call frames and local arrays go through L1
- Texture loads: The texture pipeline uses L1 for spatial locality caching
- Global loads (when enabled): On Volta+, global loads can optionally go through L1
On Ampere, the L1 cache also serves as the staging area for cp.async operations.
L2 Cache
The L2 cache is a GPU-wide shared cache. All global memory traffic passes through L2 — there is no way to bypass it. This makes it the most important cache on the GPU.
- Size: 6 MB (V100), 40 MB (A100), 50 MB (H100)
- Line size: 128 bytes
- Latency: ~200 cycles
- Bandwidth: Much higher than HBM — up to ~12 TB/s on H100
- Scope: Entire GPU — all SMs share the same L2
The L2 cache matters enormously for kernels that have temporal locality — when multiple SMs (or multiple kernel launches) access the same data. Examples include:
- KV cache in LLM inference: The same cached keys/values are read by each attention head
- Embedding tables: Popular embeddings are accessed frequently
- Multi-pass algorithms: Kernel fusion cannot always eliminate intermediate buffers
L2 Cache Impact by Working Set Size (A100, streaming read kernel)
| Working Set | Fits in L2? | L2 Hit Rate | Effective Bandwidth | Relative Perf |
|---|---|---|---|---|
| 4 MB | Yes | ~98% | ~5,000 GB/s | 2.8x |
| 20 MB | Yes | ~90% | ~4,000 GB/s | 2.2x |
| 40 MB | Barely | ~70% | ~2,500 GB/s | 1.4x |
| 200 MB | No | ~5% | ~1,800 GB/s | 1.0x (baseline) |
| 2 GB (random) | No | ~1% | ~300 GB/s | 0.17x |
L2 Residency Control (Ampere and Hopper)
Starting with Ampere, NVIDIA provides APIs to hint which data should persist in L2 and which should stream through:
// Tell the hardware: keep kv_cache in L2 as long as possible.
// The access policy window is attached to a stream; kernels launched
// on that stream inherit the persistence hint.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr = kv_cache_ptr;
attr.accessPolicyWindow.num_bytes = kv_cache_size; // Must fit the persisting L2 carveout
attr.accessPolicyWindow.hitRatio = 1.0f; // Try to keep 100% of this data in L2
attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting; // Persist in L2
attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming; // Misses stream through
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
On Hopper, this is enhanced with more granular control. You can partition the L2 into regions dedicated to specific data, ensuring that high-value data (like KV caches) is never evicted by streaming loads from other kernels.
The L1 and L2 caches both use 128-byte cache lines. When a warp issues a global memory load, the hardware fetches entire 128-byte lines. If 32 threads each load a consecutive 4-byte float, that is exactly 32 × 4 = 128 bytes = 1 cache line = 1 memory transaction. If threads load non-consecutive addresses, each unique cache line touched requires a separate transaction. This is why coalescing matters so much — it is fundamentally about minimizing the number of cache lines fetched.
Level 4: Global Memory (HBM) — The Bandwidth Bottleneck
HBM Basics
Global memory is the large off-chip DRAM attached to the GPU via High Bandwidth Memory (HBM) stacks. It is the only memory visible to the host (CPU) and the primary storage for all kernel data.
- Capacity: 16-80 GB depending on GPU model
- Technology: HBM2 (V100), HBM2e (A100), HBM3 (H100)
- Peak bandwidth: 900 GB/s (V100), 2,039 GB/s (A100), 3,350 GB/s (H100)
- Latency: 400-600 cycles — the GPU hides this with massive thread parallelism
- Bus width: 4096-5120 bits — wide bus, but still the bottleneck for most kernels
The latency of 400-600 cycles sounds devastating, but GPUs are designed to tolerate it. With thousands of active threads per SM, the warp scheduler can switch to a ready warp whenever the current warp stalls on a memory access. This is why occupancy matters — enough active warps must exist to cover the latency gap.
The math: if a memory access takes 500 cycles and the SM can issue 1 instruction per cycle, you need at least 500 instructions (from other warps) in flight to fully hide the latency. With 32 threads per warp and, say, 8 instructions per warp between memory accesses, you need roughly 500 / 8 ≈ 64 warps to keep the SM fully utilized. That is why a single SM supports up to 64 warps (2,048 threads).
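That latency-hiding estimate can be written down directly (a sketch; "instructions per warp between memory accesses" is workload-dependent):

```cpp
#include <cassert>

// Warps needed to cover a memory stall: with 'latency_cycles' to hide and
// 'instrs_per_warp' independent instructions per warp between memory
// accesses, the scheduler needs roughly latency / instrs other warps
// ready to issue (rounded up).
int warps_to_hide_latency(int latency_cycles, int instrs_per_warp) {
    return (latency_cycles + instrs_per_warp - 1) / instrs_per_warp;
}
```

At 500 cycles and 8 instructions per warp this lands right at the hardware's 64-warp-per-SM limit, which is not a coincidence.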
Memory Coalescing: THE Most Important Optimization
Memory coalescing is the single most impactful optimization for GPU kernels. Getting it right can mean 10-20x performance differences. Getting it wrong means your kernel runs at a fraction of peak bandwidth.
How coalescing works: When threads in a warp execute a load or store instruction, the hardware combines (coalesces) their individual memory requests into the minimum number of 128-byte cache line transactions. The best case: all 32 threads access consecutive 4-byte addresses aligned to a 128-byte boundary. This produces exactly 1 transaction. The worst case: all 32 threads access addresses in different cache lines, producing up to 32 transactions.
// PERFECT COALESCING: 1 transaction per warp
// Thread 0 -> data[0], Thread 1 -> data[1], ..., Thread 31 -> data[31]
// All addresses in one 128-byte line: data[0..31] = 128 bytes
__global__ void coalesced(float *data, float *out, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = data[idx] * 2.0f; // Consecutive threads, consecutive addresses
}
// STRIDED ACCESS: up to min(stride, 32) transactions per warp
// Thread 0 -> data[0], Thread 1 -> data[stride], Thread 2 -> data[2*stride], ...
__global__ void strided(float *data, float *out, int n, int stride) {
int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
out[idx] = data[idx] * 2.0f; // Stride-N access: each thread hits a different cache line
}
// RANDOM ACCESS: up to 32 transactions per warp
// Thread 0 -> data[random[0]], Thread 1 -> data[random[1]], ...
__global__ void random_access(float *data, float *out, int *indices, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = data[indices[idx]] * 2.0f; // Random: worst case 32 transactions
}
Global Memory Bandwidth: Coalescing Impact (A100)
| Access Pattern | Transactions per Warp | Achieved Bandwidth | % of Peak |
|---|---|---|---|
| Coalesced (stride 1) | 1 | ~1,800 GB/s | 88% |
| Stride 2 | 2 | ~900 GB/s | 44% |
| Stride 4 | 4 | ~450 GB/s | 22% |
| Stride 8 | 8 | ~230 GB/s | 11% |
| Stride 32 | 32 | ~60 GB/s | 3% |
| Random (scattered) | ~28 avg | ~70 GB/s | 3.5% |
The numbers are stark. Coalesced access achieves 88% of peak bandwidth. Random access achieves 3.5%. That is a 25x difference from a single optimization.
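The transaction counts in the table follow from cache-line geometry alone. A host-side sketch counting how many 128-byte lines one warp-wide float load touches at a given stride:

```cpp
#include <cassert>
#include <set>

// Number of 128-byte cache lines (memory transactions) touched when a
// warp of 32 threads each loads a 4-byte float at data[tid * stride].
int transactions_per_warp(int stride) {
    std::set<unsigned> lines;
    for (int tid = 0; tid < 32; ++tid)
        lines.insert((tid * stride * 4u) / 128); // which 128-byte line?
    return (int)lines.size();
}
```

Stride 1 gives 1 transaction; stride 32 gives 32, matching the 32x bandwidth collapse in the table.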
AoS vs SoA: The Data Layout That Determines Performance
The most common coalescing failure in real codebases comes from Array-of-Structures (AoS) vs Structure-of-Arrays (SoA) layout decisions.
AoS vs SoA Memory Layout for Particles
// AoS: Array of Structures -- natural for CPU, terrible for GPU
struct Particle {
float x, y, z, mass;
};
Particle *particles; // particles[i].x, particles[i].y, ...
__global__ void scale_positions_aos(Particle *p, float damp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Thread 0 reads p[0].x at address 0
        // Thread 1 reads p[1].x at address 16 (sizeof(Particle) = 16)
        // Stride of 4 floats! Only 25% of fetched bytes are useful.
        p[i].x *= damp;  // same pattern applies to y and z
    }
}
// SoA: Structure of Arrays -- optimal for GPU
struct ParticlesSoA {
    float *x, *y, *z, *mass;
};
__global__ void scale_positions_soa(ParticlesSoA p, float damp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Thread 0 reads x[0] at address 0
        // Thread 1 reads x[1] at address 4
        // Perfect stride-1 access! 100% of fetched bytes are useful.
        p.x[i] *= damp;  // same pattern applies to y and z
    }
}
AoS vs SoA Performance (Particle update, 10M particles, A100)
| Layout | Achieved Bandwidth | Useful Bandwidth | Speedup |
|---|---|---|---|
| AoS (stride 4) | ~1,600 GB/s total | ~400 GB/s useful | 1x |
| SoA (stride 1) | ~1,700 GB/s total | ~1,700 GB/s useful | 4x |
Array-of-Structures layout is the default in most C/C++ codebases because it is natural and cache-friendly on CPUs. On GPUs, it causes strided access patterns that waste 50-75% or more of memory bandwidth. When porting CPU code to CUDA, converting AoS to SoA is often the single highest-impact change you can make. Always default to SoA for GPU data structures.
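One practical migration step is converting buffers on the host before upload, so kernels see stride-1 arrays. A hypothetical helper (std::vector-based, field names mirroring the particle example above) sketching the AoS-to-SoA conversion:

```cpp
#include <cassert>
#include <vector>

// Host-side mirrors of the device layouts from the example above.
struct Particle { float x, y, z, mass; };
struct ParticlesSoAHost { std::vector<float> x, y, z, mass; };

// One-time O(n) reshuffle on the CPU; every subsequent GPU pass then
// gets coalesced stride-1 access instead of stride-4.
ParticlesSoAHost to_soa(const std::vector<Particle>& aos) {
    ParticlesSoAHost s;
    s.x.reserve(aos.size()); s.y.reserve(aos.size());
    s.z.reserve(aos.size()); s.mass.reserve(aos.size());
    for (const Particle& p : aos) {
        s.x.push_back(p.x); s.y.push_back(p.y);
        s.z.push_back(p.z); s.mass.push_back(p.mass);
    }
    return s;
}
```

The conversion cost is paid once; the coalescing benefit is paid back on every kernel launch that touches the data.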
Vectorized Loads: float4
Loading 4 floats at once (128 bits) instead of 1 float (32 bits) reduces instruction count and improves memory throughput:
// Scalar: each thread loads 1 float (4 bytes)
__global__ void kernel_scalar(float *in, float *out, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = in[idx] * 2.0f;
}
// Vectorized: each thread loads 4 floats (16 bytes)
__global__ void kernel_vec4(float4 *in, float4 *out, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float4 val = in[idx];
val.x *= 2.0f; val.y *= 2.0f; val.z *= 2.0f; val.w *= 2.0f;
out[idx] = val;
}
Bandwidth by Load Width (A100, elementwise kernel)
(chart: achieved GB/s for 4-, 8-, and 16-byte loads)

Use float4 loads whenever your data is 16-byte aligned and you process 4 contiguous elements per thread. This matches the GPU’s natural memory transaction width and reduces the number of load/store instructions by 4x. Most elementwise kernels (activations, normalization, residual connections) benefit from vectorization. The caveat: your array length must be divisible by 4, or you need a scalar tail loop.
The Canonical Example: Matrix Transpose
Matrix transpose is the textbook example of memory optimization because the naive version has fundamentally non-coalesced writes, and shared memory tiling fixes this elegantly.
The problem: Reading row-major input by rows is coalesced (consecutive threads read consecutive columns within the same row). But writing the transposed output by columns is non-coalesced (consecutive threads write to addresses that are N elements apart, where N is the matrix width).
// Naive transpose: coalesced reads, NON-coalesced writes
__global__ void transpose_naive(float *out, const float *in, int N) {
int x = blockIdx.x * 32 + threadIdx.x; // column
int y = blockIdx.y * 32 + threadIdx.y; // row
if (x < N && y < N) {
// Read: in[y * N + x] -- consecutive x for consecutive threads -> coalesced
// Write: out[x * N + y] -- consecutive x means stride-N writes -> NOT coalesced!
out[x * N + y] = in[y * N + x];
}
}
The fix: Use shared memory as a transposition buffer. Load a tile with coalesced reads, transpose it in shared memory (fast, on-chip), then write the transposed tile with coalesced writes.
// Tiled transpose: both reads AND writes are coalesced
__global__ void transpose_tiled(float *out, const float *in, int N) {
__shared__ float tile[32][33]; // Padded to avoid bank conflicts!
// Phase 1: Coalesced read from input into shared memory tile
int x_in = blockIdx.x * 32 + threadIdx.x;
int y_in = blockIdx.y * 32 + threadIdx.y;
if (x_in < N && y_in < N) {
tile[threadIdx.y][threadIdx.x] = in[y_in * N + x_in];
}
__syncthreads();
// Phase 2: Coalesced write from shared memory tile to output
// Note the swapped block indices!
int x_out = blockIdx.y * 32 + threadIdx.x;
int y_out = blockIdx.x * 32 + threadIdx.y;
if (x_out < N && y_out < N) {
// Read tile[threadIdx.x][threadIdx.y] -- column read, but padded so no bank conflict
// Write out[y_out * N + x_out] -- consecutive threadIdx.x -> consecutive x_out -> coalesced!
out[y_out * N + x_out] = tile[threadIdx.x][threadIdx.y];
}
}
The trick is subtle but powerful: by swapping blockIdx.x and blockIdx.y for the output coordinates, consecutive threads now write to consecutive output addresses. The transposition happens in shared memory, where the access pattern does not matter for coalescing (shared memory has no coalescing requirement — only bank conflicts matter, which we handle with padding).
Matrix Transpose Performance (4096x4096, V100)
| Implementation | Bandwidth (GB/s) | Efficiency | Key Detail |
|---|---|---|---|
| Naive (non-coalesced write) | 120 | 13% | Stride-N writes destroy coalescing |
| SMEM tiled (no padding) | 580 | 64% | Coalesced R/W but 32-way bank conflicts |
| SMEM tiled + padding [32][33] | 750 | 83% | Full coalescing, zero bank conflicts |
| V100 theoretical peak | 900 | 100% | Hardware limit |
The Optimization Hierarchy: What to Fix First
Not all memory optimizations are equal. Here is the priority order, with approximate impact ranges:
Cumulative Impact of Memory Optimizations (starting from naive kernel)
(chart: achieved bandwidth in GB/s, A100)

The message is clear:
- Fix coalescing first (~7-10x impact). Convert AoS to SoA. Ensure consecutive threads access consecutive memory addresses. This alone gets you from 10% to 75%+ of peak bandwidth.
- Add shared memory tiling (~2-15x for kernels with data reuse). If each element loaded from global memory is used more than once, tiling through shared memory eliminates redundant global loads. The benefit scales with the reuse factor.
- Vectorize loads (~1.1-1.25x). Use float4 where possible. Easy to implement, modest but consistent gain.
- Fix bank conflicts (~1.1-1.5x in shared memory-heavy kernels). Add padding to 2D shared memory arrays.
- Tune register usage (~1.0-1.3x). Use __launch_bounds__ to control occupancy. Eliminate spills. This is a fine-tuning step.
We have seen engineers spend days optimizing bank conflicts in shared memory while their kernel has a fundamental coalescing problem in global memory. Bank conflict optimization is worthless if the kernel is bottlenecked on non-coalesced global loads. Always profile first, identify the actual bottleneck, then optimize in priority order.
Practical Profiling with Nsight Compute
Nsight Compute (ncu) is the definitive tool for understanding GPU memory behavior. Here are the key metrics and what they tell you.
Step 1: Is Your Kernel Memory-Bound or Compute-Bound?
ncu --metrics \
sm__throughput.avg.pct_of_peak_sustained_active,\
gpu__dram_throughput.avg.pct_of_peak_sustained_active \
./my_kernel
- If gpu__dram_throughput is > 60% and sm__throughput is < 40%: memory-bound. Optimize memory access patterns.
- If sm__throughput is > 60% and gpu__dram_throughput is < 40%: compute-bound. Optimize arithmetic.
- If both are low: you have a latency problem (insufficient occupancy, synchronization overhead, or kernel launch overhead).
Step 2: Check Coalescing
ncu --metrics \
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum \
./my_kernel
The ratio sectors / requests tells you the average number of 32-byte sectors fetched per load request. For perfectly coalesced float loads, each warp request should generate exactly 4 sectors (128 bytes / 32 bytes per sector = 4). If the ratio is much higher (e.g., 16 or 32), you have a coalescing problem.
- Ideal: sectors/requests ≈ 4 for 4-byte types, ≈ 8 for 8-byte types
- Bad: sectors/requests ≈ 32 means essentially every thread is fetching its own 32-byte sector — the warp's accesses are fully scattered
Step 3: Check Memory Throughput
ncu --metrics \
dram__bytes_read.sum,\
dram__bytes_write.sum,\
gpu__time_duration.sum \
./my_kernel
Calculate achieved bandwidth:

achieved BW = (dram__bytes_read.sum + dram__bytes_write.sum) / gpu__time_duration.sum

Compare this to peak HBM bandwidth. Targets:
Bandwidth Efficiency Assessment
| Achieved % of Peak | Assessment | Action |
|---|---|---|
| over 80% | Excellent -- near hardware limit | Move on to compute optimization or algorithmic improvements |
| 60-80% | Good -- minor improvements possible | Check vectorized loads, alignment, L2 utilization |
| 30-60% | Mediocre -- significant waste | Likely coalescing issues or L2 thrashing |
| under 30% | Poor -- major access pattern problem | Fix coalescing first. Check for AoS layout or strided access. |
Step 4: Check Shared Memory Efficiency
ncu --metrics \
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum,\
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum,\
l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum \
./my_kernel
If bank conflicts are > 10% of wavefronts, apply the padding trick to your shared memory arrays.
Step 5: Check for Register Spilling
# At compile time:
nvcc -Xptxas -v my_kernel.cu 2>&1 | grep "spill"
# At runtime with ncu:
ncu --metrics \
l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum,\
l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum \
./my_kernel
If local memory load/store sectors are non-zero, the kernel is spilling registers. Add __launch_bounds__ or reduce per-thread register pressure.
Putting It All Together: A Real Optimization Walkthrough
Let us walk through optimizing a softmax kernel as a concrete example. Softmax computes softmax(x_i) = exp(x_i − max(x)) / Σ_j exp(x_j − max(x)) for each row of an input matrix. It requires three passes over each row: find the max, compute the exponentials and sum, normalize.
Version 1: Naive
__global__ void softmax_naive(float *input, float *output, int rows, int cols) {
int row = blockIdx.x;
if (row >= rows) return;
float *in_row = input + row * cols;
float *out_row = output + row * cols;
// Pass 1: find max (thread 0 only -- terrible!)
float max_val = -INFINITY;
if (threadIdx.x == 0) {
for (int j = 0; j < cols; j++) {
max_val = fmaxf(max_val, in_row[j]);
}
}
__shared__ float s_max;
if (threadIdx.x == 0) s_max = max_val;
__syncthreads();
// Pass 2: exp and sum (each thread handles one element -- poor reuse)
float sum = 0.0f;
for (int j = threadIdx.x; j < cols; j += blockDim.x) {
float val = expf(in_row[j] - s_max);
out_row[j] = val;
sum += val;
}
// Reduce sum across threads (naive: serial atomic)
__shared__ float s_sum;
if (threadIdx.x == 0) s_sum = 0.0f;
__syncthreads();
atomicAdd(&s_sum, sum);
__syncthreads();
// Pass 3: normalize
for (int j = threadIdx.x; j < cols; j += blockDim.x) {
out_row[j] /= s_sum;
}
}
Problems: Single-thread max reduction, 3 passes over global memory, atomic reduction, no vectorization.
Version 2: Optimized with Shared Memory, Warp Reductions, float4
__global__ void __launch_bounds__(256, 2)
softmax_optimized(float *input, float *output, int rows, int cols) {
int row = blockIdx.x;
if (row >= rows) return;
const int tid = threadIdx.x;
const int BLOCK = blockDim.x; // 256
    const int vec_cols = cols / 4; // assumes cols % 4 == 0 and 16-byte aligned rows
float4 *in_row = reinterpret_cast<float4*>(input + row * cols);
float4 *out_row = reinterpret_cast<float4*>(output + row * cols);
// Pass 1: find row max (all threads participate, vectorized loads)
float local_max = -INFINITY;
for (int j = tid; j < vec_cols; j += BLOCK) {
float4 v = in_row[j]; // Coalesced float4 load
local_max = fmaxf(local_max, fmaxf(fmaxf(v.x, v.y), fmaxf(v.z, v.w)));
}
// Warp-level reduction for max (no shared memory needed!)
for (int offset = 16; offset > 0; offset >>= 1) {
local_max = fmaxf(local_max, __shfl_down_sync(0xffffffff, local_max, offset));
}
// Block-level reduction via shared memory
__shared__ float s_warp_max[8]; // 256 threads / 32 = 8 warps
int warp_id = tid / 32;
int lane = tid % 32;
if (lane == 0) s_warp_max[warp_id] = local_max;
__syncthreads();
float row_max;
if (tid < 8) {
row_max = s_warp_max[tid];
for (int offset = 4; offset > 0; offset >>= 1) {
row_max = fmaxf(row_max, __shfl_down_sync(0xff, row_max, offset));
}
if (tid == 0) s_warp_max[0] = row_max;
}
__syncthreads();
row_max = s_warp_max[0];
// Pass 2: compute exp(x - max) and partial sums (vectorized)
float local_sum = 0.0f;
for (int j = tid; j < vec_cols; j += BLOCK) {
float4 v = in_row[j];
v.x = expf(v.x - row_max);
v.y = expf(v.y - row_max);
v.z = expf(v.z - row_max);
v.w = expf(v.w - row_max);
out_row[j] = v; // Coalesced float4 store
local_sum += v.x + v.y + v.z + v.w;
}
// Reduce sum (same warp + shared memory pattern as above)
for (int offset = 16; offset > 0; offset >>= 1) {
local_sum += __shfl_down_sync(0xffffffff, local_sum, offset);
}
__shared__ float s_warp_sum[8];
if (lane == 0) s_warp_sum[warp_id] = local_sum;
__syncthreads();
float row_sum;
if (tid < 8) {
row_sum = s_warp_sum[tid];
for (int offset = 4; offset > 0; offset >>= 1) {
row_sum += __shfl_down_sync(0xff, row_sum, offset);
}
if (tid == 0) s_warp_sum[0] = row_sum;
}
__syncthreads();
float inv_sum = 1.0f / s_warp_sum[0];
// Pass 3: normalize (vectorized)
for (int j = tid; j < vec_cols; j += BLOCK) {
float4 v = out_row[j];
v.x *= inv_sum;
v.y *= inv_sum;
v.z *= inv_sum;
v.w *= inv_sum;
out_row[j] = v;
}
}
Softmax Optimization Results (1024x4096, A100)
| Version | Time (us) | Bandwidth (GB/s) | Technique |
|---|---|---|---|
| Naive | 89 | 380 | Serial max, atomics, scalar loads |
| + Parallel max | 42 | 800 | All threads participate in reduction |
| + Warp shuffles | 35 | 960 | Eliminate shared memory for intra-warp reduction |
| + float4 loads | 28 | 1,200 | 4x fewer load instructions |
| + __launch_bounds__ | 25 | 1,340 | Better register allocation, 50% occupancy |
| PyTorch built-in | 27 | 1,240 | Reference (CUB-based) |
The optimized kernel achieves 1,340 GB/s on an A100 with 2,039 GB/s peak — 66% of peak bandwidth. This is close to the practical limit for a 3-pass algorithm: the input is read twice (passes 1 and 2) and the output is written twice (passes 2 and 3), so the effective data movement is ~4x the matrix size. The remaining gap is from instruction overhead and imperfect occupancy.
Texture and Constant Memory
These are specialized memory types worth mentioning briefly.
Constant Memory
Constant memory is a 64 KB read-only region cached in a dedicated constant cache. Its key property: when all threads in a warp read the same address, the value is broadcast to all threads in a single cycle. When threads read different addresses, accesses are serialized.
__constant__ float filter_weights[256];
__global__ void apply_filter(float *data, float *output, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// All threads read the same weight -- broadcast, very fast
float w = filter_weights[0];
output[idx] = data[idx] * w;
}
Use case: Kernel parameters, lookup tables, filter coefficients — any small, read-only data that all threads access uniformly. If threads access different addresses, constant memory is slower than global memory due to serialization.
Texture Memory
Texture memory provides cached read-only access optimized for 2D spatial locality. The texture cache is separate from L1/L2 and uses a space-filling curve layout to keep 2D-neighboring elements close in cache.
// Modern CUDA uses texture objects (not the deprecated texture references)
cudaTextureObject_t tex;
// ... setup code ...
__global__ void texture_kernel(cudaTextureObject_t tex, float *output, int w, int h) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
// Hardware handles boundary conditions and interpolation
float val = tex2D<float>(tex, x + 0.5f, y + 0.5f);
output[y * w + x] = val;
}
Use case: Image processing, volumetric rendering, any workload with 2D or 3D spatial access patterns. For linear algebra and ML workloads, texture memory is rarely useful — global memory with proper coalescing is faster.
Common Anti-Patterns and Fixes
Common Memory Anti-Patterns and Their Fixes
| Anti-Pattern | Achieved BW | Fix | Fixed BW | Speedup |
|---|---|---|---|---|
| AoS layout | ~400 GB/s | Convert to SoA | ~1,600 GB/s | 4x |
| Column-major access of row-major data | ~200 GB/s | Transpose or tile through SMEM | ~1,500 GB/s | 7.5x |
| Scalar loads (float) | ~1,400 GB/s | Vectorize to float4 | ~1,750 GB/s | 1.25x |
| Misaligned base address | ~1,400 GB/s | Align allocations to 128 bytes | ~1,750 GB/s | 1.25x |
| Small irregular gathers | ~100 GB/s | Sort indices + batch into coalesced runs | ~800 GB/s | 8x |
| 32-way SMEM bank conflict | ~90 GB/s | Pad array: [N][N+1] | ~2,800 GB/s | 31x |
| Register spilling | Varies | __launch_bounds__, reduce local vars | Varies | 1.5-5x |
Summary
GPU memory optimization is not a dark art — it follows a clear, measurable hierarchy of priorities:
1. Understand the hierarchy. Registers (free) > shared memory (cheap) > L1/L2 cache (moderate) > global memory (expensive). The goal is to minimize traffic to the expensive levels.
2. Fix coalescing first. This is the 10x optimization. Ensure consecutive threads access consecutive addresses. Convert AoS to SoA. This single change often transforms a kernel from 10% to 80% of peak bandwidth.
3. Add shared memory tiling where there is data reuse. If each global memory element is used more than once, load it into shared memory once and reuse it. Pad arrays as `[N][N+1]` to eliminate bank conflicts.
4. Vectorize loads. Use `float4` for aligned, contiguous data. This reduces instruction count and improves memory throughput.
5. Tune register allocation. Use `__launch_bounds__` to control occupancy. Check for and eliminate register spilling. This is a fine-tuning step, not a first step.
6. Profile with Nsight Compute. Do not guess. Measure sectors/requests for coalescing, check DRAM throughput percentage, look at bank conflicts, verify absence of register spills. The profiler tells you exactly where the bottleneck is.
7. Think about cache. Size your working set to fit in L2 when possible. Use L2 persistence hints for hot data on Ampere+. Be aware that the 128-byte cache line size means every misaligned or scattered access wastes cache capacity.
The memory hierarchy has not changed in principle across three generations of GPUs. HBM gets faster, caches get larger, shared memory grows — but the relative gaps remain. Registers are still ~1000x faster than HBM. Coalescing still determines whether you use 3% or 90% of peak bandwidth. The engineer who understands memory will always write faster kernels than the engineer who does not.