A naive matrix transpose kernel achieves 120 GB/s on a V100 -- 13% of HBM peak -- because half the writes are non-coalesced. Tile the transpose through shared memory and you hit 750 GB/s -- 6.3x faster. A naive GEMM hits 45 GB/s. Tile through shared memory and you hit 680 GB/s -- 15x faster. Shared memory is the ~3 TB/s scratchpad between registers and HBM that transforms memory-bound kernels into compute-bound ones, but only if you avoid bank conflicts and structure accesses correctly.

Why Shared Memory Matters

The core value proposition: data reuse without HBM round-trips. Load data from global memory into shared memory once, then access it many times at 3x the bandwidth.

Kernel Performance: With vs Without Shared Memory Tiling

| Kernel | Effective bandwidth | Speedup |
|---|---|---|
| Matrix transpose (naive, non-coalesced writes) | 120 GB/s | 1x (baseline) |
| Matrix transpose (SMEM tiled) | 750 GB/s | 6.3x |
| Reduction (naive) | 280 GB/s | 1x (baseline) |
| Reduction (SMEM) | 820 GB/s | 2.9x |
| GEMM (naive) | 45 GB/s | 1x (baseline) |
| GEMM (SMEM tiled) | 680 GB/s | 15x |

The GEMM result is striking: tiling with shared memory gives 15x speedup because it converts O(N) global memory accesses per output element into O(N/TILE) accesses, with the rest served from SMEM.
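That access-count argument can be made concrete with the classic tiled-GEMM sketch below. This is illustrative, not the benchmarked kernel; it assumes square N x N matrices with N divisible by the tile size, and a 32x32 thread block:

```cuda
#define TILE 32

// Each block computes one TILE x TILE tile of C, staging tiles of A and B
// through shared memory. Each global element is now loaded N/TILE times
// (once per tile step) instead of N times.
// Assumes blockDim = (TILE, TILE) and N % TILE == 0.
__global__ void gemm_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Coalesced loads of one tile of A and one tile of B
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done with the tile before the next load
    }
    C[row * N + col] = acc;
}
```

Note that no padding is needed here: As[threadIdx.y][k] is a warp-wide broadcast of one word, and Bs[k][threadIdx.x] walks consecutive banks.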

Bank Conflicts: The Hidden Performance Killer

Shared memory is organized into 32 banks, each 4 bytes wide. In a single cycle, all 32 threads of a warp can load simultaneously if each hits a different bank. But when multiple threads access different 4-byte words that map to the same bank, the accesses are serialized. (Threads reading the *same* word are served by a single broadcast and do not conflict.)


Bank Conflict Impact on Shared Memory Bandwidth (V100)

| Conflict degree | Access pattern | Effective bandwidth | Penalty |
|---|---|---|---|
| None (1-way) | Stride 1 (consecutive) | 2.8 TB/s | 1x (optimal) |
| 2-way | Stride 2 | 1.4 TB/s | 2x slower |
| 4-way | Stride 4 | 0.7 TB/s | 4x slower |
| 8-way | Stride 8 | 0.35 TB/s | 8x slower |
| 32-way | Stride 32 (all same bank) | 0.09 TB/s | 32x slower |
Note: A 32-way bank conflict serializes all 32 threads -- shared memory becomes slower than L2 cache.

The bank for address addr is: bank = (addr / 4) % 32. Two threads conflict when they access different 4-byte words in the same bank.

Detecting Bank Conflicts

# Nsight Compute reports bank conflicts directly
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared \
    ./my_kernel

If l1tex__data_bank_conflicts_pipe_lsu_mem_shared is high (over 10% of shared memory accesses), you have a bank conflict problem.

Fixing Bank Conflicts: The Padding Trick

The most common bank conflict occurs when accessing columns of a 2D shared memory array:

// PROBLEM: column access causes a 32-way bank conflict
__shared__ float tile[32][32];
float val = tile[threadIdx.x][col];  // bank = (threadIdx.x*32 + col) % 32 = col for every thread

// FIX: pad each row by 1 element
__shared__ float tile[32][33];  // 33 instead of 32
float val = tile[threadIdx.x][col];  // bank = (threadIdx.x*33 + col) % 32 = (threadIdx.x + col) % 32 -- all 32 distinct!

The single extra element per row shifts each row's alignment, breaking the bank conflict pattern. Cost: 32 extra floats (128 bytes) per tile -- negligible.

💡 The Universal Fix

For any NxN shared memory array accessed by column, declare it as [N][N+1]. This eliminates all column-access bank conflicts at the cost of ~Nx4 bytes of padding.

Tiling Strategy: Matrix Transpose Example

Matrix transpose is the canonical shared memory example because naive transpose has non-coalesced writes:

// Naive: coalesced reads, non-coalesced writes (strided)
__global__ void transpose_naive(float *out, float *in, int N) {
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    out[x * N + y] = in[y * N + x];  // Write stride = N (non-coalesced!)
}

// Tiled: use shared memory to convert non-coalesced writes to coalesced
__global__ void transpose_tiled(float *out, float *in, int N) {
    __shared__ float tile[32][33];  // Padded to avoid bank conflicts
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * N + x];  // Coalesced read
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;  // Note: swapped block indices
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];  // Coalesced write!
}

The trick: load a 32x32 tile from global memory with coalesced reads, then read it back out of shared memory with swapped thread indices (fast, and conflict-free thanks to the padding) so that the global write is also coalesced. Both the global read and the global write now touch consecutive addresses.


Matrix Transpose Performance (4096x4096, V100)

| Implementation | Bandwidth (GB/s) | Efficiency | Bank conflicts |
|---|---|---|---|
| Naive (non-coalesced write) | 120 | 13% | N/A |
| SMEM tiled (no padding) | 580 | 64% | 32-way on column accesses |
| SMEM tiled + padding | 750 | 83% | None |
| V100 theoretical peak | 900 | 100% | -- |

The padding alone gives a 29% speedup over the unpadded tiled version. Combined with tiling, total speedup vs naive is 6.3x.

Shared Memory Configuration

On an A100 (Ampere), shared memory and L1 cache share a configurable 192 KB pool per SM:


Shared Memory Configurations (A100)

| Config | Shared memory | L1 cache | Best for |
|---|---|---|---|
| Default | 48 KB | 144 KB | Most kernels |
| Max SMEM | 164 KB | 28 KB | Large tiles (GEMM, FlashAttention) |
| Max L1 | 28 KB | 164 KB | Streaming kernels with poor reuse |
Note: Opt in to more than 48 KB of dynamic shared memory per block with cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes)

FlashAttention uses 160KB+ of shared memory per block to hold Q, K, V, and output tiles simultaneously -- it explicitly requests maximum shared memory configuration.
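Requesting the large configuration looks roughly like this. A sketch: `big_tile_kernel` and `SMEM_BYTES` are illustrative names, and error checking is omitted:

```cuda
constexpr int SMEM_BYTES = 164 * 1024;  // A100 per-block maximum

__global__ void big_tile_kernel(const float *in, float *out) {
    extern __shared__ float smem[];  // sized by the launch configuration below
    // ... stage large tiles through smem ...
}

void launch(const float *in, float *out, int num_blocks) {
    // Without this opt-in, launches requesting more than 48 KB of
    // dynamic shared memory fail with an invalid-argument error.
    cudaFuncSetAttribute(big_tile_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         SMEM_BYTES);
    big_tile_kernel<<<num_blocks, 256, SMEM_BYTES>>>(in, out);
}
```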

ℹ️ The Occupancy Trade-off

Using more shared memory per block reduces the number of concurrent blocks per SM (lower occupancy). This is often worthwhile: a kernel running at 50% occupancy with large SMEM tiles can outperform 100% occupancy with no tiling, because the bandwidth savings from data reuse outweigh the loss of latency hiding.
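You can measure this trade-off before committing: the occupancy API reports how many blocks fit per SM for a given shared memory budget. A sketch, with `my_kernel` as a placeholder:

```cuda
#include <cstdio>

__global__ void my_kernel(const float *in, float *out) { /* placeholder */ }

void report_occupancy() {
    int max_blocks = 0;
    // How many 256-thread blocks fit per SM if each requests 96 KB of SMEM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks, my_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/96 * 1024);
    printf("resident blocks per SM: %d\n", max_blocks);
}
```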

Common Patterns

Reduction: Load N elements into shared memory, reduce in parallel with log(N) steps. Each step halves the active threads. Use sequential addressing (not interleaved) to avoid bank conflicts in later steps.
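A minimal sequential-addressing block reduction looks like this (a sketch assuming 256-thread blocks; it produces one partial sum per block):

```cuda
// Each block sums 256 elements of `in` into one entry of `out`.
// Assumes blockDim.x == 256.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Sequential addressing: active threads read sdata[tid + s], so a warp
    // touches consecutive words (distinct banks) at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```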

Stencil/convolution: Load a tile plus halo region into shared memory. Each thread computes its output from the cached neighborhood. Avoids redundant global memory loads for overlapping regions.
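A 1D version of the halo pattern can be sketched as follows (`RADIUS` and `BLOCK` are illustrative parameters):

```cuda
#define RADIUS 3
#define BLOCK 256

// Each block caches BLOCK elements plus a RADIUS halo on each side, so
// overlapping neighborhoods are read from SMEM instead of re-fetched from HBM.
// Assumes blockDim.x == BLOCK.
__global__ void stencil_1d(const float *in, float *out, int n) {
    __shared__ float s[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + RADIUS;

    s[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {  // first RADIUS threads also load the halos
        int left = g - RADIUS, right = g + BLOCK;
        s[l - RADIUS] = (left >= 0) ? in[left] : 0.0f;
        s[l + BLOCK]  = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();

    float acc = 0.0f;
    for (int o = -RADIUS; o <= RADIUS; o++) acc += s[l + o];
    if (g < n) out[g] = acc;
}
```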

Histogram: Use shared memory for per-block partial histograms, then reduce across blocks. Avoids atomic contention on global memory.
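A sketch of the histogram pattern for 8-bit data (256 bins assumed; `hist` must be zero-initialized before launch):

```cuda
#define BINS 256

// Per-block partial histograms: atomics hit fast shared memory in the hot
// loop, and each block flushes its private histogram to global memory once.
__global__ void histogram(const unsigned char *data, unsigned int *hist, int n) {
    __shared__ unsigned int local[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop over the input; contention stays within the block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // One global atomicAdd per bin per block, instead of one per element.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```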

Conclusion

Shared memory optimization follows a clear pattern: identify redundant global memory accesses, load data into SMEM tiles, pad arrays to eliminate bank conflicts, and synchronize threads between load and compute phases. The payoff ranges from 3x (reductions) to 15x (GEMM tiling) depending on the data reuse factor. Always verify with Nsight Compute that bank conflicts are eliminated -- a single 32-way conflict can negate the entire benefit of tiling.