A naive matrix transpose kernel achieves 120 GB/s on a V100 (13% of HBM peak) because half the writes are non-coalesced. Tile the transpose through shared memory and you hit 750 GB/s, 6.3x faster. A naive GEMM hits 45 GB/s; tile through shared memory and you hit 680 GB/s, 15x faster. Shared memory is the ~3 TB/s scratchpad between registers and HBM that transforms memory-bound kernels into compute-bound ones, but only if you avoid bank conflicts and structure accesses correctly.
Why Shared Memory Matters
The core value proposition: data reuse without HBM round-trips. Load data from global memory into shared memory once, then access it many times at roughly 3x HBM bandwidth and a fraction of the latency.
[Chart: Kernel Performance With vs Without Shared Memory Tiling, GB/s effective bandwidth]

The GEMM result is striking: tiling with shared memory gives a 15x speedup because it converts O(N) global memory accesses per output element into O(N/TILE) accesses, with the rest served from shared memory.
Bank Conflicts: The Hidden Performance Killer
Shared memory is organized into 32 banks, each 4 bytes wide. In a single cycle, 32 threads in a warp can each access a different bank simultaneously. But when multiple threads access the same bank (different addresses within that bank), accesses are serialized.
Bank Conflict Impact on Shared Memory Bandwidth (V100)
| Conflict Degree | Access Pattern | Effective Bandwidth | Penalty |
|---|---|---|---|
| None (1-way) | Stride 1 -- consecutive | 2.8 TB/s | 1x (optimal) |
| 2-way | Stride 2 | 1.4 TB/s | 2x slower |
| 4-way | Stride 4 | 0.7 TB/s | 4x slower |
| 8-way | Stride 8 | 0.35 TB/s | 8x slower |
| 32-way | Stride 32 (all same bank) | 0.09 TB/s | 32x slower |
The bank for byte address `addr` is `bank = (addr / 4) % 32`. Two threads conflict when they access different 4-byte words in the same bank; if they access the same word, the value is broadcast to all requesting threads with no penalty.
Detecting Bank Conflicts
```bash
# Nsight Compute reports bank conflicts directly
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared \
    ./my_kernel
```
If l1tex__data_bank_conflicts_pipe_lsu_mem_shared is high (over 10% of shared memory accesses), you have a bank conflict problem.
Fixing Bank Conflicts: The Padding Trick
The most common bank conflict occurs when accessing columns of a 2D shared memory array:
```cuda
// PROBLEM: column access causes 32-way bank conflicts
__shared__ float tile[32][32];
float val = tile[threadIdx.x][col]; // Bank = (threadIdx.x*32 + col) % 32 = col % 32 -- same bank for every lane!

// FIX: pad each row by 1 element
__shared__ float tile[32][33];      // 33 instead of 32
float val = tile[threadIdx.x][col]; // Bank = (threadIdx.x*33 + col) % 32 = (threadIdx.x + col) % 32 -- all different!
```
The single extra element per row shifts each row's alignment, breaking the bank conflict pattern. Cost: 32 extra floats (128 bytes) per tile, which is negligible.
For an NxN shared memory tile accessed by column (with N a multiple of 32, as tiles usually are), declare it as [N][N+1]: the odd row pitch is coprime with the 32 banks, eliminating all column-access bank conflicts at the cost of ~4N bytes of padding.
Tiling Strategy: Matrix Transpose Example
Matrix transpose is the canonical shared memory example because naive transpose has non-coalesced writes:
```cuda
// Naive: coalesced reads, non-coalesced writes (strided)
__global__ void transpose_naive(float *out, float *in, int N) {
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    out[x * N + y] = in[y * N + x];  // Write stride = N (non-coalesced!)
}

// Tiled: use shared memory to convert non-coalesced writes to coalesced
__global__ void transpose_tiled(float *out, float *in, int N) {
    __shared__ float tile[32][33];   // Padded to avoid bank conflicts
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];  // Coalesced read
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;  // Note: swapped block indices
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];  // Coalesced write!
}
```
The trick: load a 32x32 tile from global memory with coalesced reads, then read it back out of shared memory with swapped thread indices (fast, and conflict-free with padding) so the global write is also coalesced. Both reads and writes now touch consecutive addresses.
Matrix Transpose Performance (4096x4096, V100)
| Implementation | Bandwidth (GB/s) | Efficiency | Bank Conflicts |
|---|---|---|---|
| Naive (non-coalesced write) | 120 | 13% | N/A |
| SMEM tiled (no padding) | 580 | 64% | 32-way on writes |
| SMEM tiled + padding | 750 | 83% | None |
| V100 theoretical peak | 900 | 100% | -- |
The padding alone gives a 29% speedup over the unpadded tiled version. Combined with tiling, total speedup vs naive is 6.3x.
Shared Memory Configuration
On Ampere and later GPUs, shared memory and L1 cache share a configurable pool per SM (192 KB on the A100):
Shared Memory Configurations (A100)
| Config | Shared Memory | L1 Cache | Best For |
|---|---|---|---|
| Default | 48 KB | 144 KB | Most kernels |
| Max SMEM | 164 KB | 28 KB | Large tiles (GEMM, FlashAttention) |
| Max L1 | 28 KB | 164 KB | Streaming kernels with poor reuse |
FlashAttention uses 160KB+ of shared memory per block to hold Q, K, V, and output tiles simultaneously; it explicitly requests the maximum shared memory configuration.
Using more shared memory per block reduces the number of concurrent blocks per SM (lower occupancy). This is often worthwhile: a kernel running at 50% occupancy with large SMEM tiles can outperform 100% occupancy with no tiling, because the bandwidth savings from data reuse outweigh the loss of latency hiding.
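Requesting the large configuration looks roughly like the following sketch (not FlashAttention's actual code; the kernel name and the 160 KB figure are illustrative). Per-block shared memory above the 48 KB default must be allocated dynamically and opted into with `cudaFuncSetAttribute`:

```cuda
#include <cuda_runtime.h>

__global__ void attention_tile_kernel(float *out);   // hypothetical kernel

void launch(float *out) {
    const int smem_bytes = 160 * 1024;
    // Raise the kernel's dynamic shared memory limit above the 48 KB default.
    cudaFuncSetAttribute(attention_tile_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_bytes);
    // Prefer the SMEM-heavy split of the combined SMEM/L1 pool.
    cudaFuncSetAttribute(attention_tile_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    // Third launch parameter passes the dynamic shared memory size.
    attention_tile_kernel<<<dim3(1), dim3(256), smem_bytes>>>(out);
}
```

The carveout attribute is a hint; the driver picks the nearest supported split.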
Common Patterns
Reduction: Load N elements into shared memory, then reduce in parallel in log2(N) steps, halving the active threads each step. Use sequential addressing (not interleaved) to avoid bank conflicts in the later steps.
Stencil/convolution: Load a tile plus halo region into shared memory. Each thread computes its output from the cached neighborhood. Avoids redundant global memory loads for overlapping regions.
Histogram: Use shared memory for per-block partial histograms, then reduce across blocks. Avoids atomic contention on global memory.
Conclusion
Shared memory optimization follows a clear pattern: identify redundant global memory accesses, load data into SMEM tiles, pad arrays to eliminate bank conflicts, and synchronize threads between load and compute phases. The payoff ranges from 3x (reductions) to 15x (GEMM tiling) depending on the data reuse factor. Always verify with Nsight Compute that bank conflicts are eliminated; a single 32-way conflict can negate the entire benefit of tiling.