A naive matrix transpose kernel achieves 120 GB/s on a V100 -- 13% of HBM peak -- because half the writes are non-coalesced. Tile the transpose through shared memory and you hit 750 GB/s -- 6.3x faster. A naive GEMM hits 45 GB/s. Tile through shared memory and you hit 680 GB/s -- 15x faster. Shared memory is the ~3 TB/s scratchpad between registers and HBM that transforms memory-bound kernels into compute-bound ones, but only if you avoid bank conflicts and structure accesses correctly.

Why Shared Memory Matters

The core value proposition: data reuse without HBM round-trips. Load data from global memory into shared memory once, then access it many times at 3x the bandwidth.

Kernel Performance: With vs Without Shared Memory Tiling

| Kernel | Effective bandwidth | Speedup |
|---|---|---|
| Matrix transpose (naive, non-coalesced writes) | 120 GB/s | 1x (baseline) |
| Matrix transpose (SMEM tiled) | 750 GB/s | 6.3x |
| Reduction (naive) | 280 GB/s | 1x (baseline) |
| Reduction (SMEM) | 820 GB/s | 2.9x |
| GEMM (naive) | 45 GB/s | 1x (baseline) |
| GEMM (SMEM tiled) | 680 GB/s | 15x |

The GEMM result is striking: tiling with shared memory gives 15x speedup because it converts O(N) global memory accesses per output element into O(N/TILE) accesses, with the rest served from SMEM.
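That access-count argument can be made concrete with the classic tiled-GEMM sketch below. This is illustrative, not the benchmarked kernel; it assumes square N x N matrices with N divisible by the tile size, and a 32x32 thread block:

```cuda
#define TILE 32

// Each block computes one TILE x TILE tile of C, staging tiles of A and B
// through shared memory. Each global element is now loaded N/TILE times
// (once per tile step) instead of N times.
// Assumes blockDim = (TILE, TILE) and N % TILE == 0.
__global__ void gemm_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Coalesced loads of one tile of A and one tile of B
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done with the tile before the next load
    }
    C[row * N + col] = acc;
}
```

Note that no padding is needed here: As[threadIdx.y][k] is a warp-wide broadcast of one word, and Bs[k][threadIdx.x] walks consecutive banks.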

Bank Conflicts: The Hidden Performance Killer

Shared memory is organized into 32 banks, each 4 bytes wide. In a single cycle, all 32 threads of a warp can load simultaneously if each hits a different bank. But when multiple threads access different 4-byte words that map to the same bank, the accesses are serialized. (Threads reading the *same* word are served by a single broadcast and do not conflict.)


Bank Conflict Impact on Shared Memory Bandwidth (V100)

| Conflict degree | Access pattern | Effective bandwidth | Penalty |
|---|---|---|---|
| None (1-way) | Stride 1 (consecutive) | 2.8 TB/s | 1x (optimal) |
| 2-way | Stride 2 | 1.4 TB/s | 2x slower |
| 4-way | Stride 4 | 0.7 TB/s | 4x slower |
| 8-way | Stride 8 | 0.35 TB/s | 8x slower |
| 32-way | Stride 32 (all same bank) | 0.09 TB/s | 32x slower |
Note: A 32-way bank conflict serializes all 32 threads -- shared memory becomes slower than L2 cache.

The bank for address addr is: bank = (addr / 4) % 32. Two threads conflict when they access different 4-byte words in the same bank.

Detecting Bank Conflicts

# Nsight Compute reports bank conflicts directly
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared \
    ./my_kernel

If l1tex__data_bank_conflicts_pipe_lsu_mem_shared is high (over 10% of shared memory accesses), you have a bank conflict problem.

Fixing Bank Conflicts: The Padding Trick

The most common bank conflict occurs when accessing columns of a 2D shared memory array:

// PROBLEM: column access causes a 32-way bank conflict
__shared__ float tile[32][32];
float val = tile[threadIdx.x][col];  // bank = (threadIdx.x*32 + col) % 32 = col for every thread

// FIX: pad each row by 1 element
__shared__ float tile[32][33];  // 33 instead of 32
float val = tile[threadIdx.x][col];  // bank = (threadIdx.x*33 + col) % 32 = (threadIdx.x + col) % 32 -- all 32 distinct!

The single extra element per row shifts each row's alignment, breaking the bank conflict pattern. Cost: 32 extra floats (128 bytes) per tile -- negligible.

💡 The Universal Fix

For any NxN shared memory array accessed by column, declare it as [N][N+1]. This eliminates all column-access bank conflicts at the cost of ~Nx4 bytes of padding.

Tiling Strategy: Matrix Transpose Example

Matrix transpose is the canonical shared memory example because naive transpose has non-coalesced writes:

// Naive: coalesced reads, non-coalesced writes (strided)
__global__ void transpose_naive(float *out, float *in, int N) {
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    out[x * N + y] = in[y * N + x];  // Write stride = N (non-coalesced!)
}

// Tiled: use shared memory to convert non-coalesced writes to coalesced
__global__ void transpose_tiled(float *out, float *in, int N) {
    __shared__ float tile[32][33];  // Padded to avoid bank conflicts
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * N + x];  // Coalesced read
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;  // Note: swapped block indices
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];  // Coalesced write!
}

The trick: load a 32x32 tile from global memory with coalesced reads, then read it back out of shared memory with swapped thread indices (fast, and conflict-free thanks to the padding) so that the global write is also coalesced. Both the global read and the global write now touch consecutive addresses.


Matrix Transpose Performance (4096x4096, V100)

| Implementation | Bandwidth (GB/s) | Efficiency | Bank conflicts |
|---|---|---|---|
| Naive (non-coalesced write) | 120 | 13% | N/A |
| SMEM tiled (no padding) | 580 | 64% | 32-way on column accesses |
| SMEM tiled + padding | 750 | 83% | None |
| V100 theoretical peak | 900 | 100% | -- |

The padding alone gives a 29% speedup over the unpadded tiled version. Combined with tiling, total speedup vs naive is 6.3x.

Shared Memory Configuration

On an A100 (Ampere), shared memory and L1 cache share a configurable 192 KB pool per SM:


Shared Memory Configurations (A100)

| Config | Shared memory | L1 cache | Best for |
|---|---|---|---|
| Default | 48 KB | 144 KB | Most kernels |
| Max SMEM | 164 KB | 28 KB | Large tiles (GEMM, FlashAttention) |
| Max L1 | 28 KB | 164 KB | Streaming kernels with poor reuse |
Note: Opt in to more than 48 KB of dynamic shared memory per block with cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes)

FlashAttention uses 160KB+ of shared memory per block to hold Q, K, V, and output tiles simultaneously -- it explicitly requests maximum shared memory configuration.
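Requesting the large configuration looks roughly like this. A sketch: `big_tile_kernel` and `SMEM_BYTES` are illustrative names, and error checking is omitted:

```cuda
constexpr int SMEM_BYTES = 164 * 1024;  // A100 per-block maximum

__global__ void big_tile_kernel(const float *in, float *out) {
    extern __shared__ float smem[];  // sized by the launch configuration below
    // ... stage large tiles through smem ...
}

void launch(const float *in, float *out, int num_blocks) {
    // Without this opt-in, launches requesting more than 48 KB of
    // dynamic shared memory fail with an invalid-argument error.
    cudaFuncSetAttribute(big_tile_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         SMEM_BYTES);
    big_tile_kernel<<<num_blocks, 256, SMEM_BYTES>>>(in, out);
}
```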

ℹ️ The Occupancy Trade-off

Using more shared memory per block reduces the number of concurrent blocks per SM (lower occupancy). This is often worthwhile: a kernel running at 50% occupancy with large SMEM tiles can outperform 100% occupancy with no tiling, because the bandwidth savings from data reuse outweigh the loss of latency hiding.
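You can measure this trade-off before committing: the occupancy API reports how many blocks fit per SM for a given shared memory budget. A sketch, with `my_kernel` as a placeholder:

```cuda
#include <cstdio>

__global__ void my_kernel(const float *in, float *out) { /* placeholder */ }

void report_occupancy() {
    int max_blocks = 0;
    // How many 256-thread blocks fit per SM if each requests 96 KB of SMEM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks, my_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/96 * 1024);
    printf("resident blocks per SM: %d\n", max_blocks);
}
```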

Common Patterns

Reduction: Load N elements into shared memory, reduce in parallel with log(N) steps. Each step halves the active threads. Use sequential addressing (not interleaved) to avoid bank conflicts in later steps.
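A minimal sequential-addressing block reduction looks like this (a sketch assuming 256-thread blocks; it produces one partial sum per block):

```cuda
// Each block sums 256 elements of `in` into one entry of `out`.
// Assumes blockDim.x == 256.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Sequential addressing: active threads read sdata[tid + s], so a warp
    // touches consecutive words (distinct banks) at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```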

Stencil/convolution: Load a tile plus halo region into shared memory. Each thread computes its output from the cached neighborhood. Avoids redundant global memory loads for overlapping regions.
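A 1D version of the halo pattern can be sketched as follows (`RADIUS` and `BLOCK` are illustrative parameters):

```cuda
#define RADIUS 3
#define BLOCK 256

// Each block caches BLOCK elements plus a RADIUS halo on each side, so
// overlapping neighborhoods are read from SMEM instead of re-fetched from HBM.
// Assumes blockDim.x == BLOCK.
__global__ void stencil_1d(const float *in, float *out, int n) {
    __shared__ float s[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int l = threadIdx.x + RADIUS;

    s[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {  // first RADIUS threads also load the halos
        int left = g - RADIUS, right = g + BLOCK;
        s[l - RADIUS] = (left >= 0) ? in[left] : 0.0f;
        s[l + BLOCK]  = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();

    float acc = 0.0f;
    for (int o = -RADIUS; o <= RADIUS; o++) acc += s[l + o];
    if (g < n) out[g] = acc;
}
```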

Histogram: Use shared memory for per-block partial histograms, then reduce across blocks. Avoids atomic contention on global memory.
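A sketch of the histogram pattern for 8-bit data (256 bins assumed; `hist` must be zero-initialized before launch):

```cuda
#define BINS 256

// Per-block partial histograms: atomics hit fast shared memory in the hot
// loop, and each block flushes its private histogram to global memory once.
__global__ void histogram(const unsigned char *data, unsigned int *hist, int n) {
    __shared__ unsigned int local[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop over the input; contention stays within the block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // One global atomicAdd per bin per block, instead of one per element.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```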

Conclusion

Shared memory optimization follows a clear pattern: identify redundant global memory accesses, load data into SMEM tiles, pad arrays to eliminate bank conflicts, and synchronize threads between load and compute phases. The payoff ranges from 3x (reductions) to 15x (GEMM tiling) depending on the data reuse factor. Always verify with Nsight Compute that bank conflicts are eliminated -- a single 32-way conflict can negate the entire benefit of tiling.