Full fine-tuning a 70B parameter model requires storing the model weights, the gradients, and the optimizer states in memory simultaneously. With AdamW (the standard optimizer for LLM fine-tuning), each parameter requires:

  • 2 bytes for the FP16 weight
  • 2 bytes for the FP16 gradient
  • 4 bytes for the FP32 master weight (mixed precision)
  • 4 bytes for the FP32 first moment (Adam m)
  • 4 bytes for the FP32 second moment (Adam v)

Total: 16 bytes per parameter. For 70 billion parameters:

\text{Memory} = 70 \times 10^9 \times 16 = 1.12 \times 10^{12} \text{ bytes} = 1.12 \text{ TB}

No single GPU has 1.12 TB of memory. An 8-GPU node with 80 GB H100s provides 640 GB total. Even with DeepSpeed ZeRO Stage 3 sharding across all 8 GPUs, each GPU holds 140 GB of state — still exceeding the 80 GB limit. Full fine-tuning a 70B model requires at minimum 16 H100s (1.28 TB total), and realistically 32 GPUs after accounting for activations and batch data.
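The arithmetic above can be sketched as a back-of-envelope helper (illustrative only — real per-GPU usage also includes activations, buffers, and communication overhead):

```python
def full_finetune_memory_gb(params_billions, bytes_per_param=16):
    """Full fine-tuning with AdamW mixed precision:
    2 (FP16 weight) + 2 (FP16 grad) + 4 (FP32 master)
    + 4 (Adam m) + 4 (Adam v) = 16 bytes/param."""
    return params_billions * 1e9 * bytes_per_param / 1e9

def zero3_shard_gb(total_gb, num_gpus):
    """ZeRO Stage 3 shards weights, gradients, and optimizer state evenly."""
    return total_gb / num_gpus

total = full_finetune_memory_gb(70)  # 1120.0 GB
print(zero3_shard_gb(total, 8))      # 140.0 GB per GPU -> exceeds 80 GB
print(zero3_shard_gb(total, 16))     # 70.0 GB per GPU -> fits, before activations
```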

Parameter-efficient fine-tuning (PEFT) methods reduce the memory requirement by orders of magnitude. This post covers five methods — LoRA, DoRA, QLoRA, GaLore, and LISA — with the math, the memory budgets, the quality tradeoffs, and a decision matrix for when to use each.

1. LoRA: Low-Rank Adaptation

The Math

A pretrained weight matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is frozen. The fine-tuned weight is:

W = W_0 + \Delta W = W_0 + BA

where $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$, with rank $r \ll \min(d_{\text{out}}, d_{\text{in}})$.

During the forward pass:

h = Wx = W_0 x + BAx

$B$ is initialized to zeros, $A$ is initialized with Kaiming uniform. This ensures $\Delta W = BA = 0$ at the start of training — the model begins from the pretrained behavior.

A scaling factor $\alpha / r$ is applied:

h = W_0 x + \frac{\alpha}{r} BAx

Typically $\alpha = r$ (so the effective scale is 1.0), but $\alpha$ can be tuned independently. Higher $\alpha$ amplifies the adapter’s contribution; lower $\alpha$ keeps the model closer to the pretrained weights.

Parameter Count

For Llama 70B with $d_{\text{model}} = 8192$, LoRA is typically applied to the attention weight matrices: $W_Q$, $W_K$, $W_V$, $W_O$. Each has dimensions $8192 \times 8192$ (ignoring GQA for now).

Per adapted matrix at rank $r = 16$:

\text{LoRA params} = d_{\text{out}} \times r + r \times d_{\text{in}} = 8192 \times 16 + 16 \times 8192 = 262{,}144

Across 4 attention matrices, 80 layers:

\text{Total LoRA params} = 4 \times 80 \times 262{,}144 = 83{,}886{,}080 \approx 84\text{M}

That is 0.12% of the base model’s 70B parameters.

Memory Budget

def lora_memory_budget(
    model_params_billions,
    rank,
    num_adapted_matrices_per_layer,
    num_layers,
    d_model,
    base_dtype_bytes=2,     # FP16
    adapter_dtype_bytes=2,  # FP16
):
    """
    Calculate total memory for LoRA fine-tuning.
    """
    # Base model: frozen, no gradients or optimizer states
    base_model_bytes = model_params_billions * 1e9 * base_dtype_bytes

    # LoRA adapter parameters
    params_per_matrix = 2 * d_model * rank
    total_adapter_params = (
        num_adapted_matrices_per_layer * num_layers * params_per_matrix
    )

    # Adapter weights (FP16)
    adapter_weight_bytes = total_adapter_params * adapter_dtype_bytes
    # Adapter gradients (FP16)
    adapter_grad_bytes = total_adapter_params * adapter_dtype_bytes
    # Optimizer states: master weights (FP32) + m (FP32) + v (FP32)
    adapter_optimizer_bytes = total_adapter_params * (4 + 4 + 4)

    total_adapter_bytes = (
        adapter_weight_bytes + adapter_grad_bytes + adapter_optimizer_bytes
    )

    return {
        "base_model_gb": base_model_bytes / 1e9,
        "adapter_total_gb": total_adapter_bytes / 1e9,
        "total_gb": (base_model_bytes + total_adapter_bytes) / 1e9,
        "adapter_params_millions": total_adapter_params / 1e6,
    }

# Llama 70B, rank 16, 4 attention matrices, 80 layers
budget = lora_memory_budget(70, 16, 4, 80, 8192)
# base_model_gb: 140.0
# adapter_total_gb: 1.34
# total_gb: 141.34
# adapter_params_millions: 83.9

LoRA Memory Budget (Llama 70B)

| Component | Size | Percentage |
|---|---|---|
| Base model (FP16, frozen) | 140.0 GB | 99.05% |
| Adapter weights (FP16) | 0.168 GB | 0.12% |
| Adapter gradients (FP16) | 0.168 GB | 0.12% |
| Adapter optimizer (FP32, m+v+master) | 1.006 GB | 0.71% |
| Total | 141.34 GB | 100% |

LoRA reduces the trainable state from 1.12 TB (full fine-tuning) to 1.34 GB. The base model still requires 140 GB, but it needs no gradients or optimizer states.

Implementation

import math
import torch

class LoRALinear(torch.nn.Module):
    """LoRA-adapted linear layer."""

    def __init__(self, base_linear, rank, alpha):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # Freeze the base weights
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank

        d_out, d_in = base_linear.weight.shape
        # A: [rank, d_in], initialized with Kaiming uniform
        self.lora_A = torch.nn.Parameter(torch.empty(rank, d_in))
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

        # B: [d_out, rank], initialized to zero so delta_W = BA = 0 at start
        self.lora_B = torch.nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Base forward: the weights are frozen (requires_grad=False), but the
        # graph is kept so gradients can still flow through the base path to
        # reach earlier layers' LoRA adapters
        base_out = self.base(x)

        # LoRA forward
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T
        return base_out + self.scaling * lora_out

Rank Selection

The rank $r$ controls the expressiveness-efficiency tradeoff:

LoRA Quality vs Rank (Llama 7B, Alpaca Eval)

| Rank | Win rate | Note |
|---|---|---|
| r=1 | 72.1% | Underfitting |
| r=4 | 78.3% | |
| r=8 | 81.6% | |
| r=16 | 83.2% | |
| r=32 | 83.8% | |
| r=64 | 84.0% | Diminishing returns |
| r=128 | 84.1% | |
| Full FT | 85.2% | Full fine-tuning |

The gap between $r = 16$ and full fine-tuning is 2.0 percentage points. Between $r = 16$ and $r = 128$ it is 0.9 points. The standard recommendation: start with $r = 16$, increase to 64 only if quality is insufficient.

2. DoRA: Weight-Decomposed Low-Rank Adaptation

The Insight

DoRA (Liu et al., 2024) observes that LoRA’s updates to $W$ change both the magnitude and direction of the weight vectors simultaneously. Full fine-tuning tends to change direction more than magnitude. DoRA decouples the two: apply LoRA only to the direction component.

Decompose each column $w_i$ of $W$ (or equivalently, each row, depending on convention) into magnitude and direction:

W = m \cdot \frac{V}{\|V\|_c}

where $m \in \mathbb{R}^{d_{\text{out}}}$ is a learnable magnitude vector (one scalar per output dimension), $V$ is the direction matrix, and $\|V\|_c$ denotes the column-wise norm.

The fine-tuned weight becomes:

W' = m' \cdot \frac{V + \Delta V}{\|V + \Delta V\|_c}

where $\Delta V = BA$ (standard LoRA applied to the direction matrix $V$), and $m'$ is the fine-tuned magnitude vector.

Implementation

import math
import torch

class DoRALinear(torch.nn.Module):
    """DoRA: Weight-Decomposed Low-Rank Adaptation."""

    def __init__(self, base_linear, rank, alpha):
        super().__init__()
        self.rank = rank
        self.scaling = alpha / rank

        d_out, d_in = base_linear.weight.shape
        W = base_linear.weight.data  # [d_out, d_in]

        # Decompose the pretrained weight into magnitude and direction.
        # "Column-wise" norm in the paper's convention; with the
        # [d_out, d_in] layout here, that is the norm of each row.
        norms = torch.norm(W, dim=1, keepdim=True)  # [d_out, 1]

        # m: learnable magnitude vector, initialized from pretrained norms
        self.magnitude = torch.nn.Parameter(norms.squeeze(1))  # [d_out]

        # V: direction matrix. It is NOT a parameter; it is the frozen
        # base weight, registered as a buffer so it moves with the module
        self.register_buffer("base_weight", W)

        # LoRA adapters for the direction update delta_V = B @ A
        self.lora_A = torch.nn.Parameter(torch.empty(rank, d_in))
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B = torch.nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Compute V + delta_V
        delta_V = self.lora_B @ self.lora_A  # [d_out, d_in]
        V_prime = self.base_weight + self.scaling * delta_V

        # Normalize: keep only the direction
        V_prime_norm = torch.norm(V_prime, dim=1, keepdim=True)  # [d_out, 1]
        V_prime_normalized = V_prime / V_prime_norm.clamp(min=1e-8)

        # Reapply the learnable magnitude
        W_prime = self.magnitude.unsqueeze(1) * V_prime_normalized  # [d_out, d_in]

        return x @ W_prime.T

DoRA vs LoRA: Parameter Overhead

DoRA adds $d_{\text{out}}$ learnable parameters per adapted matrix (the magnitude vector $m'$). For $d_{\text{out}} = 8192$:

\text{DoRA extra params per matrix} = 8{,}192

\text{LoRA params per matrix at } r = 16: 262{,}144

\text{DoRA overhead} = \frac{8{,}192}{262{,}144} = 3.1\%

The magnitude vector is negligible. DoRA’s memory cost is essentially identical to LoRA.
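The overhead arithmetic is easy to verify in a couple of lines:

```python
d_out, rank = 8192, 16
dora_extra = d_out               # one magnitude scalar per output dimension
lora_params = 2 * d_out * rank   # A and B at rank 16
print(dora_extra, lora_params, dora_extra / lora_params)
# 8192 262144 0.03125  (~3.1%)
```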


DoRA vs LoRA Quality Comparison

| Benchmark | LoRA (r=16) | DoRA (r=16) | Full FT | DoRA Gain over LoRA |
|---|---|---|---|---|
| Alpaca Eval (Llama-7B) | 83.2% | 83.7% | 85.2% | +0.5% |
| MMLU (Llama-7B) | 52.8% | 53.4% | 54.1% | +0.6% |
| GSM8K (Llama-7B) | 38.4% | 39.1% | 40.2% | +0.7% |
| HumanEval (Code Llama-7B) | 32.9% | 33.6% | 35.4% | +0.7% |
| Avg across 8 benchmarks | | | | +0.5% |

DoRA consistently outperforms LoRA by 0.5-0.7% across benchmarks at the same rank. The gain is small but free — no additional memory, minimal compute overhead (one extra normalization per forward pass).

When DoRA Matters

The 0.5% average gain sounds small, but on tasks where direction change dominates (math reasoning, code generation), DoRA’s advantage can reach 0.7-1.0%. The magnitude-direction decomposition better preserves the pretrained model’s learned feature scales while allowing the direction (which encodes semantic relationships) to adapt. For instruction-following tasks where the change from pretrained is mostly stylistic (small direction change, large magnitude change), LoRA and DoRA perform similarly.

3. QLoRA: Quantized Low-Rank Adaptation

The Key Idea

QLoRA (Dettmers et al., 2023) quantizes the frozen base model to 4-bit NF4 (Normal Float 4-bit) and keeps the LoRA adapters in FP16. The base model shrinks from 2 bytes per parameter to 0.5 bytes, reducing the dominant memory term by 4x.

For Llama 70B:

\text{Base model (FP16)} = 70 \times 10^9 \times 2 = 140 \text{ GB}

\text{Base model (NF4)} = 70 \times 10^9 \times 0.5 = 35 \text{ GB}

Adding LoRA adapters (1.34 GB) and activations (~5-10 GB):

\text{QLoRA total} \approx 35 + 1.34 + 7 = 43.34 \text{ GB}

A 70B model fine-tuned on a single 48 GB GPU (A6000 or L40S). That is the practical impact.
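Putting those numbers together as a rough budget helper (the 7 GB activation figure is an assumed mid-range value, and the NF4 scale factors are ignored here):

```python
def qlora_memory_gb(params_billions, adapter_params_millions,
                    activations_gb=7.0):
    """Rough QLoRA budget: NF4 base (0.5 bytes/param) + FP16 adapters with
    FP16 grads and FP32 Adam state (16 bytes/adapter param) + activations.
    Ignores NF4 scale-factor storage (~0.5-1 GB for a 70B model)."""
    base_nf4 = params_billions * 1e9 * 0.5 / 1e9
    adapter_state = adapter_params_millions * 1e6 * 16 / 1e9
    return base_nf4 + adapter_state + activations_gb

print(qlora_memory_gb(70, 84))  # ~43.3 GB -> fits a 48 GB GPU
```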

NF4 Quantization

NF4 is not uniform quantization. It assumes the weight distribution is approximately normal (which empirical evidence supports for transformer weights). The 16 quantization levels are spaced according to the quantiles of a standard normal distribution:

def compute_nf4_levels():
    """
    Compute the 16 NF4 quantization levels (Dettmers et al., 2023).
    The levels are quantiles of N(0, 1), constructed asymmetrically
    (8 non-negative, 7 negative, plus exact zero) so that zero is
    exactly representable.
    """
    import scipy.stats as stats

    offset = 0.9677083  # Tail probability from the QLoRA reference implementation
    # 8 positive levels: quantiles from `offset` down toward 0.5 (exclusive)
    pos = [stats.norm.ppf(offset - i * (offset - 0.5) / 8) for i in range(8)]
    # 7 negative levels: mirrored quantiles on a slightly coarser grid
    neg = [-stats.norm.ppf(offset - i * (offset - 0.5) / 7) for i in range(7)]

    levels = sorted(neg + [0.0] + pos)
    # Normalize so the maximum absolute value is 1
    max_abs = max(abs(l) for l in levels)
    return [l / max_abs for l in levels]

NF4_LEVELS = compute_nf4_levels()
# Approximately: [-1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
#                  0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0]

Each weight is quantized to the nearest NF4 level, then stored as a 4-bit index. The dequantization lookup table holds the 16 level values; each quantization group (typically 64 or 128 weights) shares one scale factor:

from typing import NamedTuple

class QuantizedTensor(NamedTuple):
    indices: torch.Tensor   # 4-bit level indices (kept unpacked in uint8 here)
    scales: torch.Tensor    # One absmax scale per group
    group_size: int
    shape: torch.Size       # Original tensor shape, for dequantization

def quantize_nf4(tensor, group_size=64):
    """Quantize a tensor to NF4 format (tail elements beyond a full
    group are dropped in this sketch)."""
    flat = tensor.reshape(-1)
    num_groups = len(flat) // group_size
    flat = flat[:num_groups * group_size].reshape(num_groups, group_size)

    # Per-group absmax scaling
    scales = flat.abs().max(dim=1, keepdim=True).values  # [num_groups, 1]
    normalized = flat / scales.clamp(min=1e-8)            # in [-1, 1]

    # Find the nearest NF4 level for each element
    nf4 = torch.tensor(NF4_LEVELS, dtype=tensor.dtype)
    # distances[g][j][k] = |normalized[g][j] - nf4[k]|
    distances = (normalized.unsqueeze(-1) - nf4).abs()  # [groups, group_size, 16]
    indices = distances.argmin(dim=-1)                  # 4-bit indices in [0..15]

    # Real implementations pack two 4-bit indices per byte;
    # one uint8 per index is kept here for clarity
    return QuantizedTensor(
        indices=indices.to(torch.uint8),
        scales=scales.squeeze(1),
        group_size=group_size,
        shape=tensor.shape,
    )

def dequantize_nf4(qtensor):
    """Dequantize NF4 back to floating point for the forward pass."""
    nf4 = torch.tensor(NF4_LEVELS, dtype=torch.float16)
    values = nf4[qtensor.indices.long()]  # Lookup: index -> NF4 level
    values = values * qtensor.scales.unsqueeze(1).to(values.dtype)  # Rescale per group
    return values.reshape(-1)

Double Quantization

QLoRA applies a second level of quantization to the scale factors themselves. Each group’s FP32 scale factor is quantized to FP8, reducing the scale storage from 4 bytes to 1 byte per group. For group size 64:

\text{Scale overhead (FP32)} = \frac{4 \text{ bytes}}{64 \times 0.5 \text{ bytes}} = 12.5\%

\text{Scale overhead (FP8, double quant)} = \frac{1 \text{ byte}}{64 \times 0.5 \text{ bytes}} = 3.1\%

Double quantization saves 9.4% on the scale overhead, which for a 70B model translates to:

\text{Savings} = 70 \times 10^9 \times 0.5 \times 0.094 = 3.29 \text{ GB}
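As a sanity check on those numbers (a back-of-envelope sketch, not the bitsandbytes implementation):

```python
def scale_overhead(group_size=64, weight_bits=4, scale_bytes=4):
    """Per-weight relative overhead of storing one scale per group."""
    weight_bytes = group_size * weight_bits / 8
    return scale_bytes / weight_bytes

fp32 = scale_overhead(scale_bytes=4)          # 0.125   (12.5%)
fp8 = scale_overhead(scale_bytes=1)           # 0.03125 (~3.1%)
savings_gb = 70e9 * 0.5 * (fp32 - fp8) / 1e9  # ~3.28 GB for a 70B model
print(fp32, fp8, savings_gb)
```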

QLoRA Forward Pass

class QLoRALinear:
    """QLoRA: 4-bit NF4 base + FP16 LoRA adapters."""

    def __init__(self, base_weight_quantized, rank, alpha):
        self.base_q = base_weight_quantized  # NF4 QuantizedTensor
        self.rank = rank
        self.scaling = alpha / rank

        d_out, d_in = base_weight_quantized.shape
        self.lora_A = torch.nn.Parameter(
            torch.empty(rank, d_in, dtype=torch.float16)
        )
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B = torch.nn.Parameter(
            torch.zeros(d_out, rank, dtype=torch.float16)
        )

    def forward(self, x):
        # Dequantize base weight on-the-fly (NF4 -> FP16)
        W_base = dequantize_nf4(self.base_q).reshape(
            self.base_q.shape
        ).to(x.dtype)

        # Base forward
        base_out = x @ W_base.T

        # LoRA forward (FP16 throughout)
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T

        return base_out + self.scaling * lora_out
ℹ️ Dequantization Cost

Dequantizing NF4 to FP16 during the forward pass adds compute overhead. For each weight element: one table lookup (4-bit index to FP16 value) and one multiplication (by the group scale). On H100, this adds approximately 5-10% to the forward pass time compared to FP16 base weights. The memory savings (4x) far outweigh the compute cost.

QLoRA Memory Layout (Llama 70B)

| Component | Detail | Size | Note |
|---|---|---|---|
| Base model (NF4, frozen) | 70B params at 0.5 bytes/param | 35 GB | Read-only, no gradients |
| NF4 scale factors (FP8) | 1 scale per 64 weights | 0.57 GB | Double-quantized |
| LoRA adapters (FP16) | 84M params at 2 bytes/param | 0.168 GB | Trainable |
| LoRA gradients (FP16) | 84M params at 2 bytes/param | 0.168 GB | Computed each step |
| LoRA optimizer (FP32) | 84M params x 12 bytes (m + v + master) | 1.006 GB | AdamW state |
| Activations + batch data | Depends on batch size and seq len | 5-10 GB | Typical |

Memory Comparison: Full FT vs LoRA vs QLoRA (Llama 70B)

| Method | Base Model | Trainable State | Total | Min GPUs (80 GB) |
|---|---|---|---|---|
| Full Fine-Tuning (FP16+FP32) | 140 GB | 980 GB (grads + optim) | 1,120 GB | 16 (ZeRO-3) |
| LoRA (FP16 base, r=16) | 140 GB | 1.34 GB | 141.34 GB | 2 |
| QLoRA (NF4 base, r=16) | 35.6 GB | 1.34 GB | 43.34 GB | 1 |
| QLoRA (NF4 base, r=64) | 35.6 GB | 5.37 GB | 47.37 GB | 1 |

4. GaLore: Gradient Low-Rank Projection

The Key Idea

GaLore (Zhao et al., 2024) takes a fundamentally different approach. Instead of constraining the weight update to low-rank (like LoRA), it constrains the gradient to low-rank during training but updates the full weight matrix. At inference, there is no adapter — the weights are the full fine-tuned weights.

The gradient matrix $G \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ at each step is projected into a low-rank subspace:

\tilde{G} = P^T G Q

where $P \in \mathbb{R}^{d_{\text{out}} \times r}$ and $Q \in \mathbb{R}^{d_{\text{in}} \times r}$ are projection matrices obtained from the SVD of $G$ (computed periodically, not every step).

The optimizer operates on the projected gradient $\tilde{G} \in \mathbb{R}^{r \times r}$ instead of the full $G \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$. This reduces the optimizer state from $O(d^2)$ to $O(r^2)$ per weight matrix.
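The shapes can be checked with a small synthetic example (dimensions shrunk for illustration):

```python
import torch

d_out, d_in, r = 512, 256, 16
G = torch.randn(d_out, d_in)  # gradient matrix

U, S, Vt = torch.linalg.svd(G, full_matrices=False)
P = U[:, :r]      # [d_out, r]
Q = Vt[:r, :].T   # [d_in, r]

G_proj = P.T @ G @ Q        # [r, r]: optimizer state lives in this space
delta_W = P @ G_proj @ Q.T  # [d_out, d_in]: update projected back to full space

print(G_proj.shape, delta_W.shape)  # torch.Size([16, 16]) torch.Size([512, 256])
```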

Memory Analysis

For a weight matrix of dimensions $d_{\text{out}} \times d_{\text{in}}$ with rank $r$:

Full fine-tuning optimizer state:

\text{Adam state} = d_{\text{out}} \times d_{\text{in}} \times (4 + 4) = 8 \cdot d_{\text{out}} \cdot d_{\text{in}} \text{ bytes (FP32 } m + v \text{)}

GaLore optimizer state:

\text{Adam state} = r \times r \times (4 + 4) = 8r^2 \text{ bytes}

For $d_{\text{out}} = d_{\text{in}} = 8192$ and $r = 128$:

\text{Full: } 8 \times 8192 \times 8192 = 512 \text{ MB per matrix}

\text{GaLore: } 8 \times 128 \times 128 = 128 \text{ KB per matrix}

A $4{,}096\times$ reduction in optimizer state per weight matrix.
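In bytes:

```python
full_state = 8 * 8192 * 8192   # 536,870,912 bytes = 512 MiB per matrix
galore_state = 8 * 128 * 128   # 131,072 bytes = 128 KiB per matrix
print(full_state // galore_state)  # 4096
```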

Implementation

class GaLoreOptimizer:
    """
    GaLore: Gradient Low-Rank Projection for memory-efficient training.
    Updates full weights but stores optimizer state in low-rank form.
    """

    def __init__(self, params, lr, rank, svd_interval=200):
        self.lr = lr
        self.rank = rank
        self.svd_interval = svd_interval  # Recompute projections every N steps
        self.step_count = 0

        # Per-parameter state
        self.state = {}
        for p in params:
            self.state[p] = {
                "P": None,           # Left projection [d_out, rank]
                "Q": None,           # Right projection [d_in, rank]
                "m": None,           # First moment in projected space [rank, rank]
                "v": None,           # Second moment in projected space [rank, rank]
                "last_svd_step": -1,
            }

    def step(self):
        self.step_count += 1
        for p in self.state:
            if p.grad is None:
                continue

            s = self.state[p]
            G = p.grad.data  # [d_out, d_in]

            # Periodically recompute projection matrices via SVD
            if (self.step_count - s["last_svd_step"]) >= self.svd_interval:
                U, S_vals, Vt = torch.linalg.svd(G, full_matrices=False)
                s["P"] = U[:, :self.rank]       # [d_out, rank]
                s["Q"] = Vt[:self.rank, :].T    # [d_in, rank]
                s["last_svd_step"] = self.step_count

                # Reset optimizer state when subspace changes
                s["m"] = torch.zeros(self.rank, self.rank, device=p.device)
                s["v"] = torch.zeros(self.rank, self.rank, device=p.device)

            # Project gradient: G_proj = P^T @ G @ Q  [rank, rank]
            G_proj = s["P"].T @ G @ s["Q"]

            # Adam update in projected space
            s["m"] = 0.9 * s["m"] + 0.1 * G_proj
            s["v"] = 0.999 * s["v"] + 0.001 * G_proj ** 2
            m_hat = s["m"] / (1 - 0.9 ** self.step_count)
            v_hat = s["v"] / (1 - 0.999 ** self.step_count)
            update_proj = m_hat / (v_hat.sqrt() + 1e-8)

            # Project back to full space: delta_W = P @ update_proj @ Q^T
            delta_W = s["P"] @ update_proj @ s["Q"].T

            # Update full weight
            p.data -= self.lr * delta_W

GaLore vs LoRA: Key Differences


GaLore vs LoRA Structural Comparison

| Property | LoRA | GaLore |
|---|---|---|
| What is low-rank? | The weight update (delta_W = BA) | The gradient projection |
| Final weights | W + BA (adapter required) | Full W (no adapter) |
| Inference overhead | Extra matmul per layer (or merge) | None |
| Where memory is saved | Fewer trainable params -> smaller optim state | Smaller projected optim state |
| SVD computation | None | Every N steps (expensive) |
| Serving complexity | Adapter loading/switching | Standard model serving |
⚠️ GaLore's SVD Cost

Computing the SVD of the gradient matrix every 200 steps is expensive. For an $8192 \times 8192$ matrix, truncated SVD to rank 128 takes approximately 200 ms on an H100. Across 560 weight matrices (4 attention + 3 FFN per layer, 80 layers for a ~70B model if applied to all), that is $560 \times 0.2 = 112$ seconds every 200 steps, or 0.56 seconds per step amortized. For a training step that takes 5-10 seconds, this is 6-11% overhead. Acceptable, but not free.

GaLore Memory Budget

def galore_memory_budget(
    model_params_billions,
    rank,
    num_adapted_matrices_per_layer,
    num_layers,
    d_model,
):
    base_bytes = model_params_billions * 1e9 * 2  # FP16 weights
    gradient_bytes = model_params_billions * 1e9 * 2  # FP16 gradients (full)

    # Optimizer state: rank x rank per adapted matrix
    optim_per_matrix = rank * rank * (4 + 4)  # m + v in FP32
    total_optim = num_adapted_matrices_per_layer * num_layers * optim_per_matrix

    # Projection matrices: P [d_out, rank] + Q [d_in, rank] per matrix, FP16
    proj_per_matrix = 2 * d_model * rank * 2
    total_proj = num_adapted_matrices_per_layer * num_layers * proj_per_matrix

    return {
        "base_model_gb": base_bytes / 1e9,
        "gradients_gb": gradient_bytes / 1e9,
        "optimizer_gb": total_optim / 1e9,
        "projections_gb": total_proj / 1e9,
        "total_gb": (base_bytes + gradient_bytes + total_optim + total_proj) / 1e9,
    }

# Llama 70B, rank 128, 7 matrices per layer, 80 layers
budget = galore_memory_budget(70, 128, 7, 80, 8192)
# base_model_gb: 140.0
# gradients_gb: 140.0
# optimizer_gb: 0.073 GB  (vs 980 GB for full Adam!)
# projections_gb: 2.35 GB
# total_gb: 282.42 GB

GaLore’s total memory is 282 GB — far less than full fine-tuning’s 1,120 GB but more than LoRA’s 141 GB. The bottleneck is that GaLore still needs full-precision gradients for the SVD projection. This requires the full gradient tensor to be materialized, even though the optimizer only operates on the projected version.

5. LISA: Layerwise Importance Sampled Adaptation

The Key Idea

LISA (Pan et al., 2024) observes that not all layers need to be updated every training step. At each step, LISA randomly selects a small subset of layers to unfreeze (compute gradients and update weights), while all other layers remain frozen. This is conceptually simple: randomly skip most of the backward pass.

The algorithm:

  1. At each training step, sample $k$ layers uniformly at random from the $L$ total layers.
  2. Only compute gradients for the $k$ selected layers.
  3. Update weights of the selected layers using standard Adam.
  4. The remaining $L - k$ layers contribute to the forward pass but not the backward pass.

Implementation

import random

class LISATrainer:
    """
    LISA: Layerwise Importance Sampled Adaptation.
    Randomly freeze most layers each step.
    """

    def __init__(self, model, optimizer, num_active_layers, total_layers):
        self.model = model
        self.optimizer = optimizer
        self.num_active = num_active_layers  # k
        self.total_layers = total_layers      # L
        # Always unfreeze: embedding layer and output head
        self.always_active = ["embed_tokens", "lm_head"]

    def train_step(self, batch):
        # Step 1: Randomly select k layers to unfreeze
        layer_indices = list(range(self.total_layers))
        active_indices = set(random.sample(layer_indices, self.num_active))

        # Step 2: Set requires_grad based on selection
        for idx, layer in enumerate(self.model.layers):
            for param in layer.parameters():
                param.requires_grad = (idx in active_indices)

        # Always-active components
        for name in self.always_active:
            module = getattr(self.model, name)
            for param in module.parameters():
                param.requires_grad = True

        # Step 3: Forward pass (all layers participate)
        outputs = self.model(**batch)
        loss = outputs.loss

        # Step 4: Backward pass (only active layers compute gradients)
        loss.backward()

        # Step 5: Optimizer step (PyTorch skips params whose .grad is None).
        # To keep Adam state bounded to k layers, optimizer state for
        # deselected layers must be discarded when the selection changes.
        self.optimizer.step()
        self.optimizer.zero_grad()

        return loss.item()

Memory Analysis

With $k$ active layers out of $L$ total, LISA stores optimizer states for only $k$ layers’ worth of parameters. For Llama 70B with $L = 80$ and $k = 2$:

\text{Active params} = \frac{k}{L} \times N = \frac{2}{80} \times 70\text{B} = 1.75\text{B}

\text{Optimizer state} = 1.75 \times 10^9 \times 12 = 21 \text{ GB}

But LISA must also store FP16 gradients for the active layers and the always-active components:

\text{Gradient memory} = 1.75 \times 10^9 \times 2 = 3.5 \text{ GB}

Total:

\text{LISA total} = 140 \text{ (base)} + 21 \text{ (optim)} + 3.5 \text{ (grads)} = 164.5 \text{ GB}
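The same arithmetic as a helper (a rough sketch that ignores the always-active embedding and output head, which add a few GB more):

```python
def lisa_memory_gb(params_billions, num_layers, k,
                   base_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough LISA budget: frozen FP16 base plus FP16 gradients and
    FP32 Adam state for the k currently active layers out of num_layers."""
    n = params_billions * 1e9
    active = n * k / num_layers
    base = n * base_bytes
    return (base + active * (grad_bytes + optim_bytes)) / 1e9

print(lisa_memory_gb(70, 80, 2))  # 164.5
print(lisa_memory_gb(70, 80, 4))  # 189.0
```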

ℹ️ LISA's Hidden Advantage: No Adapter at Inference

LISA updates the full model weights (for the sampled layers). At inference, there is no adapter to load, merge, or switch. The model serves exactly as a standard model. This eliminates the operational complexity of adapter management, which matters in production deployments serving hundreds of fine-tuned variants.

LISA vs LoRA Quality

LISA vs LoRA vs Full FT Quality (Llama 7B)

| Method | Alpaca Eval win rate |
|---|---|
| LoRA r=16 | 83.2% |
| LISA k=2 | 82.8% |
| LISA k=4 | 83.9% |
| LISA k=8 | 84.5% |
| Full FT | 85.2% |

LISA with $k = 4$ (5% of layers active) slightly outperforms LoRA at $r = 16$. With $k = 8$ (10% of layers), it approaches full fine-tuning quality. The key tradeoff: LISA’s memory is higher than LoRA’s (164.5 GB at $k = 2$ vs 141 GB for 70B) but far lower than full fine-tuning’s (1,120 GB), and it produces standard model weights with no inference overhead.

6. Comprehensive Comparison


Method Comparison (Llama 70B, Standard Settings)

| Method | Memory | Quality (relative) | Inference Overhead | Adapter at Inference? |
|---|---|---|---|---|
| Full FT | 1,120 GB | 100% (baseline) | None | No |
| LoRA r=16 | 141 GB | 97.6% | +2-5% (extra matmul) | Yes (or merge) |
| DoRA r=16 | 141 GB | 98.2% | +3-6% (extra matmul + norm) | Yes (or merge) |
| QLoRA r=16 | 43 GB | 96.8% | +2-5% (if merged to FP16) | Yes (or merge) |
| GaLore r=128 | 282 GB | 99.1% | None | No |
| LISA k=4 | 189 GB | 98.5% | None | No |

Memory Breakdown by Method (Llama 70B)

| Method | Total | Breakdown | Notes |
|---|---|---|---|
| Full FT | 1,120 GB | Weights (140) + Grads (140) + Optim (840) | 16 bytes/param. Needs 16+ GPUs (80 GB each). |
| LoRA r=16 | 141 GB | Frozen base (140) + Adapter train state (1.34) | 2 GPUs. 0.12% trainable params. |
| QLoRA r=16 | 43 GB | NF4 base (35.6) + Adapter train state (1.34) + Activations (6.4) | 1 GPU (48 GB). 4-bit base model. |
| GaLore r=128 | 282 GB | Base (140) + Full grads (140) + Projected optim (0.07) + Proj matrices (2.35) | 4 GPUs. Full-weight updates, no adapter. |
| LISA k=4 | 189 GB | Base (140) + Partial grads (7) + Partial optim (42) | 3 GPUs. Random layer selection. |

7. Decision Matrix

The method selection depends on four factors: available GPU memory, quality requirements, whether you need adapter-based serving (multi-tenant), and training throughput requirements.

def select_finetuning_method(
    model_size_billions,
    available_gpu_memory_gb,
    num_gpus,
    quality_requirement,       # "maximum", "high", "acceptable"
    multi_tenant_serving,      # True if serving many fine-tuned variants
    training_throughput_priority,  # "high", "medium", "low"
):
    """
    Decision function for selecting PEFT method.
    """
    total_memory_gb = available_gpu_memory_gb * num_gpus
    base_fp16_gb = model_size_billions * 2  # GB

    # Can we even fit the base model?
    if total_memory_gb < base_fp16_gb * 0.3:
        return "Model too large for available hardware"

    # QLoRA: when memory is extremely constrained
    base_nf4_gb = model_size_billions * 0.5
    if total_memory_gb < base_fp16_gb + 5:
        if total_memory_gb >= base_nf4_gb + 10:
            return "QLoRA"
        return "Insufficient memory"

    # Multi-tenant serving: LoRA/DoRA (adapter-based)
    if multi_tenant_serving:
        if quality_requirement == "maximum":
            return "DoRA"
        return "LoRA"

    # Maximum quality, sufficient memory for GaLore
    full_ft_gb = model_size_billions * 16
    galore_gb = base_fp16_gb * 2 + 20  # Rough estimate
    lisa_gb = base_fp16_gb + 25         # Rough estimate

    if quality_requirement == "maximum":
        if total_memory_gb >= full_ft_gb:
            return "Full Fine-Tuning"
        if total_memory_gb >= galore_gb:
            return "GaLore"
        if total_memory_gb >= lisa_gb:
            return "LISA k=8"
        return "DoRA"

    if quality_requirement == "high":
        if total_memory_gb >= lisa_gb:
            return "LISA k=4"
        return "DoRA"

    # Acceptable quality, minimize cost
    if total_memory_gb >= base_nf4_gb + 10:
        return "QLoRA"
    return "LoRA"

Decision Matrix Summary

| Scenario | Recommended Method | Reason |
|---|---|---|
| 1 GPU (48 GB), 70B model | QLoRA | Only method that fits |
| 2 GPUs (80 GB), 70B, multi-tenant | LoRA or DoRA | Adapter-based serving needed |
| 4 GPUs (80 GB), 70B, max quality, no adapters | GaLore or LISA | Full-weight update, no inference overhead |
| 16+ GPUs, 70B, max quality | Full Fine-Tuning | Best quality, sufficient memory |
| 1 GPU (24 GB), 7B model | QLoRA | Memory constrained |
| 1 GPU (80 GB), 7B model, multi-tenant | LoRA | 14 GB base + 0.3 GB adapter; plenty of room |
| 8 GPUs (80 GB), 70B, fast iteration | LoRA | Fastest training (fewest backward params) |

Training Throughput

Training speed varies significantly across methods because the backward pass cost scales with the number of parameters that require gradients:

Training Throughput (Llama 7B, 1x A100 80GB, batch=4)

| Method | Throughput (% of Full FT) | Note |
|---|---|---|
| Full FT (baseline) | 100% | ZeRO-3 needed |
| LoRA r=16 | 160% | 60% faster |
| DoRA r=16 | 145% | 45% faster |
| QLoRA r=16 | 130% | Dequantization overhead |
| GaLore r=128 | 85% | SVD overhead |
| LISA k=4 | 140% | 40% faster |

LoRA is the fastest because it computes gradients for only 0.12% of parameters. The backward pass for frozen layers still occurs (to propagate gradients to earlier LoRA modules), but no weight updates or optimizer steps are needed for frozen parameters.

GaLore is slower than full fine-tuning because of the SVD computation. LISA is faster than full fine-tuning because only $k/L$ of the layers compute weight gradients (though activation gradients still propagate backward through all layers to reach the earliest active one).

8. Combining Methods

Methods can be combined:

QLoRA + DoRA: Apply DoRA’s magnitude-direction decomposition on top of a 4-bit quantized base model. The magnitude vector is FP16 (8192 extra params per matrix), the direction LoRA adapters are FP16, and the base model is NF4. Memory cost is nearly identical to QLoRA with a ~0.5% quality improvement.

LISA + LoRA: Randomly select which layers get full-weight updates (LISA), apply LoRA to the remaining layers. This gives the selected layers full expressiveness while keeping memory bounded. Memory is similar to LISA alone, but the LoRA adapters on non-selected layers provide continuous adaptation.

GaLore + Quantized Base: Apply GaLore’s gradient projection with an INT8-quantized base model. The base model shrinks from 140 GB to 70 GB (INT8), and the optimizer state remains tiny (r×rr \times r per matrix). This brings GaLore to approximately 210 GB for a 70B model — closer to the 3-GPU range.

💡 Practical Recommendation for 2025-2026

For most production fine-tuning in 2025: use QLoRA for experimentation (fast iteration on 1 GPU), DoRA for production quality (2+ GPUs), and GaLore when you need full-weight updates without adapters. LISA is the right choice when you have moderate GPU memory and need standard model weights at inference (no adapter management overhead).

Reviewer Agent Validation

Challenge: Using only this post, implement a minimal LoRALinear class and demonstrate that at initialization ($B = 0$), the output equals the base model’s output exactly. Then apply one gradient step and show the output diverges.

Expected test:

import torch

d_in, d_out, rank = 64, 64, 4
base = torch.nn.Linear(d_in, d_out, bias=False)
lora = LoRALinear(base, rank=rank, alpha=rank)

x = torch.randn(1, d_in)

# At init: B=0, so LoRA contribution is zero
out_base = base(x)
out_lora = lora.forward(x)
assert torch.allclose(out_base, out_lora, atol=1e-6), (
    f"At init, LoRA output should equal base output. "
    f"Diff: {(out_base - out_lora).abs().max().item()}"
)

# Simulate one gradient step: set B to non-zero
with torch.no_grad():
    lora.lora_B.fill_(0.1)

out_after = lora.forward(x)
diff = (out_after - out_base).abs().max().item()
assert diff > 0.001, (
    f"After updating B, output should diverge from base. Diff: {diff}"
)
# The difference should be approximately: scaling * ||B @ A @ x||
# With B=0.1 (all entries), A ~ Kaiming init, x ~ N(0,1):
# expected diff ~ (alpha/rank) * 0.1 * sqrt(d_in) * sqrt(rank) ~ O(1)

If the Reviewer Agent can implement LoRALinear with correct zero-initialization of $B$, Kaiming initialization of $A$, the scaling factor $\alpha / r$, and the forward pass $W_0 x + (\alpha/r) BAx$, and both assertions pass, the LoRA formulation was explained with sufficient precision.