Full fine-tuning a 70B parameter model requires storing the model weights, the gradients, and the optimizer states in memory simultaneously. With AdamW (the standard optimizer for LLM fine-tuning), each parameter requires:

  • 2 bytes for the FP16 weight
  • 2 bytes for the FP16 gradient
  • 4 bytes for the FP32 master weight (mixed precision)
  • 4 bytes for the FP32 first moment (Adam m)
  • 4 bytes for the FP32 second moment (Adam v)

Total: 16 bytes per parameter. For 70 billion parameters:

\text{Memory} = 70 \times 10^9 \times 16 = 1.12 \times 10^{12} \text{ bytes} = 1.12 \text{ TB}

No single GPU has 1.12 TB of memory. An 8-GPU node with 80 GB H100s provides 640 GB total. Even with DeepSpeed ZeRO Stage 3 sharding across all 8 GPUs, each GPU holds 140 GB of state — still exceeding the 80 GB limit. Full fine-tuning a 70B model requires at minimum 16 H100s (1.28 TB total), and realistically 32 GPUs after accounting for activations and batch data.
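The arithmetic above can be sketched as a back-of-envelope helper (illustrative only — real per-GPU usage also includes activations, buffers, and communication overhead):

```python
def full_finetune_memory_gb(params_billions, bytes_per_param=16):
    """Full fine-tuning with AdamW mixed precision:
    2 (FP16 weight) + 2 (FP16 grad) + 4 (FP32 master)
    + 4 (Adam m) + 4 (Adam v) = 16 bytes/param."""
    return params_billions * 1e9 * bytes_per_param / 1e9

def zero3_shard_gb(total_gb, num_gpus):
    """ZeRO Stage 3 shards weights, gradients, and optimizer state evenly."""
    return total_gb / num_gpus

total = full_finetune_memory_gb(70)  # 1120.0 GB
print(zero3_shard_gb(total, 8))      # 140.0 GB per GPU -> exceeds 80 GB
print(zero3_shard_gb(total, 16))     # 70.0 GB per GPU -> fits, before activations
```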

Parameter-efficient fine-tuning (PEFT) methods reduce the memory requirement by orders of magnitude. This post covers five methods — LoRA, DoRA, QLoRA, GaLore, and LISA — with the math, the memory budgets, the quality tradeoffs, and a decision matrix for when to use each.

1. LoRA: Low-Rank Adaptation

The Math

A pretrained weight matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is frozen. The fine-tuned weight is:

W = W_0 + \Delta W = W_0 + BA

where $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$, with rank $r \ll \min(d_{\text{out}}, d_{\text{in}})$.

During the forward pass:

h = Wx = W_0 x + BAx

$B$ is initialized to zeros, $A$ is initialized with Kaiming uniform. This ensures $\Delta W = BA = 0$ at the start of training — the model begins from the pretrained behavior.

A scaling factor $\alpha / r$ is applied:

h = W_0 x + \frac{\alpha}{r} BAx

Typically $\alpha = r$ (so the effective scale is 1.0), but $\alpha$ can be tuned independently. Higher $\alpha$ amplifies the adapter’s contribution; lower $\alpha$ keeps the model closer to the pretrained weights.

Parameter Count

For Llama 70B with $d_{\text{model}} = 8192$, LoRA is typically applied to the attention weight matrices: $W_Q$, $W_K$, $W_V$, $W_O$. Each has dimensions $8192 \times 8192$ (ignoring GQA for now).

Per adapted matrix at rank $r = 16$:

\text{LoRA params} = d_{\text{out}} \times r + r \times d_{\text{in}} = 8192 \times 16 + 16 \times 8192 = 262{,}144

Across 4 attention matrices, 80 layers:

\text{Total LoRA params} = 4 \times 80 \times 262{,}144 = 83{,}886{,}080 \approx 84\text{M}

That is 0.12% of the base model’s 70B parameters.

Memory Budget

def lora_memory_budget(
    model_params_billions,
    rank,
    num_adapted_matrices_per_layer,
    num_layers,
    d_model,
    base_dtype_bytes=2,     # FP16
    adapter_dtype_bytes=2,  # FP16
):
    """
    Calculate total memory for LoRA fine-tuning.
    """
    # Base model: frozen, no gradients or optimizer states
    base_model_bytes = model_params_billions * 1e9 * base_dtype_bytes

    # LoRA adapter parameters
    params_per_matrix = 2 * d_model * rank
    total_adapter_params = (
        num_adapted_matrices_per_layer * num_layers * params_per_matrix
    )

    # Adapter weights (FP16)
    adapter_weight_bytes = total_adapter_params * adapter_dtype_bytes
    # Adapter gradients (FP16)
    adapter_grad_bytes = total_adapter_params * adapter_dtype_bytes
    # Optimizer states: master weights (FP32) + m (FP32) + v (FP32)
    adapter_optimizer_bytes = total_adapter_params * (4 + 4 + 4)

    total_adapter_bytes = (
        adapter_weight_bytes + adapter_grad_bytes + adapter_optimizer_bytes
    )

    return {
        "base_model_gb": base_model_bytes / 1e9,
        "adapter_total_gb": total_adapter_bytes / 1e9,
        "total_gb": (base_model_bytes + total_adapter_bytes) / 1e9,
        "adapter_params_millions": total_adapter_params / 1e6,
    }

# Llama 70B, rank 16, 4 attention matrices, 80 layers
budget = lora_memory_budget(70, 16, 4, 80, 8192)
# base_model_gb: 140.0
# adapter_total_gb: 1.34
# total_gb: 141.34
# adapter_params_millions: 83.9

LoRA Memory Budget (Llama 70B)

| Component | Size | Percentage |
|---|---|---|
| Base model (FP16, frozen) | 140.0 GB | 99.05% |
| Adapter weights (FP16) | 0.168 GB | 0.12% |
| Adapter gradients (FP16) | 0.168 GB | 0.12% |
| Adapter optimizer (FP32, m+v+master) | 1.006 GB | 0.71% |
| Total | 141.34 GB | 100% |

LoRA reduces the trainable state from 1.12 TB (full fine-tuning) to 1.34 GB. The base model still requires 140 GB, but it needs no gradients or optimizer states.

Implementation

import math
import torch

class LoRALinear(torch.nn.Module):
    """LoRA-adapted linear layer."""

    def __init__(self, base_linear, rank, alpha):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # Freeze the base weights
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank

        d_out, d_in = base_linear.weight.shape
        # A: [rank, d_in], initialized with Kaiming uniform
        self.lora_A = torch.nn.Parameter(torch.empty(rank, d_in))
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

        # B: [d_out, rank], initialized to zero so delta_W = BA = 0 at start
        self.lora_B = torch.nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Base forward: the weights are frozen (requires_grad=False), but the
        # graph is kept so gradients can still flow through the base path to
        # reach earlier layers' LoRA adapters
        base_out = self.base(x)

        # LoRA forward
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T
        return base_out + self.scaling * lora_out

Rank Selection

The rank $r$ controls the expressiveness-efficiency tradeoff:

LoRA Quality vs Rank (Llama 7B, Alpaca Eval)

| Rank | Win rate | Note |
|---|---|---|
| r=1 | 72.1% | Underfitting |
| r=4 | 78.3% | |
| r=8 | 81.6% | |
| r=16 | 83.2% | |
| r=32 | 83.8% | |
| r=64 | 84.0% | Diminishing returns |
| r=128 | 84.1% | |
| Full FT | 85.2% | Full fine-tuning |

The gap between $r = 16$ and full fine-tuning is 2.0 percentage points. Between $r = 16$ and $r = 128$ it is 0.9 points. The standard recommendation: start with $r = 16$, increase to 64 only if quality is insufficient.

2. DoRA: Weight-Decomposed Low-Rank Adaptation

The Insight

DoRA (Liu et al., 2024) observes that LoRA’s updates to $W$ change both the magnitude and direction of the weight vectors simultaneously. Full fine-tuning tends to change direction more than magnitude. DoRA decouples the two: apply LoRA only to the direction component.

Decompose each column $w_i$ of $W$ (or equivalently, each row, depending on convention) into magnitude and direction:

W = m \cdot \frac{V}{\|V\|_c}

where $m \in \mathbb{R}^{d_{\text{out}}}$ is a learnable magnitude vector (one scalar per output dimension), $V$ is the direction matrix, and $\|V\|_c$ denotes the column-wise norm.

The fine-tuned weight becomes:

W' = m' \cdot \frac{V + \Delta V}{\|V + \Delta V\|_c}

where $\Delta V = BA$ (standard LoRA applied to the direction matrix $V$), and $m'$ is the fine-tuned magnitude vector.

Implementation

import math
import torch

class DoRALinear(torch.nn.Module):
    """DoRA: Weight-Decomposed Low-Rank Adaptation."""

    def __init__(self, base_linear, rank, alpha):
        super().__init__()
        self.rank = rank
        self.scaling = alpha / rank

        d_out, d_in = base_linear.weight.shape
        W = base_linear.weight.data  # [d_out, d_in]

        # Decompose the pretrained weight into magnitude and direction.
        # "Column-wise" norm in the paper's convention; with the
        # [d_out, d_in] layout here, that is the norm of each row.
        norms = torch.norm(W, dim=1, keepdim=True)  # [d_out, 1]

        # m: learnable magnitude vector, initialized from pretrained norms
        self.magnitude = torch.nn.Parameter(norms.squeeze(1))  # [d_out]

        # V: direction matrix. It is NOT a parameter; it is the frozen
        # base weight, registered as a buffer so it moves with the module
        self.register_buffer("base_weight", W)

        # LoRA adapters for the direction update delta_V = B @ A
        self.lora_A = torch.nn.Parameter(torch.empty(rank, d_in))
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B = torch.nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Compute V + delta_V
        delta_V = self.lora_B @ self.lora_A  # [d_out, d_in]
        V_prime = self.base_weight + self.scaling * delta_V

        # Normalize: keep only the direction
        V_prime_norm = torch.norm(V_prime, dim=1, keepdim=True)  # [d_out, 1]
        V_prime_normalized = V_prime / V_prime_norm.clamp(min=1e-8)

        # Reapply the learnable magnitude
        W_prime = self.magnitude.unsqueeze(1) * V_prime_normalized  # [d_out, d_in]

        return x @ W_prime.T

DoRA vs LoRA: Parameter Overhead

DoRA adds $d_{\text{out}}$ learnable parameters per adapted matrix (the magnitude vector $m'$). For $d_{\text{out}} = 8192$:

\text{DoRA extra params per matrix} = 8{,}192

\text{LoRA params per matrix at } r = 16: 262{,}144

\text{DoRA overhead} = \frac{8{,}192}{262{,}144} = 3.1\%

The magnitude vector is negligible. DoRA’s memory cost is essentially identical to LoRA.
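The overhead arithmetic is easy to verify in a couple of lines:

```python
d_out, rank = 8192, 16
dora_extra = d_out               # one magnitude scalar per output dimension
lora_params = 2 * d_out * rank   # A and B at rank 16
print(dora_extra, lora_params, dora_extra / lora_params)
# 8192 262144 0.03125  (~3.1%)
```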


DoRA vs LoRA Quality Comparison

| Benchmark | LoRA (r=16) | DoRA (r=16) | Full FT | DoRA Gain over LoRA |
|---|---|---|---|---|
| Alpaca Eval (Llama-7B) | 83.2% | 83.7% | 85.2% | +0.5% |
| MMLU (Llama-7B) | 52.8% | 53.4% | 54.1% | +0.6% |
| GSM8K (Llama-7B) | 38.4% | 39.1% | 40.2% | +0.7% |
| HumanEval (Code Llama-7B) | 32.9% | 33.6% | 35.4% | +0.7% |
| Avg across 8 benchmarks | | | | +0.5% |

DoRA consistently outperforms LoRA by 0.5-0.7% across benchmarks at the same rank. The gain is small but free — no additional memory, minimal compute overhead (one extra normalization per forward pass).

When DoRA Matters

The 0.5% average gain sounds small, but on tasks where direction change dominates (math reasoning, code generation), DoRA’s advantage can reach 0.7-1.0%. The magnitude-direction decomposition better preserves the pretrained model’s learned feature scales while allowing the direction (which encodes semantic relationships) to adapt. For instruction-following tasks where the change from pretrained is mostly stylistic (small direction change, large magnitude change), LoRA and DoRA perform similarly.

3. QLoRA: Quantized Low-Rank Adaptation

The Key Idea

QLoRA (Dettmers et al., 2023) quantizes the frozen base model to 4-bit NF4 (Normal Float 4-bit) and keeps the LoRA adapters in FP16. The base model shrinks from 2 bytes per parameter to 0.5 bytes, reducing the dominant memory term by 4x.

For Llama 70B:

\text{Base model (FP16)} = 70 \times 10^9 \times 2 = 140 \text{ GB}

\text{Base model (NF4)} = 70 \times 10^9 \times 0.5 = 35 \text{ GB}

Adding LoRA adapters (1.34 GB) and activations (~5-10 GB):

\text{QLoRA total} \approx 35 + 1.34 + 7 = 43.34 \text{ GB}

A 70B model fine-tuned on a single 48 GB GPU (A6000 or L40S). That is the practical impact.
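Putting those numbers together as a rough budget helper (the 7 GB activation figure is an assumed mid-range value, and the NF4 scale factors are ignored here):

```python
def qlora_memory_gb(params_billions, adapter_params_millions,
                    activations_gb=7.0):
    """Rough QLoRA budget: NF4 base (0.5 bytes/param) + FP16 adapters with
    FP16 grads and FP32 Adam state (16 bytes/adapter param) + activations.
    Ignores NF4 scale-factor storage (~0.5-1 GB for a 70B model)."""
    base_nf4 = params_billions * 1e9 * 0.5 / 1e9
    adapter_state = adapter_params_millions * 1e6 * 16 / 1e9
    return base_nf4 + adapter_state + activations_gb

print(qlora_memory_gb(70, 84))  # ~43.3 GB -> fits a 48 GB GPU
```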

NF4 Quantization

NF4 is not uniform quantization. It assumes the weight distribution is approximately normal (which empirical evidence supports for transformer weights). The 16 quantization levels are spaced according to the quantiles of a standard normal distribution:

def compute_nf4_levels():
    """
    Compute the 16 NF4 quantization levels (Dettmers et al., 2023).
    The levels are quantiles of N(0, 1), constructed asymmetrically
    (8 non-negative, 7 negative, plus exact zero) so that zero is
    exactly representable.
    """
    import scipy.stats as stats

    offset = 0.9677083  # Tail probability from the QLoRA reference implementation
    # 8 positive levels: quantiles from `offset` down toward 0.5 (exclusive)
    pos = [stats.norm.ppf(offset - i * (offset - 0.5) / 8) for i in range(8)]
    # 7 negative levels: mirrored quantiles on a slightly coarser grid
    neg = [-stats.norm.ppf(offset - i * (offset - 0.5) / 7) for i in range(7)]

    levels = sorted(neg + [0.0] + pos)
    # Normalize so the maximum absolute value is 1
    max_abs = max(abs(l) for l in levels)
    return [l / max_abs for l in levels]

NF4_LEVELS = compute_nf4_levels()
# Approximately: [-1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
#                  0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0]

Each weight is quantized to the nearest NF4 level, then stored as a 4-bit index. The dequantization lookup table holds the 16 level values; each quantization group (typically 64 or 128 weights) shares one scale factor:

from typing import NamedTuple

class QuantizedTensor(NamedTuple):
    indices: torch.Tensor   # 4-bit level indices (kept unpacked in uint8 here)
    scales: torch.Tensor    # One absmax scale per group
    group_size: int
    shape: torch.Size       # Original tensor shape, for dequantization

def quantize_nf4(tensor, group_size=64):
    """Quantize a tensor to NF4 format (tail elements beyond a full
    group are dropped in this sketch)."""
    flat = tensor.reshape(-1)
    num_groups = len(flat) // group_size
    flat = flat[:num_groups * group_size].reshape(num_groups, group_size)

    # Per-group absmax scaling
    scales = flat.abs().max(dim=1, keepdim=True).values  # [num_groups, 1]
    normalized = flat / scales.clamp(min=1e-8)            # in [-1, 1]

    # Find the nearest NF4 level for each element
    nf4 = torch.tensor(NF4_LEVELS, dtype=tensor.dtype)
    # distances[g][j][k] = |normalized[g][j] - nf4[k]|
    distances = (normalized.unsqueeze(-1) - nf4).abs()  # [groups, group_size, 16]
    indices = distances.argmin(dim=-1)                  # 4-bit indices in [0..15]

    # Real implementations pack two 4-bit indices per byte;
    # one uint8 per index is kept here for clarity
    return QuantizedTensor(
        indices=indices.to(torch.uint8),
        scales=scales.squeeze(1),
        group_size=group_size,
        shape=tensor.shape,
    )

def dequantize_nf4(qtensor):
    """Dequantize NF4 back to floating point for the forward pass."""
    nf4 = torch.tensor(NF4_LEVELS, dtype=torch.float16)
    values = nf4[qtensor.indices.long()]  # Lookup: index -> NF4 level
    values = values * qtensor.scales.unsqueeze(1).to(values.dtype)  # Rescale per group
    return values.reshape(-1)

Double Quantization

QLoRA applies a second level of quantization to the scale factors themselves. Each group’s FP32 scale factor is quantized to FP8, reducing the scale storage from 4 bytes to 1 byte per group. For group size 64:

\text{Scale overhead (FP32)} = \frac{4 \text{ bytes}}{64 \times 0.5 \text{ bytes}} = 12.5\%

\text{Scale overhead (FP8, double quant)} = \frac{1 \text{ byte}}{64 \times 0.5 \text{ bytes}} = 3.1\%

Double quantization saves 9.4% on the scale overhead, which for a 70B model translates to:

\text{Savings} = 70 \times 10^9 \times 0.5 \times 0.094 = 3.29 \text{ GB}
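As a sanity check on those numbers (a back-of-envelope sketch, not the bitsandbytes implementation):

```python
def scale_overhead(group_size=64, weight_bits=4, scale_bytes=4):
    """Per-weight relative overhead of storing one scale per group."""
    weight_bytes = group_size * weight_bits / 8
    return scale_bytes / weight_bytes

fp32 = scale_overhead(scale_bytes=4)          # 0.125   (12.5%)
fp8 = scale_overhead(scale_bytes=1)           # 0.03125 (~3.1%)
savings_gb = 70e9 * 0.5 * (fp32 - fp8) / 1e9  # ~3.28 GB for a 70B model
print(fp32, fp8, savings_gb)
```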

QLoRA Forward Pass

class QLoRALinear:
    """QLoRA: 4-bit NF4 base + FP16 LoRA adapters."""

    def __init__(self, base_weight_quantized, rank, alpha):
        self.base_q = base_weight_quantized  # NF4 QuantizedTensor
        self.rank = rank
        self.scaling = alpha / rank

        d_out, d_in = base_weight_quantized.shape
        self.lora_A = torch.nn.Parameter(
            torch.empty(rank, d_in, dtype=torch.float16)
        )
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B = torch.nn.Parameter(
            torch.zeros(d_out, rank, dtype=torch.float16)
        )

    def forward(self, x):
        # Dequantize base weight on-the-fly (NF4 -> FP16)
        W_base = dequantize_nf4(self.base_q).reshape(
            self.base_q.shape
        ).to(x.dtype)

        # Base forward
        base_out = x @ W_base.T

        # LoRA forward (FP16 throughout)
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T

        return base_out + self.scaling * lora_out
ℹ️ Dequantization Cost

Dequantizing NF4 to FP16 during the forward pass adds compute overhead. For each weight element: one table lookup (4-bit index to FP16 value) and one multiplication (by the group scale). On H100, this adds approximately 5-10% to the forward pass time compared to FP16 base weights. The memory savings (4x) far outweigh the compute cost.

QLoRA Memory Layout (Llama 70B)

| Component | Detail | Size | Note |
|---|---|---|---|
| Base model (NF4, frozen) | 70B params at 0.5 bytes/param | 35 GB | Read-only, no gradients |
| NF4 scale factors (FP8) | 1 scale per 64 weights | 0.57 GB | Double-quantized |
| LoRA adapters (FP16) | 84M params at 2 bytes/param | 0.168 GB | Trainable |
| LoRA gradients (FP16) | 84M params at 2 bytes/param | 0.168 GB | Computed each step |
| LoRA optimizer (FP32) | 84M params x 12 bytes (m + v + master) | 1.006 GB | AdamW state |
| Activations + batch data | Depends on batch size and seq len | 5-10 GB | Typical |

Memory Comparison: Full FT vs LoRA vs QLoRA (Llama 70B)

| Method | Base Model | Trainable State | Total | Min GPUs (80 GB) |
|---|---|---|---|---|
| Full Fine-Tuning (FP16+FP32) | 140 GB | 980 GB (grads + optim) | 1,120 GB | 16 (ZeRO-3) |
| LoRA (FP16 base, r=16) | 140 GB | 1.34 GB | 141.34 GB | 2 |
| QLoRA (NF4 base, r=16) | 35.6 GB | 1.34 GB | 43.34 GB | 1 |
| QLoRA (NF4 base, r=64) | 35.6 GB | 5.37 GB | 47.37 GB | 1 |

4. GaLore: Gradient Low-Rank Projection

The Key Idea

GaLore (Zhao et al., 2024) takes a fundamentally different approach. Instead of constraining the weight update to low-rank (like LoRA), it constrains the gradient to low-rank during training but updates the full weight matrix. At inference, there is no adapter — the weights are the full fine-tuned weights.

The gradient matrix $G \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ at each step is projected into a low-rank subspace:

\tilde{G} = P^T G Q

where $P \in \mathbb{R}^{d_{\text{out}} \times r}$ and $Q \in \mathbb{R}^{d_{\text{in}} \times r}$ are projection matrices obtained from the SVD of $G$ (computed periodically, not every step).

The optimizer operates on the projected gradient $\tilde{G} \in \mathbb{R}^{r \times r}$ instead of the full $G \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$. This reduces the optimizer state from $O(d^2)$ to $O(r^2)$ per weight matrix.
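The shapes can be checked with a small synthetic example (dimensions shrunk for illustration):

```python
import torch

d_out, d_in, r = 512, 256, 16
G = torch.randn(d_out, d_in)  # gradient matrix

U, S, Vt = torch.linalg.svd(G, full_matrices=False)
P = U[:, :r]      # [d_out, r]
Q = Vt[:r, :].T   # [d_in, r]

G_proj = P.T @ G @ Q        # [r, r]: optimizer state lives in this space
delta_W = P @ G_proj @ Q.T  # [d_out, d_in]: update projected back to full space

print(G_proj.shape, delta_W.shape)  # torch.Size([16, 16]) torch.Size([512, 256])
```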

Memory Analysis

For a weight matrix of dimensions $d_{\text{out}} \times d_{\text{in}}$ with rank $r$:

Full fine-tuning optimizer state:

\text{Adam state} = d_{\text{out}} \times d_{\text{in}} \times (4 + 4) = 8 \cdot d_{\text{out}} \cdot d_{\text{in}} \text{ bytes (FP32 } m + v \text{)}

GaLore optimizer state:

\text{Adam state} = r \times r \times (4 + 4) = 8r^2 \text{ bytes}

For $d_{\text{out}} = d_{\text{in}} = 8192$ and $r = 128$:

\text{Full: } 8 \times 8192 \times 8192 = 512 \text{ MB per matrix}

\text{GaLore: } 8 \times 128 \times 128 = 128 \text{ KB per matrix}

A $4{,}096\times$ reduction in optimizer state per weight matrix.
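In bytes:

```python
full_state = 8 * 8192 * 8192   # 536,870,912 bytes = 512 MiB per matrix
galore_state = 8 * 128 * 128   # 131,072 bytes = 128 KiB per matrix
print(full_state // galore_state)  # 4096
```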

Implementation

class GaLoreOptimizer:
    """
    GaLore: Gradient Low-Rank Projection for memory-efficient training.
    Updates full weights but stores optimizer state in low-rank form.
    """

    def __init__(self, params, lr, rank, svd_interval=200):
        self.lr = lr
        self.rank = rank
        self.svd_interval = svd_interval  # Recompute projections every N steps
        self.step_count = 0

        # Per-parameter state
        self.state = {}
        for p in params:
            self.state[p] = {
                "P": None,           # Left projection [d_out, rank]
                "Q": None,           # Right projection [d_in, rank]
                "m": None,           # First moment in projected space [rank, rank]
                "v": None,           # Second moment in projected space [rank, rank]
                "last_svd_step": -1,
            }

    def step(self):
        self.step_count += 1
        for p in self.state:
            if p.grad is None:
                continue

            s = self.state[p]
            G = p.grad.data  # [d_out, d_in]

            # Periodically recompute projection matrices via SVD
            if (self.step_count - s["last_svd_step"]) >= self.svd_interval:
                U, S_vals, Vt = torch.linalg.svd(G, full_matrices=False)
                s["P"] = U[:, :self.rank]       # [d_out, rank]
                s["Q"] = Vt[:self.rank, :].T    # [d_in, rank]
                s["last_svd_step"] = self.step_count

                # Reset optimizer state when subspace changes
                s["m"] = torch.zeros(self.rank, self.rank, device=p.device)
                s["v"] = torch.zeros(self.rank, self.rank, device=p.device)

            # Project gradient: G_proj = P^T @ G @ Q  [rank, rank]
            G_proj = s["P"].T @ G @ s["Q"]

            # Adam update in projected space
            s["m"] = 0.9 * s["m"] + 0.1 * G_proj
            s["v"] = 0.999 * s["v"] + 0.001 * G_proj ** 2
            m_hat = s["m"] / (1 - 0.9 ** self.step_count)
            v_hat = s["v"] / (1 - 0.999 ** self.step_count)
            update_proj = m_hat / (v_hat.sqrt() + 1e-8)

            # Project back to full space: delta_W = P @ update_proj @ Q^T
            delta_W = s["P"] @ update_proj @ s["Q"].T

            # Update full weight
            p.data -= self.lr * delta_W

GaLore vs LoRA: Key Differences


GaLore vs LoRA Structural Comparison

| Property | LoRA | GaLore |
|---|---|---|
| What is low-rank? | The weight update (delta_W = BA) | The gradient projection |
| Final weights | W + BA (adapter required) | Full W (no adapter) |
| Inference overhead | Extra matmul per layer (or merge) | None |
| Where memory is saved | Fewer trainable params -> smaller optim state | Smaller projected optim state |
| SVD computation | None | Every N steps (expensive) |
| Serving complexity | Adapter loading/switching | Standard model serving |
⚠️ GaLore's SVD Cost

Computing the SVD of the gradient matrix every 200 steps is expensive. For an $8192 \times 8192$ matrix, truncated SVD to rank 128 takes approximately 200 ms on an H100. Across 560 weight matrices (4 attention + 3 FFN per layer, 80 layers for a ~70B model if applied to all), that is $560 \times 0.2 = 112$ seconds every 200 steps, or 0.56 seconds per step amortized. For a training step that takes 5-10 seconds, this is 6-11% overhead. Acceptable, but not free.

GaLore Memory Budget

def galore_memory_budget(
    model_params_billions,
    rank,
    num_adapted_matrices_per_layer,
    num_layers,
    d_model,
):
    base_bytes = model_params_billions * 1e9 * 2  # FP16 weights
    gradient_bytes = model_params_billions * 1e9 * 2  # FP16 gradients (full)

    # Optimizer state: rank x rank per adapted matrix
    optim_per_matrix = rank * rank * (4 + 4)  # m + v in FP32
    total_optim = num_adapted_matrices_per_layer * num_layers * optim_per_matrix

    # Projection matrices: P [d_out, rank] + Q [d_in, rank] per matrix, FP16
    proj_per_matrix = 2 * d_model * rank * 2
    total_proj = num_adapted_matrices_per_layer * num_layers * proj_per_matrix

    return {
        "base_model_gb": base_bytes / 1e9,
        "gradients_gb": gradient_bytes / 1e9,
        "optimizer_gb": total_optim / 1e9,
        "projections_gb": total_proj / 1e9,
        "total_gb": (base_bytes + gradient_bytes + total_optim + total_proj) / 1e9,
    }

# Llama 70B, rank 128, 7 matrices per layer, 80 layers
budget = galore_memory_budget(70, 128, 7, 80, 8192)
# base_model_gb: 140.0
# gradients_gb: 140.0
# optimizer_gb: 0.073 GB  (vs 980 GB for full Adam!)
# projections_gb: 2.35 GB
# total_gb: 282.42 GB

GaLore’s total memory is 282 GB — far less than full fine-tuning’s 1,120 GB but more than LoRA’s 141 GB. The bottleneck is that GaLore still needs full-precision gradients for the SVD projection. This requires the full gradient tensor to be materialized, even though the optimizer only operates on the projected version.

5. LISA: Layerwise Importance Sampled Adaptation

The Key Idea

LISA (Pan et al., 2024) observes that not all layers need to be updated every training step. At each step, LISA randomly selects a small subset of layers to unfreeze (compute gradients and update weights), while all other layers remain frozen. This is conceptually simple: randomly skip most of the backward pass.

The algorithm:

  1. At each training step, sample $k$ layers uniformly at random from the $L$ total layers.
  2. Only compute gradients for the $k$ selected layers.
  3. Update weights of the selected layers using standard Adam.
  4. The remaining $L - k$ layers contribute to the forward pass but not the backward pass.

Implementation

import random

class LISATrainer:
    """
    LISA: Layerwise Importance Sampled Adaptation.
    Randomly freeze most layers each step.
    """

    def __init__(self, model, optimizer, num_active_layers, total_layers):
        self.model = model
        self.optimizer = optimizer
        self.num_active = num_active_layers  # k
        self.total_layers = total_layers      # L
        # Always unfreeze: embedding layer and output head
        self.always_active = ["embed_tokens", "lm_head"]

    def train_step(self, batch):
        # Step 1: Randomly select k layers to unfreeze
        layer_indices = list(range(self.total_layers))
        active_indices = set(random.sample(layer_indices, self.num_active))

        # Step 2: Set requires_grad based on selection
        for idx, layer in enumerate(self.model.layers):
            for param in layer.parameters():
                param.requires_grad = (idx in active_indices)

        # Always-active components
        for name in self.always_active:
            module = getattr(self.model, name)
            for param in module.parameters():
                param.requires_grad = True

        # Step 3: Forward pass (all layers participate)
        outputs = self.model(**batch)
        loss = outputs.loss

        # Step 4: Backward pass (only active layers compute gradients)
        loss.backward()

        # Step 5: Optimizer step (PyTorch skips params whose .grad is None).
        # To keep Adam state bounded to k layers, optimizer state for
        # deselected layers must be discarded when the selection changes.
        self.optimizer.step()
        self.optimizer.zero_grad()

        return loss.item()

Memory Analysis

With $k$ active layers out of $L$ total, LISA stores optimizer states for only $k$ layers’ worth of parameters. For Llama 70B with $L = 80$ and $k = 2$:

\text{Active params} = \frac{k}{L} \times N = \frac{2}{80} \times 70\text{B} = 1.75\text{B}

\text{Optimizer state} = 1.75 \times 10^9 \times 12 = 21 \text{ GB}

But LISA must also store FP16 gradients for the active layers and the always-active components:

\text{Gradient memory} = 1.75 \times 10^9 \times 2 = 3.5 \text{ GB}

Total:

\text{LISA total} = 140 \text{ (base)} + 21 \text{ (optim)} + 3.5 \text{ (grads)} = 164.5 \text{ GB}
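The same arithmetic as a helper (a rough sketch that ignores the always-active embedding and output head, which add a few GB more):

```python
def lisa_memory_gb(params_billions, num_layers, k,
                   base_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough LISA budget: frozen FP16 base plus FP16 gradients and
    FP32 Adam state for the k currently active layers out of num_layers."""
    n = params_billions * 1e9
    active = n * k / num_layers
    base = n * base_bytes
    return (base + active * (grad_bytes + optim_bytes)) / 1e9

print(lisa_memory_gb(70, 80, 2))  # 164.5
print(lisa_memory_gb(70, 80, 4))  # 189.0
```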

ℹ️ LISA's Hidden Advantage: No Adapter at Inference

LISA updates the full model weights (for the sampled layers). At inference, there is no adapter to load, merge, or switch. The model serves exactly as a standard model. This eliminates the operational complexity of adapter management, which matters in production deployments serving hundreds of fine-tuned variants.

LISA vs LoRA Quality

LISA vs LoRA vs Full FT Quality (Llama 7B)

| Method | Alpaca Eval win rate |
|---|---|
| LoRA r=16 | 83.2% |
| LISA k=2 | 82.8% |
| LISA k=4 | 83.9% |
| LISA k=8 | 84.5% |
| Full FT | 85.2% |

LISA with $k = 4$ (5% of layers active) slightly outperforms LoRA at $r = 16$. With $k = 8$ (10% of layers), it approaches full fine-tuning quality. The key tradeoff: LISA’s memory is higher than LoRA’s (164.5 GB at $k = 2$ vs 141 GB for 70B) but far lower than full fine-tuning’s (1,120 GB), and it produces standard model weights with no inference overhead.

6. Comprehensive Comparison


Method Comparison (Llama 70B, Standard Settings)

| Method | Memory | Quality (relative) | Inference Overhead | Adapter at Inference? |
|---|---|---|---|---|
| Full FT | 1,120 GB | 100% (baseline) | None | No |
| LoRA r=16 | 141 GB | 97.6% | +2-5% (extra matmul) | Yes (or merge) |
| DoRA r=16 | 141 GB | 98.2% | +3-6% (extra matmul + norm) | Yes (or merge) |
| QLoRA r=16 | 43 GB | 96.8% | +2-5% (if merged to FP16) | Yes (or merge) |
| GaLore r=128 | 282 GB | 99.1% | None | No |
| LISA k=4 | 189 GB | 98.5% | None | No |

Memory Breakdown by Method (Llama 70B)

| Method | Total | Breakdown | Notes |
|---|---|---|---|
| Full FT | 1,120 GB | Weights (140) + Grads (140) + Optim (840) | 16 bytes/param. Needs 16+ GPUs (80 GB each). |
| LoRA r=16 | 141 GB | Frozen base (140) + Adapter train state (1.34) | 2 GPUs. 0.12% trainable params. |
| QLoRA r=16 | 43 GB | NF4 base (35.6) + Adapter train state (1.34) + Activations (6.4) | 1 GPU (48 GB). 4-bit base model. |
| GaLore r=128 | 282 GB | Base (140) + Full grads (140) + Projected optim (0.07) + Proj matrices (2.35) | 4 GPUs. Full-weight updates, no adapter. |
| LISA k=4 | 189 GB | Base (140) + Partial grads (7) + Partial optim (42) | 3 GPUs. Random layer selection. |

7. Decision Matrix

The method selection depends on four factors: available GPU memory, quality requirements, whether you need adapter-based serving (multi-tenant), and training throughput requirements.

def select_finetuning_method(
    model_size_billions,
    available_gpu_memory_gb,
    num_gpus,
    quality_requirement,       # "maximum", "high", "acceptable"
    multi_tenant_serving,      # True if serving many fine-tuned variants
    training_throughput_priority,  # "high", "medium", "low"
):
    """
    Decision function for selecting PEFT method.
    """
    total_memory_gb = available_gpu_memory_gb * num_gpus
    base_fp16_gb = model_size_billions * 2  # GB

    # Can we even fit the base model?
    if total_memory_gb < base_fp16_gb * 0.3:
        return "Model too large for available hardware"

    # QLoRA: when memory is extremely constrained
    base_nf4_gb = model_size_billions * 0.5
    if total_memory_gb < base_fp16_gb + 5:
        if total_memory_gb >= base_nf4_gb + 10:
            return "QLoRA"
        return "Insufficient memory"

    # Multi-tenant serving: LoRA/DoRA (adapter-based)
    if multi_tenant_serving:
        if quality_requirement == "maximum":
            return "DoRA"
        return "LoRA"

    # Maximum quality, sufficient memory for GaLore
    full_ft_gb = model_size_billions * 16
    galore_gb = base_fp16_gb * 2 + 20  # Rough estimate
    lisa_gb = base_fp16_gb + 25         # Rough estimate

    if quality_requirement == "maximum":
        if total_memory_gb >= full_ft_gb:
            return "Full Fine-Tuning"
        if total_memory_gb >= galore_gb:
            return "GaLore"
        if total_memory_gb >= lisa_gb:
            return "LISA k=8"
        return "DoRA"

    if quality_requirement == "high":
        if total_memory_gb >= lisa_gb:
            return "LISA k=4"
        return "DoRA"

    # Acceptable quality, minimize cost
    if total_memory_gb >= base_nf4_gb + 10:
        return "QLoRA"
    return "LoRA"

Decision Matrix Summary

| Scenario | Recommended Method | Reason |
|---|---|---|
| 1 GPU (48 GB), 70B model | QLoRA | Only method that fits |
| 2 GPUs (80 GB), 70B, multi-tenant | LoRA or DoRA | Adapter-based serving needed |
| 4 GPUs (80 GB), 70B, max quality, no adapters | GaLore or LISA | Full-weight update, no inference overhead |
| 16+ GPUs, 70B, max quality | Full Fine-Tuning | Best quality, sufficient memory |
| 1 GPU (24 GB), 7B model | QLoRA | Memory constrained |
| 1 GPU (80 GB), 7B model, multi-tenant | LoRA | 14 GB base + 0.3 GB adapter; plenty of room |
| 8 GPUs (80 GB), 70B, fast iteration | LoRA | Fastest training (fewest backward params) |

Training Throughput

Training speed varies significantly across methods because the backward pass cost scales with the number of parameters that require gradients:

Training Throughput (Llama 7B, 1x A100 80GB, batch=4)

| Method | Throughput (% of Full FT) | Note |
|---|---|---|
| Full FT (baseline) | 100% | ZeRO-3 needed |
| LoRA r=16 | 160% | 60% faster |
| DoRA r=16 | 145% | 45% faster |
| QLoRA r=16 | 130% | Dequantization overhead |
| GaLore r=128 | 85% | SVD overhead |
| LISA k=4 | 140% | 40% faster |

LoRA is the fastest because it computes gradients for only 0.12% of parameters. The backward pass for frozen layers still occurs (to propagate gradients to earlier LoRA modules), but no weight updates or optimizer steps are needed for frozen parameters.

GaLore is slower than full fine-tuning because of the SVD computation. LISA is faster than full fine-tuning because only $k/L$ of the layers compute weight gradients (though activation gradients still propagate backward through all layers to reach the earliest active one).

8. Combining Methods

Methods can be combined:

QLoRA + DoRA: Apply DoRA’s magnitude-direction decomposition on top of a 4-bit quantized base model. The magnitude vector is FP16 (8192 extra params per matrix), the direction LoRA adapters are FP16, and the base model is NF4. Memory cost is nearly identical to QLoRA with a ~0.5% quality improvement.

LISA + LoRA: Randomly select which layers get full-weight updates (LISA), apply LoRA to the remaining layers. This gives the selected layers full expressiveness while keeping memory bounded. Memory is similar to LISA alone, but the LoRA adapters on non-selected layers provide continuous adaptation.

GaLore + Quantized Base: Apply GaLore’s gradient projection with an INT8-quantized base model. The base model shrinks from 140 GB to 70 GB (INT8), and the optimizer state remains tiny (r×rr \times r per matrix). This brings GaLore to approximately 210 GB for a 70B model — closer to the 3-GPU range.

💡 Practical Recommendation for 2025-2026

For most production fine-tuning in 2025: use QLoRA for experimentation (fast iteration on 1 GPU), DoRA for production quality (2+ GPUs), and GaLore when you need full-weight updates without adapters. LISA is the right choice when you have moderate GPU memory and need standard model weights at inference (no adapter management overhead).

Reviewer Agent Validation

Challenge: Using only this post, implement a minimal LoRALinear class and demonstrate that at initialization ($B = 0$), the output equals the base model’s output exactly. Then apply one gradient step and show the output diverges.

Expected test:

import torch

d_in, d_out, rank = 64, 64, 4
base = torch.nn.Linear(d_in, d_out, bias=False)
lora = LoRALinear(base, rank=rank, alpha=rank)

x = torch.randn(1, d_in)

# At init: B=0, so LoRA contribution is zero
out_base = base(x)
out_lora = lora.forward(x)
assert torch.allclose(out_base, out_lora, atol=1e-6), (
    f"At init, LoRA output should equal base output. "
    f"Diff: {(out_base - out_lora).abs().max().item()}"
)

# Simulate one gradient step: set B to non-zero
with torch.no_grad():
    lora.lora_B.fill_(0.1)

out_after = lora.forward(x)
diff = (out_after - out_base).abs().max().item()
assert diff > 0.001, (
    f"After updating B, output should diverge from base. Diff: {diff}"
)
# The difference should be approximately: scaling * ||B @ A @ x||
# With B=0.1 (all entries), A ~ Kaiming init, x ~ N(0,1):
# expected diff ~ (alpha/rank) * 0.1 * sqrt(d_in) * sqrt(rank) ~ O(1)

If the Reviewer Agent can implement LoRALinear with correct zero-initialization of $B$, Kaiming initialization of $A$, the scaling factor $\alpha / r$, and the forward pass $W_0 x + (\alpha/r) BAx$, and both assertions pass, the LoRA formulation was explained with sufficient precision.