Full fine-tuning a 70B parameter model requires storing the model weights, the gradients, and the optimizer states in memory simultaneously. With AdamW (the standard optimizer for LLM fine-tuning), each parameter requires:
- 2 bytes for the FP16 weight
- 2 bytes for the FP16 gradient
- 4 bytes for the FP32 master weight (mixed precision)
- 4 bytes for the FP32 first moment (Adam $m$)
- 4 bytes for the FP32 second moment (Adam $v$)

Total: 16 bytes per parameter. For 70 billion parameters:

$$70 \times 10^9 \times 16 \text{ bytes} = 1.12 \text{ TB}$$
No single GPU has 1.12 TB of memory. An 8-GPU node with 80 GB H100s provides 640 GB total. Even with DeepSpeed ZeRO Stage 3 sharding across all 8 GPUs, each GPU holds 140 GB of state — still exceeding the 80 GB limit. Full fine-tuning a 70B model requires at minimum 16 H100s (1.28 TB total), and realistically 32 GPUs after accounting for activations and batch data.
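This accounting is easy to verify in a few lines (an idealized sketch; real deployments also need activation memory and communication buffers on top of the sharded state):

```python
BYTES_PER_PARAM = {
    "fp16_weight": 2,
    "fp16_grad": 2,
    "fp32_master": 4,
    "fp32_adam_m": 4,
    "fp32_adam_v": 4,
}

def full_ft_memory_gb(params_billions):
    """Total training state for full fine-tuning with mixed-precision AdamW."""
    per_param = sum(BYTES_PER_PARAM.values())  # 16 bytes
    return params_billions * per_param         # GB, since params are in billions

total_gb = full_ft_memory_gb(70)   # 1120 GB = 1.12 TB
per_gpu_gb = total_gb / 8          # idealized ZeRO-3 shard across 8 GPUs
print(total_gb, per_gpu_gb)        # 1120 140.0
```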
Parameter-efficient fine-tuning (PEFT) methods reduce the memory requirement by orders of magnitude. This post covers five methods — LoRA, DoRA, QLoRA, GaLore, and LISA — with the math, the memory budgets, the quality tradeoffs, and a decision matrix for when to use each.
1. LoRA: Low-Rank Adaptation
The Math
A pretrained weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$ is frozen. The fine-tuned weight is:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$, with rank $r \ll \min(d_{out}, d_{in})$.

During the forward pass:

$$h = Wx + BAx$$

$B$ is initialized to zeros, $A$ is initialized with Kaiming uniform. This ensures $\Delta W = BA = 0$ at the start of training — the model begins from the pretrained behavior.

A scaling factor $\alpha / r$ is applied:

$$h = Wx + \frac{\alpha}{r} BAx$$

Typically $\alpha = r$ (so the effective scale is 1.0), but $\alpha$ can be tuned independently. Higher $\alpha$ amplifies the adapter's contribution; lower $\alpha$ keeps the model closer to the pretrained weights.
Parameter Count
For Llama 70B with $d_{model} = 8192$, LoRA is typically applied to the attention weight matrices: $W_Q$, $W_K$, $W_V$, $W_O$. Each has dimensions $8192 \times 8192$ (ignoring GQA for now).

Per adapted matrix at rank $r = 16$:

$$r \times (d_{in} + d_{out}) = 16 \times (8192 + 8192) = 262{,}144 \text{ parameters}$$

Across 4 attention matrices, 80 layers:

$$262{,}144 \times 4 \times 80 = 83.9\text{M parameters}$$

That is 0.12% of the base model's 70B parameters.
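These counts are quick to verify (assuming square $8192 \times 8192$ attention matrices, as above):

```python
d_model, rank, matrices_per_layer, layers = 8192, 16, 4, 80

params_per_matrix = 2 * d_model * rank          # A: r x d, plus B: d x r
total = params_per_matrix * matrices_per_layer * layers
print(params_per_matrix)    # 262144
print(total / 1e6)          # 83.88608 (~83.9M)
print(100 * total / 70e9)   # ~0.12 (% of the 70B base)
```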
Memory Budget
```python
def lora_memory_budget(
    model_params_billions,
    rank,
    num_adapted_matrices_per_layer,
    num_layers,
    d_model,
    base_dtype_bytes=2,     # FP16
    adapter_dtype_bytes=2,  # FP16
):
    """
    Calculate total memory for LoRA fine-tuning.
    """
    # Base model: frozen, no gradients or optimizer states
    base_model_bytes = model_params_billions * 1e9 * base_dtype_bytes
    # LoRA adapter parameters
    params_per_matrix = 2 * d_model * rank
    total_adapter_params = (
        num_adapted_matrices_per_layer * num_layers * params_per_matrix
    )
    # Adapter weights (FP16)
    adapter_weight_bytes = total_adapter_params * adapter_dtype_bytes
    # Adapter gradients (FP16)
    adapter_grad_bytes = total_adapter_params * adapter_dtype_bytes
    # Optimizer states: master weights (FP32) + m (FP32) + v (FP32)
    adapter_optimizer_bytes = total_adapter_params * (4 + 4 + 4)
    total_adapter_bytes = (
        adapter_weight_bytes + adapter_grad_bytes + adapter_optimizer_bytes
    )
    return {
        "base_model_gb": base_model_bytes / 1e9,
        "adapter_total_gb": total_adapter_bytes / 1e9,
        "total_gb": (base_model_bytes + total_adapter_bytes) / 1e9,
        "adapter_params_millions": total_adapter_params / 1e6,
    }

# Llama 70B, rank 16, 4 attention matrices, 80 layers
budget = lora_memory_budget(70, 16, 4, 80, 8192)
# base_model_gb: 140.0
# adapter_total_gb: 1.34
# total_gb: 141.34
# adapter_params_millions: 83.9
```
LoRA Memory Budget (Llama 70B)
| Component | Size | Percentage |
|---|---|---|
| Base model (FP16, frozen) | 140.0 GB | 99.05% |
| Adapter weights (FP16) | 0.168 GB | 0.12% |
| Adapter gradients (FP16) | 0.168 GB | 0.12% |
| Adapter optimizer (FP32, m+v+master) | 1.006 GB | 0.71% |
| Total | 141.34 GB | 100% |
LoRA reduces the trainable state from 1.12 TB (full fine-tuning) to 1.34 GB. The base model still requires 140 GB, but it needs no gradients or optimizer states.
Implementation
```python
import math
import torch

class LoRALinear(torch.nn.Module):
    """LoRA-adapted linear layer."""

    def __init__(self, base_linear, rank, alpha):
        super().__init__()
        self.base = base_linear
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        d_out, d_in = base_linear.weight.shape
        # Freeze the base weights. Gradients still flow *through* the base
        # path to earlier layers, but the base weights are never updated.
        for param in self.base.parameters():
            param.requires_grad = False
        # A: [rank, d_in], initialized with Kaiming uniform
        self.lora_A = torch.nn.Parameter(torch.empty(rank, d_in))
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B: [d_out, rank], initialized to zero
        self.lora_B = torch.nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Base forward (weights frozen via requires_grad=False)
        base_out = self.base(x)
        # LoRA forward
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T
        return base_out + self.scaling * lora_out
```
Rank Selection
The rank $r$ controls the expressiveness-efficiency tradeoff:
LoRA Quality vs Rank (Llama 7B, Alpaca Eval, % win rate)

The gap between $r = 16$ and full fine-tuning is 2.0 percentage points. Between $r = 64$ and full fine-tuning it is 0.9 points. The standard recommendation: start with $r = 16$, and increase to 64 only if quality is insufficient.
2. DoRA: Weight-Decomposed Low-Rank Adaptation
The Insight
DoRA (Liu et al., 2024) observes that LoRA's updates to $W$ change both the magnitude and the direction of the weight vectors simultaneously. Full fine-tuning tends to change direction more than magnitude. DoRA decouples the two: apply LoRA only to the direction component.
Decompose each column of $W$ (or equivalently, each row, depending on convention) into magnitude and direction:

$$W = m \, \frac{V}{\|V\|_c}$$

where $m \in \mathbb{R}^{d_{out}}$ is a learnable magnitude vector (one scalar per output dimension), $V$ is the direction matrix, and $\|\cdot\|_c$ denotes the column-wise norm.

The fine-tuned weight becomes:

$$W' = m' \, \frac{V + \Delta V}{\|V + \Delta V\|_c}$$

where $\Delta V = BA$ (standard LoRA applied to the direction matrix $V$), and $m'$ is the fine-tuned magnitude vector.
Implementation
```python
import math
import torch

class DoRALinear(torch.nn.Module):
    """DoRA: Weight-Decomposed Low-Rank Adaptation."""

    def __init__(self, base_linear, rank, alpha):
        super().__init__()
        self.rank = rank
        self.scaling = alpha / rank
        d_out, d_in = base_linear.weight.shape
        W = base_linear.weight.data  # [d_out, d_in]
        # Decompose the pretrained weight into magnitude and direction.
        # Column-wise norm in the math convention = per-row norm in the
        # [d_out, d_in] storage layout.
        col_norms = torch.norm(W, dim=1, keepdim=True)  # [d_out, 1]
        # m: learnable magnitude vector, initialized from pretrained norms
        self.magnitude = torch.nn.Parameter(col_norms.squeeze())  # [d_out]
        # V: direction matrix (normalized columns of W).
        # V is NOT a parameter; it's derived from the frozen base weight.
        self.register_buffer("base_weight", W)  # Frozen
        # LoRA adapters for the direction update delta_V = B @ A
        self.lora_A = torch.nn.Parameter(torch.empty(rank, d_in))
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B = torch.nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Compute V + delta_V
        delta_V = self.lora_B @ self.lora_A  # [d_out, d_in]
        V_prime = self.base_weight + self.scaling * delta_V
        # Normalize: direction only
        V_prime_norm = torch.norm(V_prime, dim=1, keepdim=True)  # [d_out, 1]
        V_prime_normalized = V_prime / V_prime_norm.clamp(min=1e-8)
        # Apply magnitude
        W_prime = self.magnitude.unsqueeze(1) * V_prime_normalized  # [d_out, d_in]
        return x @ W_prime.T
```
DoRA vs LoRA: Parameter Overhead
DoRA adds $d_{out}$ learnable parameters per adapted matrix (the magnitude vector $m$). For $d_{out} = 8192$:

$$\frac{8{,}192}{262{,}144} \approx 3.1\% \text{ extra parameters relative to the rank-16 LoRA adapters}$$

The magnitude vector is negligible. DoRA's memory cost is essentially identical to LoRA's.
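A quick check of that claim, assuming the same 4 adapted matrices per layer and 80 layers as the LoRA budget above:

```python
d_out, layers, matrices = 8192, 80, 4
lora_params_per_matrix = 2 * 8192 * 16   # rank-16 LoRA: A + B
dora_extra_per_matrix = d_out            # one magnitude scalar per output dim

total_extra = dora_extra_per_matrix * matrices * layers
ratio = dora_extra_per_matrix / lora_params_per_matrix
print(total_extra / 1e6)   # 2.62144 (~2.6M extra params model-wide)
print(100 * ratio)         # 3.125 (% overhead relative to the LoRA adapters)
```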
DoRA vs LoRA Quality Comparison
| Benchmark | LoRA (r=16) | DoRA (r=16) | Full FT | DoRA Gain over LoRA |
|---|---|---|---|---|
| Alpaca Eval (Llama-7B) | 83.2% | 83.7% | 85.2% | +0.5% |
| MMLU (Llama-7B) | 52.8% | 53.4% | 54.1% | +0.6% |
| GSM8K (Llama-7B) | 38.4% | 39.1% | 40.2% | +0.7% |
| HumanEval (Code Llama-7B) | 32.9% | 33.6% | 35.4% | +0.7% |
| Avg across 8 benchmarks | — | — | — | +0.5% |
DoRA consistently outperforms LoRA by 0.5-0.7% across benchmarks at the same rank. The gain is small but free — no additional memory, minimal compute overhead (one extra normalization per forward pass).
The 0.5% average gain sounds small, but on tasks where direction change dominates (math reasoning, code generation), DoRA’s advantage can reach 0.7-1.0%. The magnitude-direction decomposition better preserves the pretrained model’s learned feature scales while allowing the direction (which encodes semantic relationships) to adapt. For instruction-following tasks where the change from pretrained is mostly stylistic (small direction change, large magnitude change), LoRA and DoRA perform similarly.
3. QLoRA: Quantized Low-Rank Adaptation
The Key Idea
QLoRA (Dettmers et al., 2023) quantizes the frozen base model to 4-bit NF4 (Normal Float 4-bit) and keeps the LoRA adapters in FP16. The base model shrinks from 2 bytes per parameter to 0.5 bytes, reducing the dominant memory term by 4x.
For Llama 70B:

$$70 \times 10^9 \times 0.5 \text{ bytes} = 35 \text{ GB} \quad (\approx 35.6 \text{ GB with scale-factor overhead})$$

Adding LoRA adapters (1.34 GB) and activations (~5-10 GB):

$$35.6 + 1.34 + (5 \text{ to } 10) \approx 42\text{--}47 \text{ GB}$$
A 70B model fine-tuned on a single 48 GB GPU (A6000 or L40S). That is the practical impact.
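A rough budget check (a sketch: the ~1.7% scale overhead is back-solved from the 35.6 GB figure above, and activation memory varies with batch size and sequence length):

```python
def qlora_total_gb(params_billions, adapter_gb=1.34, activations_gb=8.0):
    """NF4 base + FP16 adapters + activations (rough, assumption-laden)."""
    # ~0.5 bytes/param for NF4, plus ~1.7% scale-factor overhead
    # (inferred from the 35.6 GB figure for 70B)
    base_gb = params_billions * 0.5 * 1.017
    return base_gb + adapter_gb + activations_gb

print(round(qlora_total_gb(70), 1))  # 44.9 -- fits on a 48 GB GPU
```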
NF4 Quantization
NF4 is not uniform quantization. It assumes the weight distribution is approximately normal (which empirical evidence supports for transformer weights). The 16 quantization levels are spaced according to the quantiles of a standard normal distribution:
```python
def compute_nf4_levels():
    """
    Compute the 16 NF4 quantization levels.
    These are the quantiles of N(0, 1) that divide the distribution
    into equal-probability bins, normalized to [-1, 1].
    (Simplified sketch: the reference NF4 table in bitsandbytes is
    asymmetric and includes an exact zero level.)
    """
    import scipy.stats as stats
    # 2^4 = 16 levels: half negative, half positive
    levels = []
    for i in range(8):
        # Negative half
        q = (2 * i + 1) / (2 * 16)
        levels.append(stats.norm.ppf(q))
    for i in range(8):
        # Positive half
        q = 0.5 + (2 * i + 1) / (2 * 16)
        levels.append(stats.norm.ppf(q))
    levels = sorted(levels)
    # Normalize so the maximum absolute value is 1
    max_abs = max(abs(l) for l in levels)
    levels = [l / max_abs for l in levels]
    return levels

NF4_LEVELS = compute_nf4_levels()
# Approximately: [-1.0, -0.71, -0.54, -0.42, -0.31, -0.22, -0.13, -0.04,
#                  0.04,  0.13,  0.22,  0.31,  0.42,  0.54,  0.71,  1.0]
```
Each weight is quantized to the nearest NF4 level, then stored as a 4-bit index. The dequantization lookup table holds the 16 level values; each quantization group (typically 64 or 128 weights) shares one scale factor:
```python
from dataclasses import dataclass
import torch

@dataclass
class QuantizedTensor:
    """Minimal container for the quantized data (a simple sketch)."""
    indices: torch.Tensor  # 4-bit level indices (stored here as uint8)
    scales: torch.Tensor   # Per-group absmax scale factors
    group_size: int
    shape: tuple           # Original tensor shape

def quantize_nf4(tensor, group_size=64):
    """Quantize a tensor to NF4 format (assumes numel divisible by group_size)."""
    flat = tensor.reshape(-1)
    num_groups = len(flat) // group_size
    flat = flat[:num_groups * group_size].reshape(num_groups, group_size)
    # Per-group absmax scaling
    scales = flat.abs().max(dim=1, keepdim=True).values  # [num_groups, 1]
    normalized = flat / scales.clamp(min=1e-8)  # in [-1, 1]
    # Find the nearest NF4 level for each element
    nf4 = torch.tensor(NF4_LEVELS, dtype=tensor.dtype)
    # distances[i][j] = |normalized[i] - nf4[j]|
    distances = (normalized.unsqueeze(-1) - nf4.unsqueeze(0).unsqueeze(0)).abs()
    indices = distances.argmin(dim=-1)  # 4-bit indices in [0..15]
    return QuantizedTensor(
        # One index per byte here; real implementations pack two per byte
        indices=indices.to(torch.uint8),
        scales=scales.squeeze(1),
        group_size=group_size,
        shape=tuple(tensor.shape),
    )

def dequantize_nf4(qtensor):
    """Dequantize NF4 back to floating point for the forward pass."""
    nf4 = torch.tensor(NF4_LEVELS, dtype=torch.float16)
    values = nf4[qtensor.indices.long()]           # Lookup: index -> NF4 level
    values = values * qtensor.scales.unsqueeze(1)  # Rescale per group
    return values.reshape(-1)
```
Double Quantization
QLoRA applies a second level of quantization to the scale factors themselves. Each group's FP32 scale factor is quantized to FP8, reducing the scale storage from 4 bytes to 1 byte per group. For group size 64:

$$\frac{32 \text{ bits}}{64} = 0.5 \text{ bits/param} \quad\rightarrow\quad \frac{8 \text{ bits}}{64} + \frac{32 \text{ bits}}{64 \times 256} \approx 0.127 \text{ bits/param}$$

Double quantization saves roughly 0.37 bits per parameter (about 9.4% of the quantized base model's size), which for a 70B model translates to:

$$70 \times 10^9 \times 0.373 \text{ bits} \,/\, 8 \approx 3.3 \text{ GB}$$
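The overhead arithmetic can be sketched as follows (the second-level group size of 256 is an assumption following the QLoRA paper's setup):

```python
def scale_bits_per_param(group_size=64, scale_bits=32,
                         dq_scale_bits=8, dq_group=256, dq_outer_bits=32):
    """Scale-factor overhead in bits per weight, without and with
    double quantization (DQ)."""
    plain = scale_bits / group_size
    double = dq_scale_bits / group_size + dq_outer_bits / (group_size * dq_group)
    return plain, double

plain, double = scale_bits_per_param()
print(plain)                                  # 0.5 bits/param
print(round(double, 3))                       # 0.127 bits/param
saved_gb = 70e9 * (plain - double) / 8 / 1e9  # savings on a 70B model
print(round(saved_gb, 2))                     # 3.26 GB
```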
QLoRA Forward Pass
```python
import math
import torch

class QLoRALinear(torch.nn.Module):
    """QLoRA: 4-bit NF4 base + FP16 LoRA adapters."""

    def __init__(self, base_weight_quantized, rank, alpha):
        super().__init__()
        self.base_q = base_weight_quantized  # NF4 QuantizedTensor
        self.rank = rank
        self.scaling = alpha / rank
        d_out, d_in = base_weight_quantized.shape
        self.lora_A = torch.nn.Parameter(
            torch.empty(rank, d_in, dtype=torch.float16)
        )
        torch.nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B = torch.nn.Parameter(
            torch.zeros(d_out, rank, dtype=torch.float16)
        )

    def forward(self, x):
        # Dequantize the base weight on the fly (NF4 -> FP16)
        W_base = dequantize_nf4(self.base_q).reshape(
            self.base_q.shape
        ).to(x.dtype)
        # Base forward
        base_out = x @ W_base.T
        # LoRA forward (FP16 throughout)
        lora_out = (x @ self.lora_A.T) @ self.lora_B.T
        return base_out + self.scaling * lora_out
```
Dequantizing NF4 to FP16 during the forward pass adds compute overhead. For each weight element: one table lookup (4-bit index to FP16 value) and one multiplication (by the group scale). On H100, this adds approximately 5-10% to the forward pass time compared to FP16 base weights. The memory savings (4x) far outweigh the compute cost.
QLoRA Memory Layout (Llama 70B)
Memory Comparison: Full FT vs LoRA vs QLoRA (Llama 70B)
| Method | Base Model | Trainable State | Total | Min GPUs (80GB) |
|---|---|---|---|---|
| Full Fine-Tuning (FP16+FP32) | 140 GB | 980 GB (grads+optim) | 1,120 GB | 16 (ZeRO-3) |
| LoRA (FP16 base, r=16) | 140 GB | 1.34 GB | 141.34 GB | 2 |
| QLoRA (NF4 base, r=16) | 35.6 GB | 1.34 GB | 43.34 GB (incl. ~6.4 GB activations) | 1 |
| QLoRA (NF4 base, r=64) | 35.6 GB | 5.37 GB | 47.37 GB (incl. ~6.4 GB activations) | 1 |
4. GaLore: Gradient Low-Rank Projection
The Key Idea
GaLore (Zhao et al., 2024) takes a fundamentally different approach. Instead of constraining the weight update to low-rank (like LoRA), it constrains the gradient to low-rank during training but updates the full weight matrix. At inference, there is no adapter — the weights are the full fine-tuned weights.
The gradient matrix $G_t \in \mathbb{R}^{d_{out} \times d_{in}}$ at each step is projected into a low-rank subspace:

$$\tilde{G}_t = P^\top G_t Q$$

where $P \in \mathbb{R}^{d_{out} \times r}$ and $Q \in \mathbb{R}^{d_{in} \times r}$ are projection matrices obtained from the SVD of $G_t$ (computed periodically, not every step).

The optimizer operates on the projected gradient $\tilde{G}_t \in \mathbb{R}^{r \times r}$ instead of the full $G_t$. This reduces the optimizer state from $O(d_{out} \times d_{in})$ to $O(r^2)$ per weight matrix.
Memory Analysis
For a weight matrix of dimensions $8192 \times 8192$ with rank $r = 128$:

Full fine-tuning optimizer state (Adam $m$ + $v$, FP32):

$$8192 \times 8192 \times (4 + 4) \text{ bytes} = 537 \text{ MB}$$

GaLore optimizer state:

$$128 \times 128 \times (4 + 4) \text{ bytes} = 0.13 \text{ MB}$$

plus the FP16 projection matrices $P$ and $Q$ at $2 \times 8192 \times 128 \times 2 \text{ bytes} = 4.2$ MB.

A roughly $4{,}000\times$ reduction in optimizer state per weight matrix (about $100\times$ once the projection matrices are counted).
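A quick per-matrix comparison for the same shapes:

```python
d, r = 8192, 128

full_state_mb = d * d * (4 + 4) / 1e6      # Adam m + v, FP32, full matrix
galore_state_mb = r * r * (4 + 4) / 1e6    # m + v in the projected space
proj_mb = 2 * d * r * 2 / 1e6              # P and Q, FP16

print(round(full_state_mb, 1))    # 536.9
print(round(galore_state_mb, 3))  # 0.131
print(round(proj_mb, 2))          # 4.19
print(round(full_state_mb / (galore_state_mb + proj_mb)))  # 124
```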
Implementation
```python
import torch

class GaLoreOptimizer:
    """
    GaLore: Gradient Low-Rank Projection for memory-efficient training.
    Updates full weights but stores optimizer state in low-rank form.
    """

    def __init__(self, params, lr, rank, svd_interval=200):
        self.lr = lr
        self.rank = rank
        self.svd_interval = svd_interval  # Recompute projections every N steps
        self.step_count = 0
        # Per-parameter state
        self.state = {}
        for p in params:
            self.state[p] = {
                "P": None,   # Left projection [d_out, rank]
                "Q": None,   # Right projection [d_in, rank]
                "m": None,   # First moment in projected space [rank, rank]
                "v": None,   # Second moment in projected space [rank, rank]
                "last_svd_step": -1,
            }

    def step(self):
        self.step_count += 1
        for p in self.state:
            if p.grad is None:
                continue
            s = self.state[p]
            G = p.grad.data  # [d_out, d_in]
            # Periodically recompute projection matrices via SVD
            if (self.step_count - s["last_svd_step"]) >= self.svd_interval:
                U, S_vals, Vt = torch.linalg.svd(G, full_matrices=False)
                s["P"] = U[:, :self.rank]     # [d_out, rank]
                s["Q"] = Vt[:self.rank, :].T  # [d_in, rank]
                s["last_svd_step"] = self.step_count
                # Reset optimizer state when the subspace changes
                s["m"] = torch.zeros(self.rank, self.rank, device=p.device)
                s["v"] = torch.zeros(self.rank, self.rank, device=p.device)
            # Project gradient: G_proj = P^T @ G @ Q  -> [rank, rank]
            G_proj = s["P"].T @ G @ s["Q"]
            # Adam update in projected space
            s["m"] = 0.9 * s["m"] + 0.1 * G_proj
            s["v"] = 0.999 * s["v"] + 0.001 * G_proj ** 2
            m_hat = s["m"] / (1 - 0.9 ** self.step_count)
            v_hat = s["v"] / (1 - 0.999 ** self.step_count)
            update_proj = m_hat / (v_hat.sqrt() + 1e-8)
            # Project back to full space: delta_W = P @ update_proj @ Q^T
            delta_W = s["P"] @ update_proj @ s["Q"].T
            # Update the full weight
            p.data -= self.lr * delta_W
```
GaLore vs LoRA: Key Differences
GaLore vs LoRA Structural Comparison
| Property | LoRA | GaLore |
|---|---|---|
| What is low-rank? | The weight update (delta_W = BA) | The gradient projection |
| Final weights | W + BA (adapter required) | Full W (no adapter) |
| Inference overhead | Extra matmul per layer (or merge) | None |
| Where memory is saved | Fewer trainable params -> smaller optim state | Smaller projected optim state |
| SVD computation | None | Every N steps (expensive) |
| Serving complexity | Adapter loading/switching | Standard model serving |
Computing the SVD of the gradient matrix every 200 steps is expensive. For an $8192 \times 8192$ matrix, truncated SVD to rank 128 takes approximately 200 ms on an H100. Across 320 attention weight matrices (4 per layer, 80 layers for a ~70B model; more if the FFN matrices are included), that is 64 seconds every 200 steps, or 0.32 seconds per step amortized. For a training step that takes 5-10 seconds, this is 3-6% overhead. Acceptable, but not free.
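The amortization works out as follows (the 200 ms per-SVD figure is the rough estimate above, not a benchmark):

```python
def svd_overhead_fraction(num_matrices=320, svd_ms=200.0,
                          svd_interval=200, step_time_s=7.5):
    """Amortized SVD refresh cost as a fraction of the training step time."""
    refresh_s = num_matrices * svd_ms / 1000.0  # 64 s per refresh
    amortized_s = refresh_s / svd_interval      # 0.32 s per step
    return amortized_s / step_time_s

print(round(svd_overhead_fraction(step_time_s=10.0), 3))  # 0.032 (~3%)
print(round(svd_overhead_fraction(step_time_s=5.0), 3))   # 0.064 (~6%)
```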
GaLore Memory Budget
```python
def galore_memory_budget(
    model_params_billions,
    rank,
    num_adapted_matrices_per_layer,
    num_layers,
    d_model,
):
    base_bytes = model_params_billions * 1e9 * 2      # FP16 weights
    gradient_bytes = model_params_billions * 1e9 * 2  # FP16 gradients (full)
    # Optimizer state: rank x rank per adapted matrix
    optim_per_matrix = rank * rank * (4 + 4)  # m + v in FP32
    total_optim = num_adapted_matrices_per_layer * num_layers * optim_per_matrix
    # Projection matrices: P [d_out, rank] + Q [d_in, rank] per matrix, FP16
    proj_per_matrix = 2 * d_model * rank * 2
    total_proj = num_adapted_matrices_per_layer * num_layers * proj_per_matrix
    return {
        "base_model_gb": base_bytes / 1e9,
        "gradients_gb": gradient_bytes / 1e9,
        "optimizer_gb": total_optim / 1e9,
        "projections_gb": total_proj / 1e9,
        "total_gb": (base_bytes + gradient_bytes + total_optim + total_proj) / 1e9,
    }

# Llama 70B, rank 128, 7 matrices per layer, 80 layers
budget = galore_memory_budget(70, 128, 7, 80, 8192)
# base_model_gb: 140.0
# gradients_gb: 140.0
# optimizer_gb: 0.073 (vs 980 GB for full Adam!)
# projections_gb: 2.35
# total_gb: 282.42
```
GaLore’s total memory is 282 GB — far less than full fine-tuning’s 1,120 GB but more than LoRA’s 141 GB. The bottleneck is that GaLore still needs the full gradient tensors for the SVD projection. Each gradient must be materialized at full size, even though the optimizer only operates on the projected version.
5. LISA: Layerwise Importance Sampled Adaptation
The Key Idea
LISA (Pan et al., 2024) observes that not all layers need to be updated every training step. At each step, LISA randomly selects a small subset of layers to unfreeze (compute gradients and update weights), while all other layers remain frozen. This is conceptually simple: randomly skip most of the backward pass.
The algorithm:
- At each training step, sample $k$ layers uniformly at random from the $L$ total layers.
- Only compute gradients for the selected layers.
- Update weights of the selected layers using standard Adam.
- The remaining layers contribute to the forward pass but not the backward pass.
Implementation
```python
import random

class LISATrainer:
    """
    LISA: Layerwise Importance Sampled Adaptation.
    Randomly freeze most layers each step.
    """

    def __init__(self, model, optimizer, num_active_layers, total_layers):
        self.model = model
        self.optimizer = optimizer
        self.num_active = num_active_layers  # k
        self.total_layers = total_layers     # L
        # Always unfreeze: embedding layer and output head
        self.always_active = ["embed_tokens", "lm_head"]

    def train_step(self, batch):
        # Step 1: Randomly select k layers to unfreeze
        layer_indices = list(range(self.total_layers))
        active_indices = set(random.sample(layer_indices, self.num_active))
        # Step 2: Set requires_grad based on the selection
        for idx, layer in enumerate(self.model.layers):
            for param in layer.parameters():
                param.requires_grad = (idx in active_indices)
        # Always-active components
        for name in self.always_active:
            module = getattr(self.model, name)
            for param in module.parameters():
                param.requires_grad = True
        # Step 3: Forward pass (all layers participate)
        outputs = self.model(**batch)
        loss = outputs.loss
        # Step 4: Backward pass (only active layers accumulate weight gradients)
        loss.backward()
        # Step 5: Optimizer step (only updates params with gradients)
        self.optimizer.step()
        self.optimizer.zero_grad()
        return loss.item()
```
Memory Analysis
With $k$ active layers out of $L$ total, LISA stores optimizer states for only the $k/L$ fraction of parameters that is trainable at any given step. For Llama 70B with $k = 4$ and $L = 80$, that is roughly $4/80 \times 70\text{B} = 3.5$B parameters.

LISA must also store FP16 gradients for the active layers and the always-active components (embeddings and output head): roughly $3.5 \times 10^9 \times 2 = 7$ GB.

Total: 140 GB for the frozen FP16 base plus roughly 25 GB of gradient and optimizer state for the active layers, or about 165 GB.
LISA updates the full model weights (for the sampled layers). At inference, there is no adapter to load, merge, or switch. The model serves exactly as a standard model. This eliminates the operational complexity of adapter management, which matters in production deployments serving hundreds of fine-tuned variants.
LISA vs LoRA Quality
LISA vs LoRA vs Full FT Quality (Llama 7B, % Alpaca Eval win rate)

LISA with $k = 4$ (5% of layers active) slightly outperforms LoRA at $r = 16$. With $k = 8$ (10% of layers), it approaches full fine-tuning quality. The key tradeoff: LISA's memory is higher than LoRA's (165 GB vs 141 GB for 70B) but far lower than full fine-tuning's (1,120 GB), and it produces standard model weights with no inference overhead.
6. Comprehensive Comparison
Method Comparison (Llama 70B, Standard Settings)
| Method | Memory | Quality (relative) | Inference Overhead | Adapter at Inference? |
|---|---|---|---|---|
| Full FT | 1,120 GB | 100% (baseline) | None | No |
| LoRA r=16 | 141 GB | 97.6% | +2-5% (extra matmul) | Yes (or merge) |
| DoRA r=16 | 141 GB | 98.2% | +3-6% (extra matmul + norm) | Yes (or merge) |
| QLoRA r=16 | 43 GB | 96.8% | +2-5% (if merged to FP16) | Yes (or merge) |
| GaLore r=128 | 282 GB | 99.1% | None | No |
| LISA k=4 | 165 GB | 98.5% | None | No |
Memory Breakdown by Method (Llama 70B)
7. Decision Matrix
The method selection depends on four factors: available GPU memory, quality requirements, whether you need adapter-based serving (multi-tenant), and training throughput requirements.
```python
def select_finetuning_method(
    model_size_billions,
    available_gpu_memory_gb,
    num_gpus,
    quality_requirement,           # "maximum", "high", "acceptable"
    multi_tenant_serving,          # True if serving many fine-tuned variants
    training_throughput_priority,  # "high", "medium", "low"
):
    """
    Decision function for selecting a PEFT method.
    """
    total_memory_gb = available_gpu_memory_gb * num_gpus
    base_fp16_gb = model_size_billions * 2  # GB
    # Can we even fit the base model?
    if total_memory_gb < base_fp16_gb * 0.3:
        return "Model too large for available hardware"
    # QLoRA: when memory is extremely constrained
    base_nf4_gb = model_size_billions * 0.5
    if total_memory_gb < base_fp16_gb + 5:
        if total_memory_gb >= base_nf4_gb + 10:
            return "QLoRA"
        return "Insufficient memory"
    # Multi-tenant serving: LoRA/DoRA (adapter-based)
    if multi_tenant_serving:
        if quality_requirement == "maximum":
            return "DoRA"
        return "LoRA"
    # Maximum quality, sufficient memory for GaLore
    full_ft_gb = model_size_billions * 16
    galore_gb = base_fp16_gb * 2 + 20  # Rough estimate
    lisa_gb = base_fp16_gb + 25        # Rough estimate
    if quality_requirement == "maximum":
        if total_memory_gb >= full_ft_gb:
            return "Full Fine-Tuning"
        if total_memory_gb >= galore_gb:
            return "GaLore"
        if total_memory_gb >= lisa_gb:
            return "LISA k=8"
        return "DoRA"
    if quality_requirement == "high":
        if total_memory_gb >= lisa_gb:
            return "LISA k=4"
        return "DoRA"
    # Acceptable quality, minimize cost
    if total_memory_gb >= base_nf4_gb + 10:
        return "QLoRA"
    return "LoRA"
```
Decision Matrix Summary
| Scenario | Recommended Method | Reason |
|---|---|---|
| 1 GPU (48GB), 70B model | QLoRA | Only method that fits |
| 2 GPUs (80GB), 70B, multi-tenant | LoRA or DoRA | Adapter-based serving needed |
| 4 GPUs (80GB), 70B, max quality, no adapters | GaLore or LISA | Full-weight update, no inference overhead |
| 16+ GPUs, 70B, max quality | Full Fine-Tuning | Best quality, sufficient memory |
| 1 GPU (24GB), 7B model | QLoRA | Memory constrained |
| 1 GPU (80GB), 7B model, multi-tenant | LoRA | 14 GB base + 0.3 GB adapter; plenty of room |
| 8 GPUs (80GB), 70B, fast iteration | LoRA | Fastest training (fewest backward params) |
Training Throughput
Training speed varies significantly across methods because the backward pass cost scales with the number of parameters that require gradients:
Training Throughput (Llama 7B, 1x A100 80GB, batch=4, % of Full FT throughput)

LoRA is the fastest because it computes gradients for only 0.12% of parameters. The backward pass for frozen layers still occurs (to propagate activation gradients to earlier LoRA modules), but no weight gradients or optimizer steps are needed for frozen parameters.
GaLore is slower than full fine-tuning because of the SVD computation. LISA is faster than full fine-tuning because only $k/L$ of the layers accumulate weight gradients each step (though activation gradients still propagate through all layers).
8. Combining Methods
Methods can be combined:
QLoRA + DoRA: Apply DoRA’s magnitude-direction decomposition on top of a 4-bit quantized base model. The magnitude vector is FP16 (8192 extra params per matrix), the direction LoRA adapters are FP16, and the base model is NF4. Memory cost is nearly identical to QLoRA with a ~0.5% quality improvement.
LISA + LoRA: Randomly select which layers get full-weight updates (LISA), apply LoRA to the remaining layers. This gives the selected layers full expressiveness while keeping memory bounded. Memory is similar to LISA alone, but the LoRA adapters on non-selected layers provide continuous adaptation.
GaLore + Quantized Base: Apply GaLore’s gradient projection with an INT8-quantized base model. The base model shrinks from 140 GB to 70 GB (INT8), and the optimizer state remains tiny ($O(r^2)$ per matrix). This brings GaLore to approximately 210 GB for a 70B model — closer to the 3-GPU range.
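Rough totals for two of these combinations, reusing the budget numbers from earlier sections (back-of-the-envelope, not measured):

```python
def combo_memory_gb():
    """Rough totals for QLoRA+DoRA and GaLore+INT8 on a 70B model."""
    # NF4 base + rank-16 LoRA state + DoRA magnitude vectors (~2.6M params)
    qlora_dora = 35.6 + 1.34 + 0.005
    # INT8 base + full FP16 gradients + P,Q projections + projected Adam state
    galore_int8 = 70 + 140 + 2.35 + 0.07
    return qlora_dora, galore_int8

qd, gi = combo_memory_gb()
print(round(qd, 1))   # 36.9
print(round(gi, 1))   # 212.4 (close to the ~210 GB quoted)
```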
For most production fine-tuning in 2025: use QLoRA for experimentation (fast iteration on 1 GPU), DoRA for production quality (2+ GPUs), and GaLore when you need full-weight updates without adapters. LISA is the right choice when you have moderate GPU memory and need standard model weights at inference (no adapter management overhead).
Reviewer Agent Validation
Challenge: Using only this post, implement a minimal LoRALinear class and demonstrate that at initialization ($B = 0$), the output equals the base model’s output exactly. Then apply one gradient step and show the output diverges.
Expected test:
```python
import torch

d_in, d_out, rank = 64, 64, 4
base = torch.nn.Linear(d_in, d_out, bias=False)
lora = LoRALinear(base, rank=rank, alpha=rank)
x = torch.randn(1, d_in)

# At init: B = 0, so the LoRA contribution is zero
out_base = base(x)
out_lora = lora.forward(x)
assert torch.allclose(out_base, out_lora, atol=1e-6), (
    f"At init, LoRA output should equal base output. "
    f"Diff: {(out_base - out_lora).abs().max().item()}"
)

# Simulate one gradient step: set B to non-zero
with torch.no_grad():
    lora.lora_B.fill_(0.1)
out_after = lora.forward(x)
diff = (out_after - out_base).abs().max().item()
assert diff > 0.001, (
    f"After updating B, output should diverge from base. Diff: {diff}"
)
# The difference should be approximately: scaling * ||B @ A @ x||
# With B = 0.1 (all entries), A ~ Kaiming init, x ~ N(0, 1):
# expected diff ~ (alpha/rank) * 0.1 * sqrt(d_in) * sqrt(rank) ~ O(1)
```
If the Reviewer Agent can implement LoRALinear with correct zero-initialization of $B$, Kaiming initialization of $A$, the scaling factor $\alpha/r$, and the forward pass $h = Wx + \frac{\alpha}{r}BAx$, and both assertions pass, the LoRA formulation was explained with sufficient precision.