Model merging combines multiple fine-tuned models into a single model without any additional training. You have a model fine-tuned for code generation, another fine-tuned for math reasoning, and a third fine-tuned for creative writing. Merging produces a single model that performs all three tasks. No GPU hours. No training data. Just weight arithmetic.
This works because fine-tuning a pretrained model moves its weights by only a small amount, typically less than 1% of the original magnitude. These small perturbations (task vectors) encode task-specific knowledge and can be composed through addition, averaging, or more sophisticated methods like TIES and DARE.
This post covers every major merging technique with exact mathematics, implementation code, and empirical results showing when each method works and when it fails.
1. Linear Weight Averaging
1.1 Definition
The simplest merge: linearly interpolate the weights of two models.
$$\theta_{\text{merged}} = \alpha\,\theta_A + (1 - \alpha)\,\theta_B$$

where $\theta_A$ and $\theta_B$ are the weight tensors of models $A$ and $B$, and $\alpha \in [0, 1]$ controls the interpolation.
import torch
import copy
def linear_merge(model_a, model_b, alpha=0.5):
"""Merge two models by linear weight interpolation."""
merged = copy.deepcopy(model_a)
for (name_a, param_a), (name_b, param_b) in zip(
model_a.named_parameters(), model_b.named_parameters()
):
assert name_a == name_b, f"Parameter name mismatch: {name_a} vs {name_b}"
merged_param = alpha * param_a.data + (1 - alpha) * param_b.data
merged.state_dict()[name_a].copy_(merged_param)
return merged
def multi_model_average(models, weights=None):
"""Average N models with optional weights."""
if weights is None:
weights = [1.0 / len(models)] * len(models)
assert abs(sum(weights) - 1.0) < 1e-6, "Weights must sum to 1"
merged = copy.deepcopy(models[0])
for name, param in merged.named_parameters():
weighted_sum = torch.zeros_like(param.data)
for model, w in zip(models, weights):
weighted_sum += w * dict(model.named_parameters())[name].data
param.data.copy_(weighted_sum)
return merged
1.2 When Linear Averaging Works
Linear averaging works when the two models share the same loss basin: they occupy nearby regions of the loss landscape that are connected by a path of low loss. This happens when:
- Both models are fine-tuned from the same base model (same initialization)
- The fine-tuning is relatively short (weights have not diverged far)
- The tasks are not contradictory (code + math is fine; "always say yes" + "always say no" is not)
def verify_same_basin(model_a, model_b, base_model, eval_fn, n_points=11):
"""Check if two models are in the same loss basin by linear interpolation."""
alphas = torch.linspace(0, 1, n_points)
losses = []
for alpha in alphas:
merged = linear_merge(model_a, model_b, alpha=alpha.item())
loss = eval_fn(merged)
losses.append(loss)
print(f"alpha={alpha:.1f}: loss={loss:.4f}")
# If in same basin: losses form a smooth curve, no barrier
# If different basins: losses spike in the middle
max_loss = max(losses)
endpoint_avg = (losses[0] + losses[-1]) / 2
barrier = max_loss - endpoint_avg
print(f"\nLoss barrier: {barrier:.4f}")
if barrier < 0.5:
print("Models are in the same basin -- linear averaging is safe")
else:
print("Models are in different basins -- use task arithmetic instead")
return losses
1.3 When Linear Averaging Fails
Different base models: Merging Llama 3 8B fine-tuned on code with Mistral 7B fine-tuned on math produces garbage. The weight spaces are completely different.
Too much fine-tuning: Models that have been trained for many epochs on large datasets diverge from the base model significantly. The interpolation path crosses high-loss barriers.
Contradictory tasks: If model A learned to always output JSON and model B learned to always output prose, averaging their weights produces a model that outputs broken half-JSON.
Linear averaging requires identical architectures: same number of layers, same hidden dimensions, same vocabulary. You cannot merge a 7B model with a 13B model. You cannot merge models with different tokenizers (different embedding matrices). Even models with the same architecture but different vocabulary sizes (e.g., extended tokenizer) cannot be directly averaged.
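Before attempting any merge, it is worth verifying compatibility programmatically. A minimal sketch (assuming two already-loaded PyTorch modules) that checks parameter names and shapes; note that matching shapes do not guarantee matching tokenizers (see Section 7.3):

def check_merge_compatibility(model_a, model_b):
    """Return True if two models have identical parameter names and shapes."""
    params_a = dict(model_a.named_parameters())
    params_b = dict(model_b.named_parameters())
    if params_a.keys() != params_b.keys():
        mismatched = sorted(params_a.keys() ^ params_b.keys())
        print(f"Parameter name mismatch, e.g.: {mismatched[:3]}")
        return False
    for name, param_a in params_a.items():
        if param_a.shape != params_b[name].shape:
            print(f"Shape mismatch at {name}: "
                  f"{tuple(param_a.shape)} vs {tuple(params_b[name].shape)}")
            return False
    return True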
2. Task Arithmetic
2.1 Task Vectors
A task vector is the difference between a fine-tuned model and the base model:

$$\tau = \theta_{\text{ft}} - \theta_{\text{base}}$$
This vector encodes the knowledge learned during fine-tuning. It is typically sparse (most weights change very little) and small in magnitude (usually less than 1% of the weight values).
def compute_task_vector(finetuned_model, base_model):
"""Compute the task vector (weight difference) between fine-tuned and base."""
task_vector = {}
for (name, ft_param), (_, base_param) in zip(
finetuned_model.named_parameters(), base_model.named_parameters()
):
task_vector[name] = ft_param.data - base_param.data
return task_vector
def task_vector_stats(task_vector, base_model=None):
    """Analyze task vector properties (pass base_model to get the relative magnitude)."""
    total_params = 0
    total_nonzero = 0
    total_magnitude = 0.0
    for name, delta in task_vector.items():
        n = delta.numel()
        total_params += n
        total_nonzero += (delta.abs() > 1e-8).sum().item()
        total_magnitude += delta.abs().sum().item()
    print(f"Total parameters: {total_params:,}")
    print(f"Non-zero deltas: {total_nonzero:,} ({total_nonzero/total_params:.2%})")
    print(f"Mean |delta|: {total_magnitude/total_params:.6f}")
    if base_model is not None:
        # Relative change: total |delta| divided by total |base weight|
        base_magnitude = sum(
            p.data.abs().sum().item() for p in base_model.parameters()
        )
        print(f"Relative magnitude: {total_magnitude/base_magnitude:.4%} of base weights")
2.2 Task Vector Composition
Multiple task vectors can be combined to create a model with multiple capabilities:
$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \sum_{i=1}^{k} \tau_i$$

where $\lambda$ is a scaling factor that controls the strength of the combined task knowledge. This is called task arithmetic because you perform arithmetic operations on task vectors.
def task_arithmetic_merge(base_model, task_vectors, scaling_factor=1.0):
"""Merge by adding scaled task vectors to the base model."""
merged = copy.deepcopy(base_model)
for name, param in merged.named_parameters():
combined_delta = torch.zeros_like(param.data)
for tv in task_vectors:
combined_delta += tv[name]
param.data.add_(scaling_factor * combined_delta)
return merged
def task_arithmetic_merge_weighted(base_model, task_vectors, weights):
"""Merge with per-task-vector weights."""
merged = copy.deepcopy(base_model)
for name, param in merged.named_parameters():
combined_delta = torch.zeros_like(param.data)
for tv, w in zip(task_vectors, weights):
combined_delta += w * tv[name]
param.data.add_(combined_delta)
return merged
# Example: merge code, math, and writing capabilities
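# (load_model is a placeholder for your own checkpoint-loading helper)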
base = load_model("meta-llama/Llama-3-8B")
code_model = load_model("code-llama-8b")
math_model = load_model("math-llama-8b")
writing_model = load_model("writing-llama-8b")
tv_code = compute_task_vector(code_model, base)
tv_math = compute_task_vector(math_model, base)
tv_writing = compute_task_vector(writing_model, base)
# Equal weighting
merged = task_arithmetic_merge(
base, [tv_code, tv_math, tv_writing], scaling_factor=0.8
)
# Custom weighting (more code, less writing)
merged_custom = task_arithmetic_merge_weighted(
base, [tv_code, tv_math, tv_writing], weights=[0.5, 0.3, 0.2]
)
2.3 Task Vector Negation
An interesting property: negating a task vector removes the corresponding capability:

$$\theta_{\text{new}} = \theta - \lambda\,\tau_{\text{task}}$$

This can be used for "unlearning": removing specific knowledge or behaviors from a model.
def negate_task(base_model, task_vector, scaling_factor=1.0):
"""Remove a capability by subtracting its task vector."""
negated = copy.deepcopy(base_model)
for name, param in negated.named_parameters():
param.data.sub_(scaling_factor * task_vector[name])
return negated
# Example: remove toxicity learned during fine-tuning
# toxic_tv = compute_task_vector(toxic_finetuned, base)
# clean_model = negate_task(toxic_finetuned, toxic_tv, scaling_factor=0.5)
2.4 The Scaling Factor Problem
The scaling factor $\lambda$ is critical. Too high, and the merged model overshoots and produces degenerate outputs. Too low, and the merged model retains insufficient task knowledge.
def sweep_scaling_factor(base_model, task_vectors, eval_fn, factors=None):
"""Find optimal scaling factor by grid search."""
if factors is None:
factors = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]
results = {}
for lam in factors:
merged = task_arithmetic_merge(base_model, task_vectors, scaling_factor=lam)
score = eval_fn(merged)
results[lam] = score
print(f"lambda={lam:.1f}: score={score:.4f}")
best_lambda = max(results, key=results.get)
print(f"\nBest scaling factor: {best_lambda}")
return best_lambda
Scaling Factor Effect on Merge Quality
| Lambda | Code (HumanEval) | Math (GSM8K) | Writing (AlpacaEval) | Average |
|---|---|---|---|---|
| 0.2 | 38.4 | 42.1 | 71.2 | 50.6 |
| 0.5 | 45.7 | 55.3 | 78.4 | 59.8 |
| 0.8 | 51.2 | 61.8 | 82.1 | 65.0 |
| 1.0 | 52.1 | 60.5 | 80.3 | 64.3 |
| 1.2 | 48.3 | 54.2 | 75.8 | 59.4 |
| 1.5 | 31.7 | 35.6 | 62.3 | 43.2 |
3. TIES-Merging: Resolving Conflicts
3.1 The Problem with Naive Addition
When adding multiple task vectors, conflicts arise. For a single weight parameter, task vector A might say "increase by 0.05" while task vector B says "decrease by 0.03". Naive addition gives +0.02, but this might be worse than either individual change.
TIES-Merging (Yadav et al., 2023) addresses three specific problems:
- Redundant parameters: Most task vector entries are very small (noise, not signal)
- Sign conflicts: Different tasks pull the same weight in opposite directions
- Magnitude differences: Some tasks have larger magnitudes that dominate the merge
3.2 The TIES Algorithm
TIES has three steps: Trim, Elect Sign, and Disjoint Merge (the acronym comes from TrIm, Elect Sign & Merge).
Step 1: Trim -- Set small-magnitude entries to zero. Only keep the top-$k\%$ largest changes in each task vector.
Step 2: Elect Sign -- For each parameter, count how many task vectors want to increase it vs. decrease it (majority vote on sign).
Step 3: Disjoint Merge -- Average only the task vectors that agree with the elected sign. Discard the rest.
def ties_merge(task_vectors, density=0.2, majority_sign_method="total"):
"""
TIES-Merging: Trim, Elect Sign, Disjoint Merge.
Args:
task_vectors: list of dicts {param_name: delta_tensor}
density: fraction of parameters to keep (top-k% by magnitude)
majority_sign_method: "total" (sum magnitudes) or "count" (count votes)
"""
merged_tv = {}
# Get parameter names from first task vector
param_names = list(task_vectors[0].keys())
for name in param_names:
deltas = [tv[name] for tv in task_vectors]
n_tasks = len(deltas)
# ===== Step 1: TRIM =====
# Keep only top-k% by magnitude in each task vector
trimmed = []
        for delta in deltas:
            # Note: torch.quantile may fail for very large tensors (more than
            # ~16M elements); use torch.kthvalue/torch.topk for big layers
            threshold = delta.abs().quantile(1 - density)
            mask = delta.abs() >= threshold
            trimmed.append(delta * mask.float())
# ===== Step 2: ELECT SIGN =====
# For each parameter position, determine the majority sign
stacked = torch.stack(trimmed) # [n_tasks, *param_shape]
if majority_sign_method == "total":
# Weight by magnitude: sum of signed values
sign_vote = stacked.sum(dim=0)
else:
# Count method: number of positive vs negative
sign_vote = (stacked > 0).float().sum(dim=0) - \
(stacked < 0).float().sum(dim=0)
elected_sign = torch.sign(sign_vote)
# Where sign_vote is 0, default to positive
elected_sign[elected_sign == 0] = 1.0
# ===== Step 3: DISJOINT MERGE =====
# Average only task vectors that agree with elected sign
agree_sum = torch.zeros_like(deltas[0])
agree_count = torch.zeros_like(deltas[0])
for t in trimmed:
# Check agreement: same sign as elected, and non-zero
agrees = (torch.sign(t) == elected_sign) & (t != 0)
agree_sum += t * agrees.float()
agree_count += agrees.float()
# Average the agreeing values
merged_tv[name] = agree_sum / agree_count.clamp(min=1)
return merged_tv
def apply_ties_merge(base_model, task_vectors, scaling_factor=1.0, density=0.2):
"""Apply TIES merge to produce final model."""
merged_tv = ties_merge(task_vectors, density=density)
merged_model = copy.deepcopy(base_model)
for name, param in merged_model.named_parameters():
param.data.add_(scaling_factor * merged_tv[name])
return merged_model
3.3 Why Each Step Matters
def ablation_ties_steps(base_model, task_vectors, eval_fn):
"""Ablation: contribution of each TIES step."""
results = {}
# 1. Naive sum (no TIES)
naive_tv = {}
for name in task_vectors[0]:
naive_tv[name] = sum(tv[name] for tv in task_vectors)
naive_model = copy.deepcopy(base_model)
for name, p in naive_model.named_parameters():
p.data.add_(naive_tv[name])
results['naive'] = eval_fn(naive_model)
# 2. Trim only
trimmed_tvs = []
for tv in task_vectors:
trimmed = {}
for name, delta in tv.items():
threshold = delta.abs().quantile(0.8)
mask = delta.abs() >= threshold
trimmed[name] = delta * mask.float()
trimmed_tvs.append(trimmed)
trim_tv = {}
for name in task_vectors[0]:
trim_tv[name] = sum(t[name] for t in trimmed_tvs) / len(trimmed_tvs)
trim_model = copy.deepcopy(base_model)
for name, p in trim_model.named_parameters():
p.data.add_(trim_tv[name])
results['trim_only'] = eval_fn(trim_model)
# 3. Full TIES
ties_tv = ties_merge(task_vectors, density=0.2)
ties_model = copy.deepcopy(base_model)
for name, p in ties_model.named_parameters():
p.data.add_(ties_tv[name])
results['full_ties'] = eval_fn(ties_model)
for method, score in results.items():
print(f"{method:15s}: {score:.4f}")
return results
TIES Ablation: Contribution of Each Step
| Method | Avg Benchmark Score | Improvement over Naive |
|---|---|---|
| Naive sum | 58.3 | Baseline |
| Trim only (top 20%) | 61.7 | +3.4 |
| Trim + Sign election | 63.2 | +4.9 |
| Full TIES (trim + elect + disjoint) | 65.1 | +6.8 |
4. DARE: Drop And REscale
4.1 Concept
DARE (Yu et al., 2023) takes a different approach to the conflict problem. Instead of resolving conflicts, it prevents them by randomly dropping most parameters from each task vector before merging.
The key insight: task vectors are highly redundant. You can randomly zero out 90-99% of the entries and still retain most of the task knowledge, as long as you rescale the remaining entries to compensate.
$$\hat{\tau}_i = \frac{m_i \odot \tau_i}{1 - p}$$

where $m_i$ is a binary mask with each entry independently set to 1 with probability $1 - p$ (Bernoulli$(1-p)$), $p$ is the drop rate, and the factor $1/(1-p)$ rescales the surviving entries to preserve the expected magnitude.
4.2 Implementation
def dare_sparsify(task_vector, drop_rate=0.9, rescale=True):
"""
DARE: randomly drop parameters and rescale.
Args:
task_vector: dict of {param_name: delta_tensor}
drop_rate: fraction of parameters to drop (0.9 = keep 10%)
rescale: whether to rescale remaining parameters by 1/(1-p)
"""
sparse_tv = {}
keep_rate = 1 - drop_rate
for name, delta in task_vector.items():
# Generate random binary mask
mask = torch.bernoulli(
torch.full_like(delta, keep_rate)
)
# Apply mask and rescale
if rescale:
sparse_tv[name] = (delta * mask) / keep_rate
else:
sparse_tv[name] = delta * mask
return sparse_tv
def dare_merge(base_model, task_vectors, drop_rate=0.9, scaling_factor=1.0):
"""Merge using DARE: sparsify each task vector, then average."""
# Sparsify each task vector independently
sparse_tvs = [dare_sparsify(tv, drop_rate) for tv in task_vectors]
# Simple average of sparsified task vectors
merged_tv = {}
for name in task_vectors[0]:
merged_tv[name] = sum(stv[name] for stv in sparse_tvs) / len(sparse_tvs)
# Apply to base model
merged_model = copy.deepcopy(base_model)
for name, param in merged_model.named_parameters():
param.data.add_(scaling_factor * merged_tv[name])
return merged_model
4.3 Why DARE Works: The Dropout Connection
DARE is conceptually related to dropout. During training, dropout randomly zeros neurons and rescales the survivors. This works because the expected output is unchanged (the rescaling compensates for the dropped values). DARE applies the same principle to task vectors: the expected merged task vector is the same as the full sum, but with much less interference between tasks.
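As a quick check of the rescaling claim, take a single entry $\tau_j$ and its Bernoulli keep-mask $m_j$ with $\Pr(m_j = 1) = 1 - p$; the expected value of the rescaled entry is unchanged:

$$\mathbb{E}\!\left[\frac{m_j\,\tau_j}{1-p}\right] = \frac{(1-p)\,\tau_j}{1-p} = \tau_j$$

so the sparsified task vector is an unbiased (if higher-variance) estimate of the original.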
The critical difference from dropout: DARE drops a different random subset of parameters for each task vector, so two task vectors can interfere only where their kept subsets happen to overlap (for example, if task A's and task B's kept sets share only parameter 5, that is the only position where they can conflict). With a 90% drop rate and 3 tasks, the expected fraction of parameters kept by two or more tasks is:

$$1 - p^k - k\,(1-p)\,p^{k-1}$$

For $p = 0.9$ and $k = 3$: $1 - 0.729 - 0.243 = 0.028$. Only about 3% of parameters have any conflict at all.
def analyze_dare_conflict_rate(n_tasks, drop_rates):
    """Compute expected conflict rate for different DARE configurations."""
    for p in drop_rates:
        keep = 1 - p
        # P(a parameter is kept by 2+ tasks)
        # = 1 - P(kept by none) - P(kept by exactly one)
        p_zero = p ** n_tasks
        p_one = n_tasks * keep * (p ** (n_tasks - 1))
        p_conflict = 1 - p_zero - p_one
        print(f"p={p:g}, k={n_tasks}: "
              f"conflict_rate={p_conflict:.4f}, "
              f"effective_params_per_task={keep:.1%}")
analyze_dare_conflict_rate(3, [0.5, 0.7, 0.9, 0.95, 0.99])
# p=0.5, k=3: conflict_rate=0.5000, effective_params_per_task=50.0%
# p=0.7, k=3: conflict_rate=0.2160, effective_params_per_task=30.0%
# p=0.9, k=3: conflict_rate=0.0280, effective_params_per_task=10.0%
# p=0.95, k=3: conflict_rate=0.0073, effective_params_per_task=5.0%
# p=0.99, k=3: conflict_rate=0.0003, effective_params_per_task=1.0%
4.4 DARE + TIES
DARE and TIES are complementary. DARE reduces conflicts through sparsification. TIES resolves any remaining conflicts through sign election. Combining them:
def dare_ties_merge(task_vectors, drop_rate=0.9, density=0.2, scaling_factor=1.0):
"""Combined DARE + TIES merge."""
# Step 1: DARE sparsification
sparse_tvs = [dare_sparsify(tv, drop_rate) for tv in task_vectors]
# Step 2: TIES merge on the sparsified task vectors
merged_tv = ties_merge(sparse_tvs, density=density)
return merged_tv
Merge Method Comparison (3 LoRA Adapters, Llama 3 8B)
| Method | Code (HumanEval) | Math (GSM8K) | Chat (MT-Bench) | Average |
|---|---|---|---|---|
| Linear Average | 41.5 | 48.2 | 6.8 | 48.8 |
| Task Arithmetic (lambda=0.8) | 51.2 | 61.8 | 7.5 | 60.2 |
| TIES (density=0.2) | 54.3 | 64.1 | 7.8 | 63.1 |
| DARE (p=0.9) | 53.8 | 63.5 | 7.7 | 62.5 |
| DARE + TIES | 55.1 | 65.3 | 7.9 | 64.1 |
| Individual code model | 58.5 | 42.1 | 6.2 | 53.6 |
| Individual math model | 38.2 | 68.7 | 6.5 | 53.5 |
5. Evolutionary Merging
5.1 The Per-Layer Coefficient Problem
All methods above use a single scaling factor for the entire model. But different layers might need different merge strengths. Early layers (embeddings, first few transformer blocks) are often more sensitive to perturbation. Late layers (near the output head) encode task-specific features that should be preserved more strongly.
Evolutionary merging optimizes a coefficient matrix $\Lambda \in \mathbb{R}^{L \times K}$, where $L$ is the number of layers and $K$ is the number of task vectors, to maximize performance on a validation set:

$$\theta_{\text{merged}}^{(l)} = \theta_{\text{base}}^{(l)} + \sum_{i=1}^{K} \lambda_{l,i}\,\tau_i^{(l)}, \qquad \Lambda^* = \arg\max_{\Lambda}\ \text{score}\big(\theta_{\text{merged}}(\Lambda);\ \mathcal{D}_{\text{val}}\big)$$

where $\lambda_{l,i}$ is the per-layer scaling factor for task $i$ at layer $l$.
5.2 CMA-ES Optimization
CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is well-suited for this optimization because the search space is low-dimensional (one coefficient per layer), the objective is noisy (evaluation on a finite dataset), and no gradients are available.
import numpy as np
class CMAESMergeOptimizer:
"""Optimize per-layer merge coefficients using CMA-ES."""
def __init__(self, n_layers, n_tasks, population_size=20, sigma0=0.3):
self.n_layers = n_layers
self.n_tasks = n_tasks
# Search space: n_layers * n_tasks coefficients
self.dim = n_layers * n_tasks
self.pop_size = population_size
self.sigma = sigma0
# CMA-ES state
self.mean = np.ones(self.dim) * 0.5 # Start at 0.5 for all
self.cov = np.eye(self.dim) * sigma0 ** 2
self.best_score = -float('inf')
self.best_coefficients = None
def ask(self):
"""Generate a population of candidate coefficient vectors."""
population = np.random.multivariate_normal(
self.mean, self.cov, size=self.pop_size
)
# Clip to [0, 2] range
population = np.clip(population, 0, 2)
return population
def tell(self, population, scores):
"""Update CMA-ES state based on evaluated scores."""
# Sort by score (descending)
order = np.argsort(scores)[::-1]
sorted_pop = population[order]
sorted_scores = scores[order]
# Update best
if sorted_scores[0] > self.best_score:
self.best_score = sorted_scores[0]
self.best_coefficients = sorted_pop[0].copy()
# Weighted mean update (top half)
n_elite = self.pop_size // 2
weights = np.log(n_elite + 0.5) - np.log(np.arange(1, n_elite + 1))
weights /= weights.sum()
self.mean = np.sum(
weights[:, None] * sorted_pop[:n_elite], axis=0
)
# Simplified covariance update
diffs = sorted_pop[:n_elite] - self.mean
self.cov = np.sum(
weights[:, None, None] * (diffs[:, :, None] * diffs[:, None, :]),
axis=0,
)
self.cov += 1e-6 * np.eye(self.dim) # Regularization
def get_layer_coefficients(self, flat_coefficients):
"""Reshape flat coefficient vector into per-layer, per-task format."""
return flat_coefficients.reshape(self.n_layers, self.n_tasks)
def evolutionary_merge(
base_model, task_vectors, eval_fn,
n_generations=50, population_size=20, device='cuda',
):
"""Find optimal per-layer merge coefficients using CMA-ES."""
n_layers = count_layers(base_model)
n_tasks = len(task_vectors)
optimizer = CMAESMergeOptimizer(n_layers, n_tasks, population_size)
for gen in range(n_generations):
population = optimizer.ask()
scores = np.zeros(len(population))
for i, coefficients in enumerate(population):
layer_coeffs = optimizer.get_layer_coefficients(coefficients)
# Apply per-layer, per-task merge
merged = apply_layerwise_merge(
base_model, task_vectors, layer_coeffs
)
scores[i] = eval_fn(merged)
del merged
torch.cuda.empty_cache()
optimizer.tell(population, scores)
print(f"Gen {gen}: best={optimizer.best_score:.4f}, "
f"mean={scores.mean():.4f}")
return optimizer.best_coefficients
def count_layers(model):
"""Count transformer layers in a model."""
if hasattr(model, 'model') and hasattr(model.model, 'layers'):
return len(model.model.layers)
if hasattr(model, 'transformer') and hasattr(model.transformer, 'h'):
return len(model.transformer.h)
raise ValueError("Cannot determine number of layers")
def apply_layerwise_merge(base_model, task_vectors, layer_coefficients):
"""Apply merge with per-layer, per-task coefficients."""
merged = copy.deepcopy(base_model)
for name, param in merged.named_parameters():
# Determine which layer this parameter belongs to
layer_idx = extract_layer_index(name)
if layer_idx is not None and layer_idx < layer_coefficients.shape[0]:
combined_delta = torch.zeros_like(param.data)
for task_idx, tv in enumerate(task_vectors):
coeff = layer_coefficients[layer_idx, task_idx]
combined_delta += coeff * tv[name]
param.data.add_(combined_delta)
else:
# Non-layer parameters (embeddings, final norm): use mean coefficient
combined_delta = torch.zeros_like(param.data)
for task_idx, tv in enumerate(task_vectors):
coeff = layer_coefficients[:, task_idx].mean()
combined_delta += coeff * tv[name]
param.data.add_(combined_delta)
return merged
def extract_layer_index(param_name):
"""Extract layer index from parameter name like 'model.layers.15.self_attn.q_proj.weight'."""
import re
match = re.search(r'layers?[._](\d+)', param_name)
if match:
return int(match.group(1))
return None
5.3 Evolutionary Merge Results
Per-Layer Merge Coefficients (Evolved, 3 Tasks)
The evolved coefficients consistently show a pattern: low for early layers (preserve the base model's fundamental representations), high for middle and late layers (incorporate task-specific knowledge), and moderate for the final layer (avoid overshooting near the output head).
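Where a full evolutionary search is too expensive, one option is to hand-craft a schedule that follows this pattern and feed it to apply_layerwise_merge. The sketch below is illustrative only; the ramp values are assumptions, not evolved coefficients:

import numpy as np

def heuristic_layer_coefficients(n_layers, n_tasks, early=0.3, middle=1.0, final=0.6):
    """Build an [n_layers, n_tasks] coefficient matrix: ramp up over the first
    quarter of the layers, hold in the middle, ease off at the last layer."""
    coeffs = np.full((n_layers, n_tasks), middle)
    ramp_len = max(1, n_layers // 4)
    for l in range(ramp_len):
        coeffs[l, :] = early + (middle - early) * l / ramp_len
    coeffs[-1, :] = final
    return coeffs

# layer_coeffs = heuristic_layer_coefficients(n_layers=32, n_tasks=3)
# merged = apply_layerwise_merge(base, [tv_code, tv_math, tv_writing], layer_coeffs)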
Evolutionary vs Uniform Merge Coefficients
| Method | Coefficient Type | Avg Score | Search Cost (GPU-hours) |
|---|---|---|---|
| Task arithmetic | Single global lambda | 60.2 | 0.1 (grid search) |
| TIES | Single global lambda | 63.1 | 0.1 |
| Evolutionary (per-layer) | L coefficients | 67.8 | 5-10 |
| Evolutionary (per-layer, per-task) | L x K coefficients | 68.5 | 20-50 |
6. Implementation: Merging Two LoRA Adapters with TIES
LoRA adapters are the most common merge target because they are small (typically 0.1-1% of model parameters), already structured as task vectors (the adapter weights are the delta from the base model), and can be merged without loading the full model weights.
6.1 LoRA Background
A LoRA adapter replaces a weight matrix $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ with:

$$W' = W + \frac{\alpha}{r}\,B A$$

where $B \in \mathbb{R}^{d_{\text{out}} \times r}$, $A \in \mathbb{R}^{r \times d_{\text{in}}}$, $r$ is the rank (typically 8-64), and $\alpha$ is a scaling factor.

The task vector for a LoRA adapter is simply $\tau = \frac{\alpha}{r}\,B A$. Merging two LoRA adapters is equivalent to merging their task vectors.
6.2 Complete Implementation
import torch
import json
import os
from collections import OrderedDict
class LoRAAdapter:
"""Represents a LoRA adapter with its A and B matrices."""
def __init__(self, state_dict, rank, alpha, target_modules):
self.state_dict = state_dict # {name: tensor}
self.rank = rank
self.alpha = alpha
self.target_modules = target_modules
self.scaling = alpha / rank
@classmethod
def load(cls, path):
"""Load a LoRA adapter from disk."""
adapter_weights = torch.load(
os.path.join(path, "adapter_model.bin"), map_location="cpu"
)
with open(os.path.join(path, "adapter_config.json")) as f:
config = json.load(f)
return cls(
state_dict=adapter_weights,
rank=config["r"],
alpha=config["lora_alpha"],
target_modules=config["target_modules"],
)
def compute_task_vectors(self):
"""Compute the effective weight delta for each target module."""
task_vectors = {}
# Find all A/B pairs
a_matrices = {}
b_matrices = {}
for name, tensor in self.state_dict.items():
if "lora_A" in name:
base_name = name.replace(".lora_A.weight", "")
a_matrices[base_name] = tensor
elif "lora_B" in name:
base_name = name.replace(".lora_B.weight", "")
b_matrices[base_name] = tensor
# Compute delta = (alpha/r) * B @ A
for base_name in a_matrices:
A = a_matrices[base_name] # [r, d_in]
B = b_matrices[base_name] # [d_out, r]
delta = self.scaling * (B @ A) # [d_out, d_in]
task_vectors[base_name] = delta
return task_vectors
def ties_merge_lora(adapters, density=0.2, scaling_factor=1.0):
"""
Merge multiple LoRA adapters using TIES method.
Args:
adapters: list of LoRAAdapter objects
density: fraction of parameters to keep in trimming step
scaling_factor: global scaling for the merged result
Returns:
dict of merged task vectors {module_name: merged_delta}
"""
# Step 1: Compute task vectors for each adapter
all_task_vectors = [adapter.compute_task_vectors() for adapter in adapters]
# Get all module names (union across all adapters)
all_modules = set()
for tvs in all_task_vectors:
all_modules.update(tvs.keys())
merged_deltas = {}
for module_name in sorted(all_modules):
# Collect deltas for this module from all adapters
deltas = []
for tvs in all_task_vectors:
if module_name in tvs:
deltas.append(tvs[module_name])
            else:
                # Adapter doesn't modify this module -- use a zero delta with
                # the shape taken from any adapter that does have this module
                reference = next(
                    t[module_name] for t in all_task_vectors if module_name in t
                )
                deltas.append(torch.zeros_like(reference))
n_tasks = len(deltas)
# ===== TRIM =====
trimmed = []
for delta in deltas:
if delta.abs().max() == 0:
trimmed.append(delta)
continue
threshold = delta.abs().quantile(1 - density)
mask = delta.abs() >= threshold
trimmed.append(delta * mask.float())
# ===== ELECT SIGN =====
stacked = torch.stack(trimmed) # [n_tasks, d_out, d_in]
sign_vote = stacked.sum(dim=0) # Sum magnitudes with signs
elected_sign = torch.sign(sign_vote)
elected_sign[elected_sign == 0] = 1.0
# ===== DISJOINT MERGE =====
agree_sum = torch.zeros_like(deltas[0])
agree_count = torch.zeros_like(deltas[0])
for t in trimmed:
agrees = (torch.sign(t) == elected_sign) & (t != 0)
agree_sum += t * agrees.float()
agree_count += agrees.float()
merged_delta = agree_sum / agree_count.clamp(min=1)
merged_deltas[module_name] = scaling_factor * merged_delta
return merged_deltas
def apply_merged_lora_to_model(base_model, merged_deltas):
"""Apply merged LoRA deltas to the base model weights."""
merged_model = copy.deepcopy(base_model)
    applied = 0
    # PEFT typically prefixes adapter keys with "base_model.model." -- strip it
    # so the module names line up with the base model's parameter names
    normalized_deltas = {
        k.replace("base_model.model.", ""): v for k, v in merged_deltas.items()
    }
    for name, param in merged_model.named_parameters():
        # Map parameter name to LoRA module name
        module_name = name.replace(".weight", "")
        if module_name in normalized_deltas:
            param.data.add_(normalized_deltas[module_name])
            applied += 1
print(f"Applied merged deltas to {applied} parameters")
return merged_model
6.3 End-to-End LoRA Merge Example
def merge_lora_adapters_example():
"""Complete example: merge code and math LoRA adapters."""
from transformers import AutoModelForCausalLM
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-8B",
torch_dtype=torch.bfloat16,
device_map="cpu", # CPU for merging (memory)
)
# Load LoRA adapters
code_adapter = LoRAAdapter.load("./code-lora-adapter")
math_adapter = LoRAAdapter.load("./math-lora-adapter")
print(f"Code adapter: rank={code_adapter.rank}, "
f"alpha={code_adapter.alpha}, "
f"scaling={code_adapter.scaling:.4f}")
print(f"Math adapter: rank={math_adapter.rank}, "
f"alpha={math_adapter.alpha}, "
f"scaling={math_adapter.scaling:.4f}")
# Analyze task vector properties
code_tvs = code_adapter.compute_task_vectors()
math_tvs = math_adapter.compute_task_vectors()
for name in list(code_tvs.keys())[:3]:
code_delta = code_tvs[name]
math_delta = math_tvs[name]
# Check sign agreement
code_sign = torch.sign(code_delta)
math_sign = torch.sign(math_delta)
agree = (code_sign == math_sign).float().mean()
# Check magnitude
code_mag = code_delta.abs().mean()
math_mag = math_delta.abs().mean()
print(f"\n{name}:")
print(f" Code delta magnitude: {code_mag:.6f}")
print(f" Math delta magnitude: {math_mag:.6f}")
print(f" Sign agreement: {agree:.2%}")
# Merge with TIES
print("\nMerging with TIES (density=0.2)...")
merged_deltas = ties_merge_lora(
[code_adapter, math_adapter],
density=0.2,
scaling_factor=0.8,
)
# Apply to base model
merged_model = apply_merged_lora_to_model(base_model, merged_deltas)
# Save merged model
merged_model.save_pretrained("llama-8b-code-math-merged")
print("Merged model saved.")
return merged_model
6.4 Merge Quality Diagnostics
def diagnose_merge_quality(base_model, merged_model, individual_models, eval_datasets):
"""Comprehensive merge quality analysis."""
results = {'base': {}, 'merged': {}}
for name in individual_models:
results[name] = {}
for task_name, dataset in eval_datasets.items():
# Evaluate base
results['base'][task_name] = evaluate(base_model, dataset)
# Evaluate individual fine-tuned models
for model_name, model in individual_models.items():
results[model_name][task_name] = evaluate(model, dataset)
# Evaluate merged
results['merged'][task_name] = evaluate(merged_model, dataset)
# Compute merge efficiency metrics
print("\n=== Merge Quality Report ===")
for task_name in eval_datasets:
base_score = results['base'][task_name]
merged_score = results['merged'][task_name]
best_individual = max(
results[name][task_name] for name in individual_models
)
improvement_over_base = merged_score - base_score
retention_of_best = merged_score / best_individual * 100
print(f"\n{task_name}:")
print(f" Base: {base_score:.2f}")
print(f" Best individual: {best_individual:.2f}")
print(f" Merged: {merged_score:.2f}")
print(f" Improvement over base: +{improvement_over_base:.2f}")
print(f" Retention of best: {retention_of_best:.1f}%")
# Average across tasks
merged_avg = sum(results['merged'].values()) / len(results['merged'])
individual_avgs = {
name: sum(scores.values()) / len(scores)
for name, scores in results.items()
if name not in ('base', 'merged')
}
best_avg = max(individual_avgs.values())
print(f"\n=== Summary ===")
print(f"Merged average: {merged_avg:.2f}")
print(f"Best individual average: {best_avg:.2f}")
print(f"Merged outperforms best individual by: "
f"{merged_avg - best_avg:+.2f}")
7. Practical Guidelines
7.1 Decision Tree
def recommend_merge_method(n_models, same_base, compute_budget_hours):
"""Recommend a merge method based on constraints."""
if not same_base:
print("ERROR: Cannot merge models with different base architectures.")
print("Retrain from a common base or use ensemble instead.")
return None
if n_models == 2:
if compute_budget_hours < 0.1:
return "Linear averaging (alpha=0.5) or task arithmetic (lambda=0.8)"
elif compute_budget_hours < 1:
return "TIES (density=0.2, sweep lambda in [0.5, 1.0])"
else:
return "Evolutionary per-layer merge"
elif n_models <= 5:
if compute_budget_hours < 0.5:
return "DARE (p=0.9) + simple average"
elif compute_budget_hours < 5:
return "DARE + TIES (density=0.2)"
else:
return "Evolutionary per-layer, per-task merge"
else: # Many models
return "DARE (p=0.95) + TIES -- high sparsification essential"
7.2 Hyperparameter Defaults
Recommended Hyperparameters by Method
| Method | Key Parameter | Default Value | Search Range |
|---|---|---|---|
| Linear Average | alpha | 0.5 | [0.3, 0.7] |
| Task Arithmetic | lambda | 0.8 | [0.5, 1.2] |
| TIES | density | 0.2 | [0.1, 0.5] |
| TIES | scaling_factor | 1.0 | [0.5, 1.5] |
| DARE | drop_rate | 0.9 | [0.8, 0.99] |
| DARE | scaling_factor | 1.0 | [0.5, 1.5] |
| Evolutionary | population_size | 20 | [10, 50] |
| Evolutionary | n_generations | 50 | [30, 100] |
7.3 Common Pitfalls
- Forgetting to subtract the base: Task arithmetic requires computing $\tau = \theta_{\text{ft}} - \theta_{\text{base}}$. Using absolute weights instead of deltas produces garbage.
- Merging with different tokenizers: Even if two models share the same architecture, different tokenizers mean different embedding matrices. The merge will silently produce wrong token mappings; a quick vocabulary check is sketched after this list.
- Ignoring non-layer parameters: Embeddings, the final layer norm, and the LM head are not part of any transformer layer. They need separate merge handling (usually conservative averaging).
- Too many models: Each additional model in the merge increases conflicts. Beyond 5-6 models, even DARE + TIES struggles. Use DARE with a very high drop rate (p of 0.95 or above) or cluster models by task similarity before merging.
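A minimal sketch of the tokenizer check from the second pitfall, assuming both checkpoints ship Hugging Face tokenizers:

from transformers import AutoTokenizer

def tokenizers_match(path_a, path_b):
    """Return True if two checkpoints use the same vocabulary and token-to-id mapping."""
    tok_a = AutoTokenizer.from_pretrained(path_a)
    tok_b = AutoTokenizer.from_pretrained(path_b)
    if tok_a.get_vocab() != tok_b.get_vocab():
        print(f"Vocabulary mismatch: {len(tok_a)} vs {len(tok_b)} tokens "
              "(or identical size but different mappings)")
        return False
    return True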
Model merging has no training cost, but it has a quality cost. A merged model almost never matches a single model fine-tuned on the combined dataset. Merging is a trade-off: zero compute cost in exchange for 5-15% quality degradation versus dedicated training. Use merging when you cannot afford to retrain, or when you need to quickly combine capabilities from multiple sources.
8. Advanced: SLERP and Model Soups
8.1 Spherical Linear Interpolation (SLERP)
Linear interpolation moves along a straight line in weight space. SLERP moves along the surface of a hypersphere, preserving the magnitude of the weight vectors:

$$\text{slerp}(\theta_A, \theta_B; \alpha) = \frac{\sin\big((1-\alpha)\,\Omega\big)}{\sin \Omega}\,\theta_A + \frac{\sin(\alpha\,\Omega)}{\sin \Omega}\,\theta_B$$

where $\Omega$ is the angle between the weight vectors.
def slerp_merge(model_a, model_b, alpha=0.5):
"""Spherical linear interpolation of model weights."""
merged = copy.deepcopy(model_a)
for (name, param_a), (_, param_b) in zip(
model_a.named_parameters(), model_b.named_parameters()
):
flat_a = param_a.data.flatten().float()
flat_b = param_b.data.flatten().float()
# Normalize
norm_a = flat_a.norm()
norm_b = flat_b.norm()
if norm_a < 1e-8 or norm_b < 1e-8:
# Degenerate case: fall back to linear
result = alpha * param_a.data + (1 - alpha) * param_b.data
else:
unit_a = flat_a / norm_a
unit_b = flat_b / norm_b
# Angle between vectors
cos_theta = torch.clamp(torch.dot(unit_a, unit_b), -1, 1)
theta = torch.acos(cos_theta)
if theta.abs() < 1e-6:
# Nearly parallel: linear interpolation
result = alpha * param_a.data + (1 - alpha) * param_b.data
else:
# SLERP
sin_theta = torch.sin(theta)
w_a = torch.sin((1 - alpha) * theta) / sin_theta
w_b = torch.sin(alpha * theta) / sin_theta
result_flat = w_a * flat_a + w_b * flat_b
result = result_flat.reshape(param_a.shape).to(param_a.dtype)
merged.state_dict()[name].copy_(result)
return merged
8.2 Model Soups
Model Soups (Wortsman et al., 2022) average multiple fine-tuning runs of the same model on the same data with different hyperparameters. This is distinct from task merging: it merges for robustness rather than for capability combination.
def greedy_model_soup(models, eval_fn):
"""
Greedy Model Soup: iteratively add models that improve the average.
Start with the best individual model. Try adding each remaining model.
Keep the addition that improves quality the most. Repeat until no
improvement is possible.
"""
# Find best individual model
individual_scores = [(i, eval_fn(m)) for i, m in enumerate(models)]
individual_scores.sort(key=lambda x: x[1], reverse=True)
soup_indices = [individual_scores[0][0]]
current_soup = copy.deepcopy(models[soup_indices[0]])
current_score = individual_scores[0][1]
remaining = [i for i, _ in individual_scores[1:]]
while remaining:
best_candidate = None
best_new_score = current_score
for idx in remaining:
# Try adding this model to the soup
trial_models = [models[i] for i in soup_indices] + [models[idx]]
trial_soup = multi_model_average(trial_models)
trial_score = eval_fn(trial_soup)
if trial_score > best_new_score:
best_new_score = trial_score
best_candidate = idx
if best_candidate is not None:
soup_indices.append(best_candidate)
remaining.remove(best_candidate)
current_soup = multi_model_average([models[i] for i in soup_indices])
current_score = best_new_score
print(f"Added model {best_candidate}, "
f"soup size={len(soup_indices)}, "
f"score={current_score:.4f}")
else:
print("No improvement possible. Stopping.")
break
return current_soup, soup_indices
9. Benchmarking Merge Methods at Scale
def comprehensive_merge_benchmark(
base_model_name, adapter_paths, eval_datasets, device='cuda',
):
"""Run all merge methods and compare results."""
from transformers import AutoModelForCausalLM
# Load base model
base = AutoModelForCausalLM.from_pretrained(
base_model_name, torch_dtype=torch.bfloat16, device_map=device
)
# Load adapters
adapters = [LoRAAdapter.load(p) for p in adapter_paths]
# Compute task vectors
task_vectors = [a.compute_task_vectors() for a in adapters]
methods = {
'Simple Average': lambda tvs: {
k: sum(tv[k] for tv in tvs) / len(tvs) for k in tvs[0]
},
'Task Arithmetic (0.8)': lambda tvs: {
k: 0.8 * sum(tv[k] for tv in tvs) for k in tvs[0]
},
'TIES (d=0.2)': lambda tvs: ties_merge(tvs, density=0.2),
'DARE (p=0.9)': lambda tvs: {
k: sum(dare_sparsify(tv, 0.9)[k] for tv in tvs) / len(tvs)
for k in tvs[0]
},
'DARE+TIES': lambda tvs: dare_ties_merge(tvs, drop_rate=0.9, density=0.2),
}
all_results = {}
for method_name, merge_fn in methods.items():
print(f"\n{'='*60}")
print(f"Method: {method_name}")
merged_tv = merge_fn(task_vectors)
# Apply to base model
merged_model = copy.deepcopy(base)
for name, param in merged_model.named_parameters():
module_name = name.replace(".weight", "")
if module_name in merged_tv:
param.data.add_(merged_tv[module_name])
# Evaluate
scores = {}
for task_name, dataset in eval_datasets.items():
score = evaluate(merged_model, dataset)
scores[task_name] = score
print(f" {task_name}: {score:.2f}")
scores['average'] = sum(scores.values()) / len(scores)
all_results[method_name] = scores
del merged_model
torch.cuda.empty_cache()
# Print comparison table
print(f"\n{'='*60}")
print("SUMMARY")
for method_name, scores in sorted(
all_results.items(), key=lambda x: x[1]['average'], reverse=True
):
print(f" {method_name:25s}: avg={scores['average']:.2f}")
return all_results
Merge Method Quality Ranking (Average Across Tasks)
10. Summary
Model merging is a zero-cost technique for combining capabilities from multiple fine-tuned models. The quality hierarchy is clear:
- Evolutionary per-layer merge: Best quality (+18% over linear averaging), but requires 5-50 GPU-hours of search
- DARE + TIES: Best automated method (+15% over linear averaging), no search required
- TIES: Strong default (+14%), handles sign conflicts
- DARE: Simple and effective (+14%), handles conflict through sparsification
- Task arithmetic: Reasonable baseline (+11%), requires lambda tuning
- Linear averaging: Simplest but weakest, works only when models are very similar
Key constraints to remember:
- All models must share the same base architecture and tokenizer
- Fine-tuning should be relatively light (LoRA or short full fine-tuning)
- More than 5-6 models in a single merge degrades quality significantly
- Merging is not a substitute for training on combined data (5-15% quality gap)
11. Worked Example: TIES Merging by Hand
Challenge: Two LoRA adapters produce task vectors for the same weight row. For concreteness (the values below are illustrative), take Adapter A's delta as $\tau_A = [0.40, 0.10, -0.05, 0.50, 0.02]$ and Adapter B's delta as $\tau_B = [0.05, -0.30, 0.20, -0.10, 0.01]$. Apply the TIES merge with density = 0.4 (keep the top 40% by magnitude in each adapter).
Step 1 -- Trim: Keep the top 40% = top 2 values per adapter (by magnitude).
Adapter A magnitudes: $[0.40, 0.10, 0.05, 0.50, 0.02]$. Top 2: positions 0 ($0.40$) and 3 ($0.50$). Trimmed A: $[0.40, 0, 0, 0.50, 0]$.
Adapter B magnitudes: $[0.05, 0.30, 0.20, 0.10, 0.01]$. Top 2: positions 1 ($0.30$) and 2 ($0.20$). Trimmed B: $[0, -0.30, 0.20, 0, 0]$.
Step 2 -- Elect Sign: Sum the trimmed values at each position. Position 0: $0.40$, sign $+$. Position 1: $-0.30$, sign $-$. Position 2: $0.20$, sign $+$. Position 3: $0.50$, sign $+$. Position 4: $0$, sign $+$ (default).
Step 3 -- Disjoint Merge: Average only the values that agree with the elected sign. Position 0: elected $+$; A's $0.40$ agrees, B contributes nothing. Average $= 0.40$. Position 1: elected $-$; A contributes nothing, B's $-0.30$ agrees. Average $= -0.30$. Position 2: elected $+$; A contributes nothing, B's $0.20$ agrees. Average $= 0.20$. Position 3: elected $+$; A's $0.50$ agrees, B contributes nothing. Average $= 0.50$. Position 4: no nonzero values. Result $= 0$.
Result: TIES merged vector $= [0.40, -0.30, 0.20, 0.50, 0]$.
Compare to the naive sum: $\tau_A + \tau_B = [0.45, -0.20, 0.15, 0.40, 0.03]$. TIES preserves the dominant contributor at each position rather than blending conflicting signals. Position 3 retains Adapter A's strong $+0.50$ instead of being diluted to $+0.40$ by Adapter B's opposing $-0.10$.
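As a quick check, the same arithmetic can be reproduced with the ties_merge function from Section 3.2 on a toy one-parameter "model", using the illustrative values above:

import torch

tau_a = {"w": torch.tensor([0.40, 0.10, -0.05, 0.50, 0.02])}
tau_b = {"w": torch.tensor([0.05, -0.30, 0.20, -0.10, 0.01])}

merged = ties_merge([tau_a, tau_b], density=0.4)
print(merged["w"])
# tensor([ 0.4000, -0.3000,  0.2000,  0.5000,  0.0000])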