Part of series: Transformer Anatomy (23 of 23)
1. The Transformer Attention Mechanism: From First Principles to Performance Reality
2. Tokenization and BPE: How LLMs See Text — From Characters to Subwords
3. Embedding Layers: The Geometry of Meaning in LLMs
4. Position Encoding in Transformers: From Sinusoidal to RoPE, ALiBi, and Long-Context Scaling
5. Softmax Numerics: Log-Sum-Exp, Temperature, and Why Numerical Stability Matters
6. Attention Variants Compared: MHA, MQA, GQA, and MLA
7. Normalization in Transformers: LayerNorm, RMSNorm, and the Training Stability Story
8. Residual Connections and Skip Paths: Why Transformers Can Be 100 Layers Deep
9. The Feed-Forward Network: SwiGLU, Gating, and the FFN-as-Memory Hypothesis
10. Mixture of Experts: Why Conditional Computation Is the Path to Trillion-Parameter Models
11. The Output Head: Unembedding, Weight Tying, and Vocabulary Projection
12. Cross-Entropy Loss: How the Loss Function Shapes What an LLM Learns
13. Encoder vs Decoder: Why Decoder-Only Won
14. DeepSeek V3: How 671B Parameters Trained for the Cost of a 70B Dense Model
15. Building a Transformer From Scratch: Putting Every Component Together
16. Gradient Flow and Backpropagation Through Transformers: What Happens During the Backward Pass
17. Weight Initialization: Xavier, Kaiming, and Why mu-P Changes Everything for Large Models
18. Training Loop Anatomy: Forward Pass, Loss Computation, Backward Pass, Optimizer Step
19. Learning Rate Schedules: Warmup, Cosine Decay, and Why WSD Changes Everything
20. Activation Functions Deep Dive: ReLU, GELU, SiLU, and Why Each Matters for Transformers
21. Attention Masking: Causal, Bidirectional, Sliding Window, Block Sparse, and Custom Patterns
22. Knowledge Distillation: Training Small Models to Match Large Ones
23. Model Merging: Weight Averaging, TIES, DARE, and Evolutionary Search

Model merging combines multiple fine-tuned models into a single model without any additional training. You have a model fine-tuned for code generation, another fine-tuned for math reasoning, and a third fine-tuned for creative writing. Merging produces a single model that performs all three tasks. No GPU hours. No training data. Just weight arithmetic.

This works because fine-tuning a pretrained model moves its weights by a small amount β€” typically less than 1% of the original magnitude. These small perturbations (task vectors) encode task-specific knowledge and can be composed through addition, averaging, or more sophisticated methods like TIES and DARE.

This post covers every major merging technique with exact mathematics, implementation code, and empirical results showing when each method works and when it fails.


1. Linear Weight Averaging

1.1 Definition

The simplest merge: linearly interpolate the weights of two models.

$$W_{\text{merged}} = \alpha W_A + (1 - \alpha) W_B$$

where $W_A$ and $W_B$ are the weight tensors of models $A$ and $B$, and $\alpha \in [0, 1]$ controls the interpolation.

import torch
import copy

def linear_merge(model_a, model_b, alpha=0.5):
    """Merge two models by linear weight interpolation."""
    merged = copy.deepcopy(model_a)
    params_b = dict(model_b.named_parameters())

    with torch.no_grad():
        for name, param in merged.named_parameters():
            assert name in params_b, f"Parameter name mismatch: {name}"
            param.copy_(alpha * param + (1 - alpha) * params_b[name])

    return merged

def multi_model_average(models, weights=None):
    """Average N models with optional weights."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)

    assert abs(sum(weights) - 1.0) < 1e-6, "Weights must sum to 1"

    merged = copy.deepcopy(models[0])
    # Build the name -> parameter lookup once per model, not once per parameter
    param_dicts = [dict(m.named_parameters()) for m in models]

    for name, param in merged.named_parameters():
        weighted_sum = torch.zeros_like(param.data)
        for params, w in zip(param_dicts, weights):
            weighted_sum += w * params[name].data
        param.data.copy_(weighted_sum)

    return merged

1.2 When Linear Averaging Works

Linear averaging works when the two models share the same loss basin β€” they occupy nearby regions of the loss landscape that are connected by a path of low loss. This happens when:

  1. Both models are fine-tuned from the same base model (same initialization)
  2. The fine-tuning is relatively short (weights have not diverged far)
  3. The tasks are not contradictory (code + math is fine; "always say yes" + "always say no" is not)

def verify_same_basin(model_a, model_b, eval_fn, n_points=11):
    """Check if two models are in the same loss basin by linear interpolation."""
    alphas = torch.linspace(0, 1, n_points)
    losses = []

    for alpha in alphas:
        merged = linear_merge(model_a, model_b, alpha=alpha.item())
        loss = eval_fn(merged)
        losses.append(loss)
        print(f"alpha={alpha:.1f}: loss={loss:.4f}")

    # If in same basin: losses form a smooth curve, no barrier
    # If different basins: losses spike in the middle
    max_loss = max(losses)
    endpoint_avg = (losses[0] + losses[-1]) / 2
    barrier = max_loss - endpoint_avg

    print(f"\nLoss barrier: {barrier:.4f}")
    if barrier < 0.5:
        print("Models are in the same basin -- linear averaging is safe")
    else:
        print("Models are in different basins -- use task arithmetic instead")

    return losses

1.3 When Linear Averaging Fails

Different base models: Merging Llama 3 8B fine-tuned on code with Mistral 7B fine-tuned on math produces garbage. The weight spaces are completely different.

Too much fine-tuning: Models that have been trained for many epochs on large datasets diverge from the base model significantly. The interpolation path crosses high-loss barriers.

Contradictory tasks: If model A learned to always output JSON and model B learned to always output prose, averaging their weights produces a model that outputs broken half-JSON.

⚠️ Architecture Must Match Exactly

Linear averaging requires identical architectures: same number of layers, same hidden dimensions, same vocabulary. You cannot merge a 7B model with a 13B model. You cannot merge models with different tokenizers (different embedding matrices). Even models with the same architecture but different vocabulary sizes (e.g., extended tokenizer) cannot be directly averaged.
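
A quick pre-merge compatibility check catches these mismatches before any arithmetic happens. A minimal sketch that compares parameter names and shapes, operating on plain `{name: shape}` dicts as produced by e.g. `{n: tuple(p.shape) for n, p in model.named_parameters()}` (the helper name and example shapes below are illustrative):

```python
def check_mergeable(shapes_a, shapes_b):
    """Return a list of mismatches between two {param_name: shape} dicts.

    An empty list means the models can be averaged parameter-by-parameter.
    """
    problems = []
    for name in sorted(shapes_a.keys() | shapes_b.keys()):
        if name not in shapes_a or name not in shapes_b:
            problems.append(f"{name}: present in only one model")
        elif shapes_a[name] != shapes_b[name]:
            problems.append(f"{name}: shape {shapes_a[name]} vs {shapes_b[name]}")
    return problems

# Same architecture, but one model has an extended tokenizer:
# the embedding shapes differ, so a direct average is impossible.
shapes_a = {"embed_tokens.weight": (128256, 4096),
            "layers.0.self_attn.q_proj.weight": (4096, 4096)}
shapes_b = {"embed_tokens.weight": (130000, 4096),
            "layers.0.self_attn.q_proj.weight": (4096, 4096)}
print(check_mergeable(shapes_a, shapes_b))
# -> ['embed_tokens.weight: shape (128256, 4096) vs (130000, 4096)']
```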


2. Task Arithmetic

2.1 Task Vectors

A task vector is the difference between a fine-tuned model and the base model:

$$\tau_A = W_A - W_{\text{base}}$$

This vector encodes the knowledge learned during fine-tuning. It is typically sparse (most weights change very little) and small in magnitude (usually less than 1% of the weight values).

def compute_task_vector(finetuned_model, base_model):
    """Compute the task vector (weight difference) between fine-tuned and base."""
    task_vector = {}
    for (name, ft_param), (_, base_param) in zip(
        finetuned_model.named_parameters(), base_model.named_parameters()
    ):
        task_vector[name] = ft_param.data - base_param.data
    return task_vector

def task_vector_stats(task_vector):
    """Analyze task vector properties."""
    total_params = 0
    total_nonzero = 0
    total_magnitude = 0.0

    for name, delta in task_vector.items():
        n = delta.numel()
        total_params += n
        total_nonzero += (delta.abs() > 1e-8).sum().item()
        total_magnitude += delta.abs().sum().item()

    print(f"Total parameters: {total_params:,}")
    print(f"Non-zero deltas: {total_nonzero:,} ({total_nonzero/total_params:.2%})")
    print(f"Mean |delta|: {total_magnitude/total_params:.6f}")

2.2 Task Vector Composition

Multiple task vectors can be combined to create a model with multiple capabilities:

$$W_{\text{merged}} = W_{\text{base}} + \lambda(\tau_A + \tau_B)$$

where $\lambda$ is a scaling factor that controls the strength of the combined task knowledge. This is called task arithmetic because you perform arithmetic operations on task vectors.

def task_arithmetic_merge(base_model, task_vectors, scaling_factor=1.0):
    """Merge by adding scaled task vectors to the base model."""
    merged = copy.deepcopy(base_model)

    for name, param in merged.named_parameters():
        combined_delta = torch.zeros_like(param.data)
        for tv in task_vectors:
            combined_delta += tv[name]
        param.data.add_(scaling_factor * combined_delta)

    return merged

def task_arithmetic_merge_weighted(base_model, task_vectors, weights):
    """Merge with per-task-vector weights."""
    merged = copy.deepcopy(base_model)

    for name, param in merged.named_parameters():
        combined_delta = torch.zeros_like(param.data)
        for tv, w in zip(task_vectors, weights):
            combined_delta += w * tv[name]
        param.data.add_(combined_delta)

    return merged

# Example: merge code, math, and writing capabilities
base = load_model("meta-llama/Llama-3-8B")
code_model = load_model("code-llama-8b")
math_model = load_model("math-llama-8b")
writing_model = load_model("writing-llama-8b")

tv_code = compute_task_vector(code_model, base)
tv_math = compute_task_vector(math_model, base)
tv_writing = compute_task_vector(writing_model, base)

# Equal weighting
merged = task_arithmetic_merge(
    base, [tv_code, tv_math, tv_writing], scaling_factor=0.8
)

# Custom weighting (more code, less writing)
merged_custom = task_arithmetic_merge_weighted(
    base, [tv_code, tv_math, tv_writing], weights=[0.5, 0.3, 0.2]
)

2.3 Task Vector Negation

An interesting property: negating a task vector removes the corresponding capability:

$$W_{\text{negated}} = W_{\text{base}} - \lambda \tau_A$$

This can be used for "unlearning": removing specific knowledge or behaviors from a model.

def negate_task(base_model, task_vector, scaling_factor=1.0):
    """Remove a capability by subtracting its task vector."""
    negated = copy.deepcopy(base_model)
    for name, param in negated.named_parameters():
        param.data.sub_(scaling_factor * task_vector[name])
    return negated

# Example: remove toxicity learned during fine-tuning
# toxic_tv = compute_task_vector(toxic_finetuned, base)
# clean_model = negate_task(toxic_finetuned, toxic_tv, scaling_factor=0.5)

2.4 The Scaling Factor Problem

The scaling factor $\lambda$ is critical. Too high: the merged model overshoots and produces degenerate outputs. Too low: the merged model retains insufficient task knowledge.

def sweep_scaling_factor(base_model, task_vectors, eval_fn, factors=None):
    """Find optimal scaling factor by grid search."""
    if factors is None:
        factors = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]

    results = {}
    for lam in factors:
        merged = task_arithmetic_merge(base_model, task_vectors, scaling_factor=lam)
        score = eval_fn(merged)
        results[lam] = score
        print(f"lambda={lam:.1f}: score={score:.4f}")

    best_lambda = max(results, key=results.get)
    print(f"\nBest scaling factor: {best_lambda}")
    return best_lambda
📊 Scaling Factor Effect on Merge Quality

| Lambda | Code (HumanEval) | Math (GSM8K) | Writing (AlpacaEval) | Average |
|--------|------------------|--------------|----------------------|---------|
| 0.2    | 38.4             | 42.1         | 71.2                 | 50.6    |
| 0.5    | 45.7             | 55.3         | 78.4                 | 59.8    |
| 0.8    | 51.2             | 61.8         | 82.1                 | 65.0    |
| 1.0    | 52.1             | 60.5         | 80.3                 | 64.3    |
| 1.2    | 48.3             | 54.2         | 75.8                 | 59.4    |
| 1.5    | 31.7             | 35.6         | 62.3                 | 43.2    |

Note: Merging 3 task vectors (code + math + writing) into Llama 3 8B base. Lambda=0.8 is optimal. Lambda > 1.0 overshoots. Lambda > 1.5 produces degenerate outputs.

3. TIES-Merging: Resolving Conflicts

3.1 The Problem with Naive Addition

When adding multiple task vectors, conflicts arise. For a single weight parameter, task vector A might say "increase by 0.05" while task vector B says "decrease by 0.03". Naive addition gives +0.02, but this compromise might be worse for both tasks than either individual change.

TIES-Merging (Yadav et al., 2023) addresses three specific problems:

  1. Redundant parameters: Most task vector entries are very small (noise, not signal)
  2. Sign conflicts: Different tasks pull the same weight in opposite directions
  3. Magnitude differences: Some tasks have larger magnitudes that dominate the merge
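
The sign-conflict problem, and how TIES's "total" sign-election method resolves it, can be sketched on a single weight (the delta values below are made up for illustration):

```python
tau_a = 0.05    # task A wants to increase this weight
tau_b = -0.03   # task B wants to decrease it

# Naive addition: a compromise neither task asked for
naive = tau_a + tau_b                       # ~0.02

# TIES-style resolution: elect the sign with the larger total magnitude,
# then average only the deltas that agree with that sign
elected_sign = 1.0 if tau_a + tau_b >= 0 else -1.0
agreeing = [t for t in (tau_a, tau_b) if t * elected_sign > 0]
resolved = sum(agreeing) / len(agreeing)    # 0.05: task A's update survives intact

print(naive, resolved)
```

The resolved value keeps the full magnitude of the dominant update instead of a watered-down average, which is exactly what the disjoint-merge step below formalizes.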

3.2 The TIES Algorithm

TIES has three steps: Trim, Initialize signs by Election, and merge with Selected signs.

Step 1: Trim — Set small-magnitude entries to zero. Only keep the top-$k\%$ largest changes.

Step 2: Elect Sign — For each parameter, count how many task vectors want to increase it vs decrease it (majority vote on sign).

Step 3: Disjoint Merge — Average only the task vectors that agree with the elected sign. Discard the rest.

def ties_merge(task_vectors, density=0.2, majority_sign_method="total"):
    """
    TIES-Merging: Trim, Elect Sign, Disjoint Merge.

    Args:
        task_vectors: list of dicts {param_name: delta_tensor}
        density: fraction of parameters to keep (top-k% by magnitude)
        majority_sign_method: "total" (sum magnitudes) or "count" (count votes)
    """
    merged_tv = {}

    # Get parameter names from first task vector
    param_names = list(task_vectors[0].keys())

    for name in param_names:
        deltas = [tv[name] for tv in task_vectors]
        n_tasks = len(deltas)

        # ===== Step 1: TRIM =====
        # Keep only top-k% by magnitude in each task vector
        trimmed = []
        for delta in deltas:
            threshold = delta.abs().quantile(1 - density)
            mask = delta.abs() >= threshold
            trimmed.append(delta * mask.float())

        # ===== Step 2: ELECT SIGN =====
        # For each parameter position, determine the majority sign
        stacked = torch.stack(trimmed)  # [n_tasks, *param_shape]

        if majority_sign_method == "total":
            # Weight by magnitude: sum of signed values
            sign_vote = stacked.sum(dim=0)
        else:
            # Count method: number of positive vs negative
            sign_vote = (stacked > 0).float().sum(dim=0) - \
                        (stacked < 0).float().sum(dim=0)

        elected_sign = torch.sign(sign_vote)
        # Where sign_vote is 0, default to positive
        elected_sign[elected_sign == 0] = 1.0

        # ===== Step 3: DISJOINT MERGE =====
        # Average only task vectors that agree with elected sign
        agree_sum = torch.zeros_like(deltas[0])
        agree_count = torch.zeros_like(deltas[0])

        for t in trimmed:
            # Check agreement: same sign as elected, and non-zero
            agrees = (torch.sign(t) == elected_sign) & (t != 0)
            agree_sum += t * agrees.float()
            agree_count += agrees.float()

        # Average the agreeing values
        merged_tv[name] = agree_sum / agree_count.clamp(min=1)

    return merged_tv


def apply_ties_merge(base_model, task_vectors, scaling_factor=1.0, density=0.2):
    """Apply TIES merge to produce final model."""
    merged_tv = ties_merge(task_vectors, density=density)
    merged_model = copy.deepcopy(base_model)

    for name, param in merged_model.named_parameters():
        param.data.add_(scaling_factor * merged_tv[name])

    return merged_model

3.3 Why Each Step Matters

def ablation_ties_steps(base_model, task_vectors, eval_fn):
    """Ablation: contribution of each TIES step."""
    results = {}

    # 1. Naive sum (no TIES)
    naive_tv = {}
    for name in task_vectors[0]:
        naive_tv[name] = sum(tv[name] for tv in task_vectors)
    naive_model = copy.deepcopy(base_model)
    for name, p in naive_model.named_parameters():
        p.data.add_(naive_tv[name])
    results['naive'] = eval_fn(naive_model)

    # 2. Trim only
    trimmed_tvs = []
    for tv in task_vectors:
        trimmed = {}
        for name, delta in tv.items():
            threshold = delta.abs().quantile(0.8)
            mask = delta.abs() >= threshold
            trimmed[name] = delta * mask.float()
        trimmed_tvs.append(trimmed)
    trim_tv = {}
    for name in task_vectors[0]:
        trim_tv[name] = sum(t[name] for t in trimmed_tvs) / len(trimmed_tvs)
    trim_model = copy.deepcopy(base_model)
    for name, p in trim_model.named_parameters():
        p.data.add_(trim_tv[name])
    results['trim_only'] = eval_fn(trim_model)

    # 3. Full TIES
    ties_tv = ties_merge(task_vectors, density=0.2)
    ties_model = copy.deepcopy(base_model)
    for name, p in ties_model.named_parameters():
        p.data.add_(ties_tv[name])
    results['full_ties'] = eval_fn(ties_model)

    for method, score in results.items():
        print(f"{method:15s}: {score:.4f}")

    return results
📊 TIES Ablation: Contribution of Each Step

| Method | Avg Benchmark Score | Improvement over Naive |
|--------|---------------------|------------------------|
| Naive sum | 58.3 | Baseline |
| Trim only (top 20%) | 61.7 | +3.4 |
| Trim + Sign election | 63.2 | +4.9 |
| Full TIES (trim + elect + disjoint) | 65.1 | +6.8 |

Note: Merging 3 LoRA adapters (code, math, chat) into Llama 3 8B. Trimming contributes most (removes noise). Sign election adds another 1.5 points. Disjoint averaging adds the final 1.9 points.

4. DARE: Drop And REscale

4.1 Concept

DARE (Yu et al., 2023) takes a different approach to the conflict problem. Instead of resolving conflicts, it prevents them by randomly dropping most parameters from each task vector before merging.

The key insight: task vectors are highly redundant. You can randomly zero out 90-99% of the entries and still retain most of the task knowledge, as long as you rescale the remaining entries to compensate.

$$\tilde{\tau}_i = \frac{m_i \odot \tau_i}{1 - p}$$

where $m_i$ is a binary mask with each entry independently set to 1 with probability $1 - p$ (i.e., Bernoulli($1 - p$)), and the $1/(1 - p)$ factor rescales the survivors to preserve the expected magnitude.

4.2 Implementation

def dare_sparsify(task_vector, drop_rate=0.9, rescale=True):
    """
    DARE: randomly drop parameters and rescale.

    Args:
        task_vector: dict of {param_name: delta_tensor}
        drop_rate: fraction of parameters to drop (0.9 = keep 10%)
        rescale: whether to rescale remaining parameters by 1/(1-p)
    """
    sparse_tv = {}
    keep_rate = 1 - drop_rate

    for name, delta in task_vector.items():
        # Generate random binary mask
        mask = torch.bernoulli(
            torch.full_like(delta, keep_rate)
        )

        # Apply mask and rescale
        if rescale:
            sparse_tv[name] = (delta * mask) / keep_rate
        else:
            sparse_tv[name] = delta * mask

    return sparse_tv


def dare_merge(base_model, task_vectors, drop_rate=0.9, scaling_factor=1.0):
    """Merge using DARE: sparsify each task vector, then average."""
    # Sparsify each task vector independently
    sparse_tvs = [dare_sparsify(tv, drop_rate) for tv in task_vectors]

    # Simple average of sparsified task vectors
    merged_tv = {}
    for name in task_vectors[0]:
        merged_tv[name] = sum(stv[name] for stv in sparse_tvs) / len(sparse_tvs)

    # Apply to base model
    merged_model = copy.deepcopy(base_model)
    for name, param in merged_model.named_parameters():
        param.data.add_(scaling_factor * merged_tv[name])

    return merged_model

4.3 Why DARE Works: The Dropout Connection

DARE is conceptually related to dropout. During training, dropout randomly zeros neurons and rescales the survivors. This works because the expected output is unchanged (the rescaling compensates for the dropped values). DARE applies the same principle to task vectors: the expected merged task vector is the same as the full sum, but with much less interference between tasks.
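
The expectation-preserving claim is easy to check numerically. A minimal Monte Carlo sketch for a single task-vector entry (the entry value and drop rate are made up):

```python
import random

random.seed(0)

p = 0.9           # drop rate
keep = 1 - p
tau = 0.04        # one task-vector entry
n = 200_000       # Monte Carlo samples

# DARE applied to one scalar: drop with probability p, rescale survivors by 1/(1-p)
samples = [tau / keep if random.random() < keep else 0.0 for _ in range(n)]
empirical_mean = sum(samples) / n

# The empirical mean is close to tau, even though ~90% of samples are zero
print(f"E[dare(tau)] ~ {empirical_mean:.4f}, tau = {tau}")
```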

The critical difference from dropout: DARE drops different parameters for each task vector. If task A keeps parameters {1, 5, 9} and task B keeps parameters {2, 5, 7}, only parameter 5 has a potential conflict. With a 90% drop rate and 3 tasks, the expected fraction of parameters where any two tasks conflict is:

$$P(\text{conflict at position } i) = 1 - \left(1 - (1-p)^2\right)^{\binom{k}{2}}$$

For $p = 0.9$ and $k = 3$: $P = 1 - (1 - 0.01)^3 = 0.0297$. Only about 3% of parameters have any conflict at all.

def analyze_dare_conflict_rate(n_tasks, drop_rates):
    """Compute expected conflict rate for different DARE configurations."""
    for p in drop_rates:
        keep = 1 - p
        # Probability that 2+ tasks keep the same parameter:
        # P(at least 2 keep) = 1 - P(0 keep) - P(exactly 1 keeps)
        p_zero = p ** n_tasks
        p_one = n_tasks * keep * (p ** (n_tasks - 1))
        p_conflict = 1 - p_zero - p_one

        print(f"p={p:g}, k={n_tasks}: "
              f"conflict_rate={p_conflict:.4f}, "
              f"effective_params_per_task={keep:.1%}")

analyze_dare_conflict_rate(3, [0.5, 0.7, 0.9, 0.95, 0.99])
# p=0.5, k=3: conflict_rate=0.5000, effective_params_per_task=50.0%
# p=0.7, k=3: conflict_rate=0.2160, effective_params_per_task=30.0%
# p=0.9, k=3: conflict_rate=0.0280, effective_params_per_task=10.0%
# p=0.95, k=3: conflict_rate=0.0073, effective_params_per_task=5.0%
# p=0.99, k=3: conflict_rate=0.0003, effective_params_per_task=1.0%

4.4 DARE + TIES

DARE and TIES are complementary. DARE reduces conflicts through sparsification. TIES resolves any remaining conflicts through sign election. Combining them:

def dare_ties_merge(task_vectors, drop_rate=0.9, density=0.2, scaling_factor=1.0):
    """Combined DARE + TIES merge."""
    # Step 1: DARE sparsification
    sparse_tvs = [dare_sparsify(tv, drop_rate) for tv in task_vectors]

    # Step 2: TIES merge on the sparsified task vectors
    merged_tv = ties_merge(sparse_tvs, density=density)

    return merged_tv
📊 Merge Method Comparison (3 LoRA Adapters, Llama 3 8B)

| Method | Code (HumanEval) | Math (GSM8K) | Chat (MT-Bench) | Average |
|--------|------------------|--------------|-----------------|---------|
| Linear Average | 41.5 | 48.2 | 6.8 | 48.8 |
| Task Arithmetic (lambda=0.8) | 51.2 | 61.8 | 7.5 | 60.2 |
| TIES (density=0.2) | 54.3 | 64.1 | 7.8 | 63.1 |
| DARE (p=0.9) | 53.8 | 63.5 | 7.7 | 62.5 |
| DARE + TIES | 55.1 | 65.3 | 7.9 | 64.1 |
| Individual code model | 58.5 | 42.1 | 6.2 | 53.6 |
| Individual math model | 38.2 | 68.7 | 6.5 | 53.5 |

Note: Merged models outperform individual models on average by combining capabilities. DARE + TIES achieves the best overall score. Individual models are better at their specific task but worse at others.

5. Evolutionary Merging

5.1 The Per-Layer Coefficient Problem

All methods above use a single scaling factor for the entire model. But different layers might need different merge strengths. Early layers (embeddings, first few transformer blocks) are often more sensitive to perturbation. Late layers (near the output head) encode task-specific features that should be preserved more strongly.

Evolutionary merging optimizes a coefficient vector $\alpha \in \mathbb{R}^{L}$, where $L$ is the number of layers, to maximize performance on a validation set:

$$W_{\text{merged}}^{(l)} = W_{\text{base}}^{(l)} + \alpha_l \sum_i w_i \tau_i^{(l)}$$

where $\alpha_l$ is the per-layer scaling factor and $w_i$ are per-task weights.

5.2 CMA-ES Optimization

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is well-suited for this optimization because the search space is low-dimensional (one coefficient per layer), the objective is noisy (evaluation on a finite dataset), and no gradients are available.

import numpy as np

class CMAESMergeOptimizer:
    """Optimize per-layer merge coefficients using CMA-ES."""

    def __init__(self, n_layers, n_tasks, population_size=20, sigma0=0.3):
        self.n_layers = n_layers
        self.n_tasks = n_tasks
        # Search space: n_layers * n_tasks coefficients
        self.dim = n_layers * n_tasks
        self.pop_size = population_size
        self.sigma = sigma0

        # CMA-ES state
        self.mean = np.ones(self.dim) * 0.5  # Start at 0.5 for all
        self.cov = np.eye(self.dim) * sigma0 ** 2
        self.best_score = -float('inf')
        self.best_coefficients = None

    def ask(self):
        """Generate a population of candidate coefficient vectors."""
        population = np.random.multivariate_normal(
            self.mean, self.cov, size=self.pop_size
        )
        # Clip to [0, 2] range
        population = np.clip(population, 0, 2)
        return population

    def tell(self, population, scores):
        """Update CMA-ES state based on evaluated scores."""
        # Sort by score (descending)
        order = np.argsort(scores)[::-1]
        sorted_pop = population[order]
        sorted_scores = scores[order]

        # Update best
        if sorted_scores[0] > self.best_score:
            self.best_score = sorted_scores[0]
            self.best_coefficients = sorted_pop[0].copy()

        # Weighted mean update (top half)
        n_elite = self.pop_size // 2
        weights = np.log(n_elite + 0.5) - np.log(np.arange(1, n_elite + 1))
        weights /= weights.sum()

        self.mean = np.sum(
            weights[:, None] * sorted_pop[:n_elite], axis=0
        )

        # Simplified covariance update
        diffs = sorted_pop[:n_elite] - self.mean
        self.cov = np.sum(
            weights[:, None, None] * (diffs[:, :, None] * diffs[:, None, :]),
            axis=0,
        )
        self.cov += 1e-6 * np.eye(self.dim)  # Regularization

    def get_layer_coefficients(self, flat_coefficients):
        """Reshape flat coefficient vector into per-layer, per-task format."""
        return flat_coefficients.reshape(self.n_layers, self.n_tasks)


def evolutionary_merge(
    base_model, task_vectors, eval_fn,
    n_generations=50, population_size=20, device='cuda',
):
    """Find optimal per-layer merge coefficients using CMA-ES."""
    n_layers = count_layers(base_model)
    n_tasks = len(task_vectors)

    optimizer = CMAESMergeOptimizer(n_layers, n_tasks, population_size)

    for gen in range(n_generations):
        population = optimizer.ask()
        scores = np.zeros(len(population))

        for i, coefficients in enumerate(population):
            layer_coeffs = optimizer.get_layer_coefficients(coefficients)

            # Apply per-layer, per-task merge
            merged = apply_layerwise_merge(
                base_model, task_vectors, layer_coeffs
            )
            scores[i] = eval_fn(merged)

            del merged
            torch.cuda.empty_cache()

        optimizer.tell(population, scores)
        print(f"Gen {gen}: best={optimizer.best_score:.4f}, "
              f"mean={scores.mean():.4f}")

    return optimizer.best_coefficients


def count_layers(model):
    """Count transformer layers in a model."""
    if hasattr(model, 'model') and hasattr(model.model, 'layers'):
        return len(model.model.layers)
    if hasattr(model, 'transformer') and hasattr(model.transformer, 'h'):
        return len(model.transformer.h)
    raise ValueError("Cannot determine number of layers")


def apply_layerwise_merge(base_model, task_vectors, layer_coefficients):
    """Apply merge with per-layer, per-task coefficients."""
    merged = copy.deepcopy(base_model)

    for name, param in merged.named_parameters():
        # Determine which layer this parameter belongs to
        layer_idx = extract_layer_index(name)

        if layer_idx is not None and layer_idx < layer_coefficients.shape[0]:
            combined_delta = torch.zeros_like(param.data)
            for task_idx, tv in enumerate(task_vectors):
                coeff = layer_coefficients[layer_idx, task_idx]
                combined_delta += coeff * tv[name]
            param.data.add_(combined_delta)
        else:
            # Non-layer parameters (embeddings, final norm): use mean coefficient
            combined_delta = torch.zeros_like(param.data)
            for task_idx, tv in enumerate(task_vectors):
                coeff = layer_coefficients[:, task_idx].mean()
                combined_delta += coeff * tv[name]
            param.data.add_(combined_delta)

    return merged


def extract_layer_index(param_name):
    """Extract layer index from parameter name like 'model.layers.15.self_attn.q_proj.weight'."""
    import re
    match = re.search(r'layers?[._](\d+)', param_name)
    if match:
        return int(match.group(1))
    return None

5.3 Evolutionary Merge Results

Per-Layer Merge Coefficients (Evolved, 3 Tasks)

| Layer | Evolved coefficient |
|-------|---------------------|
| 0 (embed-adjacent) | alpha = 0.30 (conservative) |
| 8 (early) | alpha = 0.55 |
| 16 (middle) | alpha = 0.85 |
| 24 (late) | alpha = 0.95 (strong) |
| 31 (output-adjacent) | alpha = 0.70 (moderate) |

The evolved coefficients consistently show a pattern: low for early layers (preserve the base model's fundamental representations), high for middle and late layers (incorporate task-specific knowledge), and moderate for the final layer (avoid overshooting near the output head).

📊 Evolutionary vs Uniform Merge Coefficients

| Method | Coefficient Type | Avg Score | Search Cost (GPU-hours) |
|--------|------------------|-----------|-------------------------|
| Task arithmetic | Single global lambda | 60.2 | 0.1 (grid search) |
| TIES | Single global lambda | 63.1 | 0.1 |
| Evolutionary (per-layer) | L coefficients | 67.8 | 5-10 |
| Evolutionary (per-layer, per-task) | L x K coefficients | 68.5 | 20-50 |

Note: L=32 layers, K=3 tasks. Per-layer evolutionary search adds 4-5 points over uniform coefficients. Per-task adds another 0.7 points but costs 4x more search compute. The search cost is still negligible compared to training any of the individual models.

6. Implementation: Merging Two LoRA Adapters with TIES

LoRA adapters are the most common merge target because they are small (typically 0.1-1% of model parameters), already structured as task vectors (the adapter weights are the delta from the base model), and can be merged without loading the full model weights.

6.1 LoRA Background

A LoRA adapter replaces a weight matrix $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ with:

$$W' = W + \frac{\alpha}{r} B A$$

where $A \in \mathbb{R}^{r \times d_\text{in}}$, $B \in \mathbb{R}^{d_\text{out} \times r}$, $r$ is the rank (typically 8-64), and $\alpha$ is a scaling factor.

The task vector for a LoRA adapter is $\frac{\alpha}{r} B A$. Merging two LoRA adapters is equivalent to merging their task vectors.
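
One subtlety worth spelling out: merge the low-rank products, not the factors. Averaging the $A$ and $B$ matrices directly is not the same as averaging the deltas $BA$, as a toy rank-1 example shows (all numbers made up; plain nested lists stand in for weight tensors):

```python
def matmul(B, A):
    """Multiply two matrices given as nested lists."""
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

# Two rank-1 "adapters" (the alpha/r scaling is omitted for clarity)
B1, A1 = [[1.0], [0.0]], [[1.0, 0.0]]
B2, A2 = [[0.0], [1.0]], [[0.0, 1.0]]

# Correct: average the products B @ A
avg_of_products = [[(x + y) / 2 for x, y in zip(r1, r2)]
                   for r1, r2 in zip(matmul(B1, A1), matmul(B2, A2))]
# -> [[0.5, 0.0], [0.0, 0.5]]

# Wrong: average the factors, then multiply
B_avg = [[(x + y) / 2 for x, y in zip(r1, r2)] for r1, r2 in zip(B1, B2)]
A_avg = [[(x + y) / 2 for x, y in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
product_of_avgs = matmul(B_avg, A_avg)
# -> [[0.25, 0.25], [0.25, 0.25]]
```

This is why the implementation below first expands each adapter into its effective weight delta and only then merges.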

6.2 Complete Implementation

import torch
import json
import os
from collections import OrderedDict

class LoRAAdapter:
    """Represents a LoRA adapter with its A and B matrices."""

    def __init__(self, state_dict, rank, alpha, target_modules):
        self.state_dict = state_dict  # {name: tensor}
        self.rank = rank
        self.alpha = alpha
        self.target_modules = target_modules
        self.scaling = alpha / rank

    @classmethod
    def load(cls, path):
        """Load a LoRA adapter from disk."""
        adapter_weights = torch.load(
            os.path.join(path, "adapter_model.bin"), map_location="cpu"
        )
        with open(os.path.join(path, "adapter_config.json")) as f:
            config = json.load(f)

        return cls(
            state_dict=adapter_weights,
            rank=config["r"],
            alpha=config["lora_alpha"],
            target_modules=config["target_modules"],
        )

    def compute_task_vectors(self):
        """Compute the effective weight delta for each target module."""
        task_vectors = {}

        # Find all A/B pairs
        a_matrices = {}
        b_matrices = {}
        for name, tensor in self.state_dict.items():
            if "lora_A" in name:
                base_name = name.replace(".lora_A.weight", "")
                a_matrices[base_name] = tensor
            elif "lora_B" in name:
                base_name = name.replace(".lora_B.weight", "")
                b_matrices[base_name] = tensor

        # Compute delta = (alpha/r) * B @ A
        for base_name in a_matrices:
            A = a_matrices[base_name]   # [r, d_in]
            B = b_matrices[base_name]   # [d_out, r]
            delta = self.scaling * (B @ A)  # [d_out, d_in]
            task_vectors[base_name] = delta

        return task_vectors


def ties_merge_lora(adapters, density=0.2, scaling_factor=1.0):
    """
    Merge multiple LoRA adapters using TIES method.

    Args:
        adapters: list of LoRAAdapter objects
        density: fraction of parameters to keep in trimming step
        scaling_factor: global scaling for the merged result

    Returns:
        dict of merged task vectors {module_name: merged_delta}
    """
    # Step 1: Compute task vectors for each adapter
    all_task_vectors = [adapter.compute_task_vectors() for adapter in adapters]

    # Get all module names (union across all adapters)
    all_modules = set()
    for tvs in all_task_vectors:
        all_modules.update(tvs.keys())

    merged_deltas = {}

    for module_name in sorted(all_modules):
        # Collect deltas for this module from all adapters
        deltas = []
        for tvs in all_task_vectors:
            if module_name in tvs:
                deltas.append(tvs[module_name])
            else:
                # Adapter doesn't modify this module -- use a zero delta,
                # shaped like the delta from an adapter that does
                reference = next(
                    tv[module_name] for tv in all_task_vectors
                    if module_name in tv
                )
                deltas.append(torch.zeros_like(reference))

        n_tasks = len(deltas)

        # ===== TRIM =====
        trimmed = []
        for delta in deltas:
            if delta.abs().max() == 0:
                trimmed.append(delta)
                continue
            threshold = delta.abs().quantile(1 - density)
            mask = delta.abs() >= threshold
            trimmed.append(delta * mask.float())

        # ===== ELECT SIGN =====
        stacked = torch.stack(trimmed)  # [n_tasks, d_out, d_in]
        sign_vote = stacked.sum(dim=0)  # signed sum: the sign with more total mass wins
        elected_sign = torch.sign(sign_vote)
        elected_sign[elected_sign == 0] = 1.0

        # ===== DISJOINT MERGE =====
        agree_sum = torch.zeros_like(deltas[0])
        agree_count = torch.zeros_like(deltas[0])

        for t in trimmed:
            agrees = (torch.sign(t) == elected_sign) & (t != 0)
            agree_sum += t * agrees.float()
            agree_count += agrees.float()

        merged_delta = agree_sum / agree_count.clamp(min=1)
        merged_deltas[module_name] = scaling_factor * merged_delta

    return merged_deltas


def apply_merged_lora_to_model(base_model, merged_deltas):
    """Apply merged LoRA deltas to the base model weights."""
    merged_model = copy.deepcopy(base_model)
    applied = 0

    for name, param in merged_model.named_parameters():
        # Map the parameter name to the LoRA module name. Depending on
        # how the adapter was saved, a PEFT prefix such as
        # "base_model.model." may also need stripping before the names
        # line up with the task-vector keys.
        module_name = name.replace(".weight", "")
        if module_name in merged_deltas:
            param.data.add_(merged_deltas[module_name])
            applied += 1

    print(f"Applied merged deltas to {applied} parameters")
    return merged_model

6.3 End-to-End LoRA Merge Example

def merge_lora_adapters_example():
    """Complete example: merge code and math LoRA adapters."""
    from transformers import AutoModelForCausalLM

    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3-8B",
        torch_dtype=torch.bfloat16,
        device_map="cpu",  # CPU for merging (memory)
    )

    # Load LoRA adapters
    code_adapter = LoRAAdapter.load("./code-lora-adapter")
    math_adapter = LoRAAdapter.load("./math-lora-adapter")

    print(f"Code adapter: rank={code_adapter.rank}, "
          f"alpha={code_adapter.alpha}, "
          f"scaling={code_adapter.scaling:.4f}")
    print(f"Math adapter: rank={math_adapter.rank}, "
          f"alpha={math_adapter.alpha}, "
          f"scaling={math_adapter.scaling:.4f}")

    # Analyze task vector properties
    code_tvs = code_adapter.compute_task_vectors()
    math_tvs = math_adapter.compute_task_vectors()

    for name in list(code_tvs.keys())[:3]:
        code_delta = code_tvs[name]
        math_delta = math_tvs[name]

        # Check sign agreement
        code_sign = torch.sign(code_delta)
        math_sign = torch.sign(math_delta)
        agree = (code_sign == math_sign).float().mean()

        # Check magnitude
        code_mag = code_delta.abs().mean()
        math_mag = math_delta.abs().mean()

        print(f"\n{name}:")
        print(f"  Code delta magnitude: {code_mag:.6f}")
        print(f"  Math delta magnitude: {math_mag:.6f}")
        print(f"  Sign agreement: {agree:.2%}")

    # Merge with TIES
    print("\nMerging with TIES (density=0.2)...")
    merged_deltas = ties_merge_lora(
        [code_adapter, math_adapter],
        density=0.2,
        scaling_factor=0.8,
    )

    # Apply to base model
    merged_model = apply_merged_lora_to_model(base_model, merged_deltas)

    # Save merged model
    merged_model.save_pretrained("llama-8b-code-math-merged")
    print("Merged model saved.")

    return merged_model

6.4 Merge Quality Diagnostics

def diagnose_merge_quality(base_model, merged_model, individual_models, eval_datasets):
    """Comprehensive merge quality analysis."""
    results = {'base': {}, 'merged': {}}
    for name in individual_models:
        results[name] = {}

    for task_name, dataset in eval_datasets.items():
        # Evaluate base
        results['base'][task_name] = evaluate(base_model, dataset)

        # Evaluate individual fine-tuned models
        for model_name, model in individual_models.items():
            results[model_name][task_name] = evaluate(model, dataset)

        # Evaluate merged
        results['merged'][task_name] = evaluate(merged_model, dataset)

    # Compute merge efficiency metrics
    print("\n=== Merge Quality Report ===")
    for task_name in eval_datasets:
        base_score = results['base'][task_name]
        merged_score = results['merged'][task_name]
        best_individual = max(
            results[name][task_name] for name in individual_models
        )

        improvement_over_base = merged_score - base_score
        retention_of_best = merged_score / best_individual * 100

        print(f"\n{task_name}:")
        print(f"  Base: {base_score:.2f}")
        print(f"  Best individual: {best_individual:.2f}")
        print(f"  Merged: {merged_score:.2f}")
        print(f"  Improvement over base: +{improvement_over_base:.2f}")
        print(f"  Retention of best: {retention_of_best:.1f}%")

    # Average across tasks
    merged_avg = sum(results['merged'].values()) / len(results['merged'])
    individual_avgs = {
        name: sum(scores.values()) / len(scores)
        for name, scores in results.items()
        if name not in ('base', 'merged')
    }
    best_avg = max(individual_avgs.values())

    print(f"\n=== Summary ===")
    print(f"Merged average: {merged_avg:.2f}")
    print(f"Best individual average: {best_avg:.2f}")
    print(f"Merged outperforms best individual by: "
          f"{merged_avg - best_avg:+.2f}")

7. Practical Guidelines

7.1 Decision Tree

def recommend_merge_method(n_models, same_base, compute_budget_hours):
    """Recommend a merge method based on constraints."""
    if not same_base:
        print("ERROR: Cannot merge models with different base architectures.")
        print("Retrain from a common base or use ensemble instead.")
        return None

    if n_models == 2:
        if compute_budget_hours < 0.1:
            return "Linear averaging (alpha=0.5) or task arithmetic (lambda=0.8)"
        elif compute_budget_hours < 1:
            return "TIES (density=0.2, sweep lambda in [0.5, 1.0])"
        else:
            return "Evolutionary per-layer merge"

    elif n_models <= 5:
        if compute_budget_hours < 0.5:
            return "DARE (p=0.9) + simple average"
        elif compute_budget_hours < 5:
            return "DARE + TIES (density=0.2)"
        else:
            return "Evolutionary per-layer, per-task merge"

    else:  # Many models
        return "DARE (p=0.95) + TIES -- high sparsification essential"

7.2 Hyperparameter Defaults

Recommended Hyperparameters by Method

| Method          | Key Parameter   | Default Value | Search Range |
|-----------------|-----------------|---------------|--------------|
| Linear Average  | alpha           | 0.5           | [0.3, 0.7]   |
| Task Arithmetic | lambda          | 0.8           | [0.5, 1.2]   |
| TIES            | density         | 0.2           | [0.1, 0.5]   |
| TIES            | scaling_factor  | 1.0           | [0.5, 1.5]   |
| DARE            | drop_rate       | 0.9           | [0.8, 0.99]  |
| DARE            | scaling_factor  | 1.0           | [0.5, 1.5]   |
| Evolutionary    | population_size | 20            | [10, 50]     |
| Evolutionary    | n_generations   | 50            | [30, 100]    |
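Given these search ranges, a coarse grid sweep is usually enough to pick merge hyperparameters. A minimal sketch of the sweep loop; `eval_merge` is a hypothetical stand-in that in practice would build the merged model for those settings and score it on a held-out dev set (here it is a toy objective peaking at the TIES defaults, purely for illustration):

```python
import itertools

def eval_merge(density, scaling):
    """Hypothetical stub: score a TIES merge with these hyperparameters.

    A real version would run ties_merge_lora(..., density, scaling),
    apply the result to the base model, and evaluate on a dev set.
    This toy surface peaks at the table's defaults (0.2, 1.0).
    """
    return -((density - 0.2) ** 2) - ((scaling - 1.0) ** 2)

# Grid points drawn from the search ranges in the table above
densities = [0.1, 0.2, 0.3, 0.5]
scalings = [0.5, 0.8, 1.0, 1.2, 1.5]

best = max(itertools.product(densities, scalings),
           key=lambda ps: eval_merge(*ps))
print(best)  # (0.2, 1.0)
```

With only two hyperparameters and cheap merges, exhaustive grid search is typically faster and more reliable than anything fancier.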

7.3 Common Pitfalls

  1. Forgetting to subtract the base: Task arithmetic requires computing $W_{\text{ft}} - W_{\text{base}}$. Using absolute weights instead of deltas produces garbage.

  2. Merging with different tokenizers: Even if two models share the same architecture, different tokenizers mean different embedding matrices. The merge will silently produce wrong token mappings.

  3. Ignoring non-layer parameters: Embeddings, final layer norm, and the LM head are not part of any transformer layer. They need separate merge handling (usually conservative averaging).

  4. Too many models: Each additional model in the merge increases conflicts. Beyond 5-6 models, even DARE + TIES struggles. Use DARE with very high drop rates ($p > 0.95$) or cluster models by task similarity before merging.
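The first pitfall is easy to demonstrate with toy tensors (synthetic weights, not real checkpoints): summing absolute fine-tuned weights injects extra copies of the base, while summing deltas perturbs the base as intended.

```python
import torch

torch.manual_seed(0)
W_base = torch.randn(4, 4)
W_ft1 = W_base + 0.01 * torch.randn(4, 4)  # light fine-tune 1
W_ft2 = W_base + 0.01 * torch.randn(4, 4)  # light fine-tune 2

# Wrong: "task arithmetic" on absolute weights. This equals
# 3 * W_base + deltas -- the base weights are tripled.
wrong = W_base + (W_ft1 + W_ft2)

# Right: task arithmetic on deltas relative to the base
right = W_base + (W_ft1 - W_base) + (W_ft2 - W_base)

print((wrong - W_base).norm())  # large: dominated by 2 * W_base
print((right - W_base).norm())  # small: just the fine-tune deltas
```

The wrong version does not merely degrade quality; the resulting weights are far outside the basin the base model was trained in, and the model typically emits gibberish.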

🚨 Merging Is Not Free

Model merging has no training cost, but it has a quality cost. A merged model almost never matches a single model fine-tuned on the combined dataset. Merging is a trade-off: zero compute cost in exchange for 5-15% quality degradation versus dedicated training. Use merging when you cannot afford to retrain, or when you need to quickly combine capabilities from multiple sources.


8. Advanced: SLERP and Model Soups

8.1 Spherical Linear Interpolation (SLERP)

Linear interpolation moves along a straight line in weight space. SLERP moves along the surface of a hypersphere, preserving the magnitude of the weight vectors:

$$W_{\text{merged}} = \frac{\sin((1-\alpha)\theta)}{\sin\theta}\, W_A + \frac{\sin(\alpha\theta)}{\sin\theta}\, W_B$$

where $\theta = \arccos\left(\frac{W_A \cdot W_B}{\|W_A\| \, \|W_B\|}\right)$ is the angle between the flattened weight vectors, and $\alpha$ interpolates from $W_A$ ($\alpha = 0$) to $W_B$ ($\alpha = 1$).

def slerp_merge(model_a, model_b, alpha=0.5):
    """Spherical linear interpolation of model weights."""
    merged = copy.deepcopy(model_a)

    for (name, param_a), (_, param_b) in zip(
        model_a.named_parameters(), model_b.named_parameters()
    ):
        flat_a = param_a.data.flatten().float()
        flat_b = param_b.data.flatten().float()

        # Normalize
        norm_a = flat_a.norm()
        norm_b = flat_b.norm()

        if norm_a < 1e-8 or norm_b < 1e-8:
            # Degenerate case: fall back to linear interpolation
            result = (1 - alpha) * param_a.data + alpha * param_b.data
        else:
            unit_a = flat_a / norm_a
            unit_b = flat_b / norm_b

            # Angle between vectors
            cos_theta = torch.clamp(torch.dot(unit_a, unit_b), -1, 1)
            theta = torch.acos(cos_theta)

            if theta.abs() < 1e-6:
                # Nearly parallel: linear interpolation
                result = (1 - alpha) * param_a.data + alpha * param_b.data
            else:
                # SLERP
                sin_theta = torch.sin(theta)
                w_a = torch.sin((1 - alpha) * theta) / sin_theta
                w_b = torch.sin(alpha * theta) / sin_theta
                result_flat = w_a * flat_a + w_b * flat_b
                result = result_flat.reshape(param_a.shape).to(param_a.dtype)

        merged.state_dict()[name].copy_(result)

    return merged
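A two-dimensional toy case makes the norm-preservation property concrete: at the midpoint between two orthogonal unit vectors, linear interpolation shrinks the norm to roughly 0.707, while SLERP stays on the unit circle.

```python
import torch

a = torch.tensor([1.0, 0.0])
b = torch.tensor([0.0, 1.0])
alpha = 0.5

# Linear midpoint loses magnitude
linear = (1 - alpha) * a + alpha * b
print(linear.norm())  # ~0.7071

# SLERP midpoint keeps unit norm
theta = torch.acos(torch.clamp(torch.dot(a, b) / (a.norm() * b.norm()), -1, 1))
w_a = torch.sin((1 - alpha) * theta) / torch.sin(theta)
w_b = torch.sin(alpha * theta) / torch.sin(theta)
slerp = w_a * a + w_b * b
print(slerp.norm())  # ~1.0
```

For nearly parallel model weights the difference is small, which is why SLERP matters most when merging checkpoints that have drifted far apart.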

8.2 Model Soups

Model Soups (Wortsman et al., 2022) average multiple fine-tuning runs of the same model on the same data with different hyperparameters. This is distinct from task merging: it merges for robustness rather than capability combination.
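The greedy soup below relies on a `multi_model_average` helper that is not defined in this article; a minimal sketch, assuming all models are structurally identical copies of the same architecture:

```python
import copy
import torch

def multi_model_average(models):
    """Uniformly average the parameters of structurally identical models."""
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in avg.named_parameters():
            # Stack the corresponding parameter from every model: [n, ...]
            stacked = torch.stack(
                [dict(m.named_parameters())[name].data for m in models]
            )
            param.data.copy_(stacked.mean(dim=0))
    return avg
```

Uniform averaging is all the soup recipe needs; the greedy loop decides which models enter the average.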

def greedy_model_soup(models, eval_fn):
    """
    Greedy Model Soup: iteratively add models that improve the average.

    Start with the best individual model. Try adding each remaining model.
    Keep the addition that improves quality the most. Repeat until no
    improvement is possible.
    """
    # Find best individual model
    individual_scores = [(i, eval_fn(m)) for i, m in enumerate(models)]
    individual_scores.sort(key=lambda x: x[1], reverse=True)

    soup_indices = [individual_scores[0][0]]
    current_soup = copy.deepcopy(models[soup_indices[0]])
    current_score = individual_scores[0][1]

    remaining = [i for i, _ in individual_scores[1:]]

    while remaining:
        best_candidate = None
        best_new_score = current_score

        for idx in remaining:
            # Try adding this model to the soup
            trial_models = [models[i] for i in soup_indices] + [models[idx]]
            trial_soup = multi_model_average(trial_models)
            trial_score = eval_fn(trial_soup)

            if trial_score > best_new_score:
                best_new_score = trial_score
                best_candidate = idx

        if best_candidate is not None:
            soup_indices.append(best_candidate)
            remaining.remove(best_candidate)
            current_soup = multi_model_average([models[i] for i in soup_indices])
            current_score = best_new_score
            print(f"Added model {best_candidate}, "
                  f"soup size={len(soup_indices)}, "
                  f"score={current_score:.4f}")
        else:
            print("No improvement possible. Stopping.")
            break

    return current_soup, soup_indices

9. Benchmarking Merge Methods at Scale

def comprehensive_merge_benchmark(
    base_model_name, adapter_paths, eval_datasets, device='cuda',
):
    """Run all merge methods and compare results."""
    from transformers import AutoModelForCausalLM

    # Load base model
    base = AutoModelForCausalLM.from_pretrained(
        base_model_name, torch_dtype=torch.bfloat16, device_map=device
    )

    # Load adapters
    adapters = [LoRAAdapter.load(p) for p in adapter_paths]

    # Compute task vectors
    task_vectors = [a.compute_task_vectors() for a in adapters]

    methods = {
        'Simple Average': lambda tvs: {
            k: sum(tv[k] for tv in tvs) / len(tvs) for k in tvs[0]
        },
        'Task Arithmetic (0.8)': lambda tvs: {
            k: 0.8 * sum(tv[k] for tv in tvs) for k in tvs[0]
        },
        'TIES (d=0.2)': lambda tvs: ties_merge(tvs, density=0.2),
        'DARE (p=0.9)': lambda tvs: {
            k: sum(dare_sparsify(tv, 0.9)[k] for tv in tvs) / len(tvs)
            for k in tvs[0]
        },
        'DARE+TIES': lambda tvs: dare_ties_merge(tvs, drop_rate=0.9, density=0.2),
    }

    all_results = {}
    for method_name, merge_fn in methods.items():
        print(f"\n{'='*60}")
        print(f"Method: {method_name}")
        merged_tv = merge_fn(task_vectors)

        # Apply to base model
        merged_model = copy.deepcopy(base)
        for name, param in merged_model.named_parameters():
            module_name = name.replace(".weight", "")
            if module_name in merged_tv:
                param.data.add_(merged_tv[module_name])

        # Evaluate
        scores = {}
        for task_name, dataset in eval_datasets.items():
            score = evaluate(merged_model, dataset)
            scores[task_name] = score
            print(f"  {task_name}: {score:.2f}")

        scores['average'] = sum(scores.values()) / len(scores)
        all_results[method_name] = scores

        del merged_model
        torch.cuda.empty_cache()

    # Print comparison table
    print(f"\n{'='*60}")
    print("SUMMARY")
    for method_name, scores in sorted(
        all_results.items(), key=lambda x: x[1]['average'], reverse=True
    ):
        print(f"  {method_name:25s}: avg={scores['average']:.2f}")

    return all_results

Merge Method Quality Ranking (Average Across Tasks)

| Method                                 | Average Score |
|----------------------------------------|---------------|
| Evolutionary (per-layer, with search)  | 67.8          |
| DARE + TIES                            | 64.1          |
| TIES                                   | 63.1          |
| DARE                                   | 62.5          |
| Task Arithmetic                        | 60.2          |
| Linear Average                         | 48.8          |

10. Summary

Model merging is a zero-cost technique for combining capabilities from multiple fine-tuned models. The quality hierarchy is clear:

  1. Evolutionary per-layer merge: Best quality (+18% over linear averaging), but requires 5-50 GPU-hours of search
  2. DARE + TIES: Best automated method (+15% over linear averaging), no search required
  3. TIES: Strong default (+14%), handles sign conflicts
  4. DARE: Simple and effective (+14%), handles conflict through sparsification
  5. Task arithmetic: Reasonable baseline (+11%), requires lambda tuning
  6. Linear averaging: Simplest but weakest, works only when models are very similar

Key constraints to remember:

  • All models must share the same base architecture and tokenizer
  • Fine-tuning should be relatively light (LoRA or short full fine-tuning)
  • More than 5-6 models in a single merge degrades quality significantly
  • Merging is not a substitute for training on combined data (5-15% quality gap)

Worked Example: Tracing a TIES Merge by Hand

Challenge: Two LoRA adapters produce task vectors for the same weight matrix. Adapter A has delta values $[+0.3, -0.2, +0.1, -0.4, +0.05]$ and Adapter B has delta values $[-0.1, -0.3, +0.2, +0.1, -0.15]$. Apply the TIES merge with density = 0.4 (keep the top 40% by magnitude in each adapter).

Step 1 (Trim): Keep the top 40% = top 2 values per adapter, by magnitude.

Adapter A magnitudes: $[0.3, 0.2, 0.1, 0.4, 0.05]$. Top 2: positions 0 ($|+0.3|$) and 3 ($|-0.4|$). Trimmed A: $[+0.3, 0, 0, -0.4, 0]$.

Adapter B magnitudes: $[0.1, 0.3, 0.2, 0.1, 0.15]$. Top 2: positions 1 ($|-0.3|$) and 2 ($|+0.2|$). Trimmed B: $[0, -0.3, +0.2, 0, 0]$.

Step 2 (Elect Sign): Sum the trimmed values at each position.

  • Position 0: $+0.3 + 0 = +0.3 \to$ sign $+$
  • Position 1: $0 + (-0.3) = -0.3 \to$ sign $-$
  • Position 2: $0 + 0.2 = +0.2 \to$ sign $+$
  • Position 3: $-0.4 + 0 = -0.4 \to$ sign $-$
  • Position 4: $0 + 0 = 0 \to$ sign $+$ (default)

Step 3 (Disjoint Merge): Average the values that agree with the elected sign.

  • Position 0: elected $+$. A has $+0.3$ (agrees), B has $0$ (skip). Average $= +0.3 / 1 = +0.3$
  • Position 1: elected $-$. A has $0$ (skip), B has $-0.3$ (agrees). Average $= -0.3$
  • Position 2: elected $+$. A has $0$ (skip), B has $+0.2$ (agrees). Average $= +0.2$
  • Position 3: elected $-$. A has $-0.4$ (agrees), B has $0$ (skip). Average $= -0.4$
  • Position 4: no nonzero values. Result $= 0$

Result: TIES merged vector $= [+0.3, -0.3, +0.2, -0.4, 0]$.

Compare to the naive sum $[+0.2, -0.5, +0.3, -0.3, -0.1]$: TIES preserves the dominant contributor at each position rather than blending conflicting signals. Position 3 retains Adapter A's strong $-0.4$ instead of being diluted to $-0.3$ by Adapter B's opposing $+0.1$.
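The hand computation above can be checked mechanically. A standalone sketch of the three TIES steps applied to the two five-element vectors, using top-k rather than a quantile threshold so the trim is exact:

```python
import torch

deltas = [
    torch.tensor([0.3, -0.2, 0.1, -0.4, 0.05]),   # Adapter A
    torch.tensor([-0.1, -0.3, 0.2, 0.1, -0.15]),  # Adapter B
]
density = 0.4
k = int(density * deltas[0].numel())  # keep top 2 per adapter

# Trim: zero out everything outside the top-k magnitudes
trimmed = []
for d in deltas:
    keep = torch.topk(d.abs(), k).indices
    t = torch.zeros_like(d)
    t[keep] = d[keep]
    trimmed.append(t)

# Elect sign: majority by summed mass, ties default to +
stacked = torch.stack(trimmed)
elected = torch.sign(stacked.sum(dim=0))
elected[elected == 0] = 1.0

# Disjoint merge: average only sign-agreeing nonzero entries
agrees = (torch.sign(stacked) == elected) & (stacked != 0)
merged = (stacked * agrees).sum(dim=0) / agrees.sum(dim=0).clamp(min=1)

print(merged)  # tensor([ 0.3000, -0.3000,  0.2000, -0.4000,  0.0000])
```

This reproduces the hand-derived result exactly, including the zeroed position 4 where neither adapter survived trimming.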