Part 19 of 30 in the series Frontier Research 2025-2026

Fine-tune a language model on medical text and it gets better at medicine. Test it on the general knowledge it had before fine-tuning and performance drops 5-15%. Fine-tune it further on legal text and the medical knowledge degrades. This is catastrophic forgetting: the gradient updates that optimize for new data overwrite the parameters that encoded old knowledge. The more you specialize, the more you lose.

This is not a minor inconvenience. Production systems need models that can be continuously updated — absorbing new knowledge, adapting to new domains, fixing errors — without regressing on existing capabilities. The alternative is retraining from scratch every time, which costs millions of dollars and takes weeks.

This post covers the mechanisms of catastrophic forgetting, measurement frameworks, and the four major mitigation strategies: Elastic Weight Consolidation, replay buffers, parameter isolation via LoRA, and knowledge distillation.

Why Catastrophic Forgetting Happens

The Parameter Interference Problem

Neural networks store knowledge distributed across all parameters. There is no “medical knowledge” region and “legal knowledge” region — the same parameters contribute to both. When gradient descent optimizes for new data, it moves parameters in directions that improve the new loss, without considering what those movements do to the old loss.

import torch
import torch.nn as nn
import numpy as np
from copy import deepcopy

class ForgettingDemonstration:
    """
    Demonstrate catastrophic forgetting with a simple example.
    Train on Task A, then Task B, measure Task A performance.
    """

    def __init__(self, input_dim=100, hidden_dim=256, output_dim=10):
        self.model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def train_task(self, train_data, epochs=50, lr=1e-3):
        """Train on a single task."""
        optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()

        for epoch in range(epochs):
            for x, y in train_data:
                optimizer.zero_grad()
                output = self.model(x)
                loss = criterion(output, y)
                loss.backward()
                optimizer.step()

    def evaluate(self, test_data):
        """Evaluate accuracy on a test set."""
        self.model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for x, y in test_data:
                output = self.model(x)
                pred = output.argmax(dim=-1)
                correct += (pred == y).sum().item()
                total += y.size(0)
        self.model.train()
        return correct / total

    def demonstrate_forgetting(self, task_a_train, task_a_test,
                               task_b_train, task_b_test):
        """
        Show forgetting: train A, measure A, train B, measure A again.
        """
        # Train on Task A
        self.train_task(task_a_train, epochs=50)
        acc_a_after_a = self.evaluate(task_a_test)

        # Save parameters after Task A
        params_after_a = {n: p.detach().clone() for n, p in self.model.named_parameters()}

        # Train on Task B
        self.train_task(task_b_train, epochs=50)
        acc_a_after_b = self.evaluate(task_a_test)
        acc_b_after_b = self.evaluate(task_b_test)

        # Compute parameter drift
        params_after_b = {n: p.detach().clone() for n, p in self.model.named_parameters()}
        drift = {}
        for name in params_after_a:
            diff = (params_after_b[name] - params_after_a[name]).norm().item()
            original = params_after_a[name].norm().item()
            drift[name] = diff / max(original, 1e-8)

        return {
            'task_a_accuracy_after_training_a': acc_a_after_a,
            'task_a_accuracy_after_training_b': acc_a_after_b,
            'forgetting': acc_a_after_a - acc_a_after_b,
            'task_b_accuracy': acc_b_after_b,
            'parameter_drift': drift,
        }

Catastrophic Forgetting in LLM Fine-Tuning

Fine-Tuning Steps            0    500   1000   2000   5000   10000
New Domain (Medical)        32     58     71     79     84      86
Original Domain (General)   78     74     69     63     55      48
Average Across Domains      55     66     70     71   69.5      67

Measuring Forgetting in LLMs

class ForgettingBenchmark:
    """
    Systematic measurement of forgetting in language models.
    Evaluates on multiple benchmarks before and after fine-tuning.
    """

    def __init__(self, model, tokenizer, benchmarks):
        """
        Args:
            model: The language model
            tokenizer: Model's tokenizer
            benchmarks: Dict of benchmark_name -> evaluation function
        """
        self.model = model
        self.tokenizer = tokenizer
        self.benchmarks = benchmarks

    def measure_before(self):
        """Measure performance on all benchmarks before fine-tuning."""
        results = {}
        for name, eval_fn in self.benchmarks.items():
            score = eval_fn(self.model, self.tokenizer)
            results[name] = score
        return results

    def measure_after(self):
        """Measure performance after fine-tuning."""
        return self.measure_before()  # Same function, different model state

    def compute_forgetting_metrics(self, before, after):
        """
        Compute forgetting metrics.

        Returns a dict with:
        - per_benchmark: before/after scores and deltas per benchmark
        - average_forgetting: mean delta across benchmarks, i.e. the
          Backward Transfer (BWT) over the old tasks
        - worst_forgetting / best_retention: the extreme deltas
        """
        forgetting = {}
        for name in before:
            if name in after:
                forgetting[name] = {
                    'before': before[name],
                    'after': after[name],
                    'delta': after[name] - before[name],
                    'relative_change': (after[name] - before[name]) / max(before[name], 1e-8),
                }

        avg_forgetting = np.mean([
            f['delta'] for f in forgetting.values()
        ])

        return {
            'per_benchmark': forgetting,
            'average_forgetting': avg_forgetting,
            'worst_forgetting': min(f['delta'] for f in forgetting.values()),
            'best_retention': max(f['delta'] for f in forgetting.values()),
        }

Forgetting After Domain-Specific Fine-Tuning (Llama 2 7B)

Benchmark         Before FT   After Medical FT   After Legal FT   After Both
MMLU (General)    45.3%       41.8% (-3.5)       42.1% (-3.2)     38.7% (-6.6)
HellaSwag         76.2%       73.5% (-2.7)       74.1% (-2.1)     70.8% (-5.4)
ARC-Challenge     52.1%       49.8% (-2.3)       50.2% (-1.9)     47.5% (-4.6)
GSM8K (Math)      14.2%       12.1% (-2.1)       13.5% (-0.7)     10.8% (-3.4)
MedQA (Medical)   38.5%       62.3% (+23.8)      39.1% (+0.6)     45.7% (+7.2)
LegalBench        41.2%       42.0% (+0.8)       58.7% (+17.5)    49.3% (+8.1)
Note: Sequential fine-tuning on Medical then Legal causes compound forgetting. Medical accuracy drops from 62.3% (after medical FT) to 45.7% (after both FTs): the model loses 16.6 points of medical accuracy when learning legal.

Elastic Weight Consolidation (EWC)

The Idea

EWC (Kirkpatrick et al., 2017) adds a regularization term that penalizes changes to parameters that are important for previous tasks. Importance is measured by the Fisher Information Matrix: parameters with high Fisher information for Task A contribute significantly to Task A’s performance and should be changed minimally when learning Task B.

class EWC:
    """
    Elastic Weight Consolidation.

    Adds a penalty for changing parameters that are important for old tasks:
    L_total = L_new + (lambda/2) * sum_i F_i * (theta_i - theta_i*)^2

    Where F_i is the Fisher information of parameter i for the old task,
    and theta_i* is the parameter value after training on the old task.
    """

    def __init__(self, model, old_task_dataloader, ewc_lambda=1000.0):
        self.model = model
        self.ewc_lambda = ewc_lambda

        # Store reference parameters (after training on old task)
        self.reference_params = {
            n: p.clone().detach()
            for n, p in model.named_parameters()
            if p.requires_grad
        }

        # Compute Fisher Information Matrix (diagonal approximation)
        self.fisher = self._compute_fisher(old_task_dataloader)

    def _compute_fisher(self, dataloader, num_samples=1000):
        """
        Compute diagonal Fisher Information Matrix.

        Fisher = E[gradient^2] = E[(d log p(y|x, theta) / d theta)^2]

        Approximated by averaging squared gradients over data samples.
        """
        self.model.eval()
        fisher = {n: torch.zeros_like(p) for n, p in self.model.named_parameters()
                  if p.requires_grad}

        count = 0
        for batch in dataloader:
            if count >= num_samples:
                break

            input_ids = batch['input_ids'].to(next(self.model.parameters()).device)
            attention_mask = batch.get('attention_mask', None)
            if attention_mask is not None:
                attention_mask = attention_mask.to(input_ids.device)

            self.model.zero_grad()
            outputs = self.model(input_ids=input_ids,
                                attention_mask=attention_mask,
                                labels=input_ids)
            loss = outputs.loss
            loss.backward()

            for n, p in self.model.named_parameters():
                if p.requires_grad and p.grad is not None:
                    fisher[n] += p.grad.data.pow(2)

            count += input_ids.size(0)

        # Average
        for n in fisher:
            fisher[n] /= count

        self.model.train()
        return fisher

    def penalty(self):
        """
        Compute EWC penalty term.
        Add this to the new task loss during training.
        """
        penalty = 0.0
        for n, p in self.model.named_parameters():
            if n in self.reference_params and n in self.fisher:
                diff = p - self.reference_params[n]
                penalty += (self.fisher[n] * diff.pow(2)).sum()

        return (self.ewc_lambda / 2) * penalty

    def train_with_ewc(self, new_task_dataloader, optimizer, epochs=5):
        """Train on new task with EWC regularization."""
        self.model.train()

        for epoch in range(epochs):
            total_loss = 0.0
            total_ewc_penalty = 0.0
            n_batches = 0

            for batch in new_task_dataloader:
                input_ids = batch['input_ids'].to(
                    next(self.model.parameters()).device
                )
                attention_mask = batch.get('attention_mask', None)
                if attention_mask is not None:
                    attention_mask = attention_mask.to(input_ids.device)

                optimizer.zero_grad()

                outputs = self.model(input_ids=input_ids,
                                    attention_mask=attention_mask,
                                    labels=input_ids)
                new_loss = outputs.loss
                ewc_pen = self.penalty()

                total = new_loss + ewc_pen
                total.backward()

                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                optimizer.step()

                total_loss += new_loss.item()
                total_ewc_penalty += ewc_pen.item()
                n_batches += 1

            print(f"Epoch {epoch + 1}: "
                  f"task_loss={total_loss / n_batches:.4f}, "
                  f"ewc_penalty={total_ewc_penalty / n_batches:.4f}")
ℹ️ Note

The EWC lambda hyperparameter controls the tradeoff: higher lambda means less forgetting but slower learning of new tasks. In practice, lambda values of 100-10000 work well for LLMs. The optimal value depends on how dissimilar the new task is from old tasks — more dissimilar tasks require lower lambda to allow sufficient parameter movement.
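The lambda tradeoff can be seen directly in a toy sweep. The sketch below is purely illustrative (the two linear-regression "tasks", the tiny model, and the lambda values are all made up): it trains on task A, computes a per-sample diagonal Fisher, then trains on task B under the EWC penalty and measures how much task A performance survives at each lambda.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_task(w, n=256, noise=0.3):
    """A noisy linear-regression task with true weights w."""
    X = torch.randn(n, 8)
    return X, X @ w + noise * torch.randn(n, 1)

task_a = make_task(torch.randn(8, 1))
task_b = make_task(torch.randn(8, 1))

def train(model, X, y, penalty_fn=None, steps=300, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(X), y)
        if penalty_fn is not None:
            loss = loss + penalty_fn()
        loss.backward()
        opt.step()

def diag_fisher(model, X, y):
    """Per-sample squared gradients: a diagonal empirical Fisher proxy."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i in range(X.size(0)):
        model.zero_grad()
        F.mse_loss(model(X[i:i + 1]), y[i:i + 1]).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad ** 2
    return {n: f / X.size(0) for n, f in fisher.items()}

results = {}
for lam in [0.0, 10.0, 1000.0]:
    torch.manual_seed(1)  # same initialization for every lambda
    model = nn.Linear(8, 1)
    train(model, *task_a)
    fisher = diag_fisher(model, *task_a)
    ref = {n: p.detach().clone() for n, p in model.named_parameters()}

    def penalty(model=model, fisher=fisher, ref=ref, lam=lam):
        # L_EWC = (lambda/2) * sum_i F_i * (theta_i - theta_i*)^2
        return (lam / 2) * sum(
            (fisher[n] * (p - ref[n]) ** 2).sum()
            for n, p in model.named_parameters()
        )

    train(model, *task_b, penalty_fn=penalty)
    with torch.no_grad():
        results[lam] = F.mse_loss(model(task_a[0]), task_a[1]).item()

for lam, mse in results.items():
    print(f"lambda={lam:>6.0f}: task A MSE after task B = {mse:.3f}")
```

Higher lambda leaves task A closer to its original loss at the cost of a worse fit on task B, which is exactly the stability/plasticity dial described above.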

Online EWC: Accumulating Over Multiple Tasks

class OnlineEWC:
    """
    Online EWC: efficiently handle a sequence of tasks
    without storing Fisher for each task separately.

    Accumulates Fisher information across tasks using
    exponential moving average.
    """

    def __init__(self, model, ewc_lambda=1000.0, gamma=0.95):
        self.model = model
        self.ewc_lambda = ewc_lambda
        self.gamma = gamma  # Decay factor for old Fisher values

        self.accumulated_fisher = None
        self.reference_params = None

    def register_task(self, dataloader, num_samples=1000):
        """Register a completed task: update Fisher and reference params."""
        new_fisher = self._compute_fisher(dataloader, num_samples)

        if self.accumulated_fisher is None:
            self.accumulated_fisher = new_fisher
        else:
            # Exponential moving average
            for n in self.accumulated_fisher:
                self.accumulated_fisher[n] = (
                    self.gamma * self.accumulated_fisher[n] + new_fisher[n]
                )

        # Update reference parameters
        self.reference_params = {
            n: p.clone().detach()
            for n, p in self.model.named_parameters()
            if p.requires_grad
        }

    def _compute_fisher(self, dataloader, num_samples):
        """Same Fisher computation as standard EWC."""
        self.model.eval()
        fisher = {n: torch.zeros_like(p) for n, p in self.model.named_parameters()
                  if p.requires_grad}
        count = 0
        for batch in dataloader:
            if count >= num_samples:
                break
            input_ids = batch['input_ids'].to(next(self.model.parameters()).device)
            self.model.zero_grad()
            outputs = self.model(input_ids=input_ids, labels=input_ids)
            outputs.loss.backward()
            for n, p in self.model.named_parameters():
                if p.requires_grad and p.grad is not None:
                    fisher[n] += p.grad.data.pow(2)
            count += input_ids.size(0)
        for n in fisher:
            fisher[n] /= max(count, 1)
        self.model.train()
        return fisher

    def penalty(self):
        """EWC penalty using accumulated Fisher."""
        if self.accumulated_fisher is None or self.reference_params is None:
            return torch.tensor(0.0)

        penalty = 0.0
        for n, p in self.model.named_parameters():
            if n in self.reference_params and n in self.accumulated_fisher:
                diff = p - self.reference_params[n]
                penalty += (self.accumulated_fisher[n] * diff.pow(2)).sum()

        return (self.ewc_lambda / 2) * penalty

Replay Buffers

Experience Replay

Store a small fraction of old task data and mix it into new task training. This prevents the model from completely overwriting old knowledge.

import random
from collections import deque

class ReplayBuffer:
    """
    Experience replay buffer for continual learning.
    Stores representative examples from previous tasks.
    """

    def __init__(self, max_size=10000, selection_strategy="random"):
        self.max_size = max_size
        self.buffer = deque(maxlen=max_size)
        self.strategy = selection_strategy
        self.task_counts = {}  # Track examples per task

    def add_task_data(self, task_name, examples, num_to_store=None):
        """
        Add examples from a completed task to the buffer.

        Args:
            task_name: Name of the task
            examples: List of training examples
            num_to_store: How many to keep (default: proportional allocation)
        """
        if num_to_store is None:
            # Proportional allocation across tasks
            num_tasks = len(self.task_counts) + 1
            num_to_store = self.max_size // num_tasks

        if self.strategy == "random":
            selected = random.sample(examples, min(num_to_store, len(examples)))
        elif self.strategy == "diverse":
            selected = self._select_diverse(examples, num_to_store)
        elif self.strategy == "difficult":
            selected = self._select_difficult(examples, num_to_store)
        else:
            selected = examples[:num_to_store]

        for ex in selected:
            self.buffer.append({
                'task': task_name,
                'data': ex,
            })

        self.task_counts[task_name] = len(selected)

        # Rebalance if needed
        self._rebalance()

    def _select_diverse(self, examples, num_to_store):
        """Select diverse examples using k-means on embeddings."""
        # Placeholder: implement with embedding clustering
        return random.sample(examples, min(num_to_store, len(examples)))

    def _select_difficult(self, examples, num_to_store):
        """Select examples the model found most difficult (highest loss)."""
        # Placeholder: sort by loss and take hardest
        return random.sample(examples, min(num_to_store, len(examples)))

    def _rebalance(self):
        """Ensure equal representation of all tasks."""
        if not self.task_counts:
            return

        per_task = self.max_size // len(self.task_counts)
        task_examples = {}

        for item in self.buffer:
            task = item['task']
            if task not in task_examples:
                task_examples[task] = []
            task_examples[task].append(item)

        self.buffer.clear()
        for task, examples in task_examples.items():
            selected = examples[:per_task]
            self.buffer.extend(selected)

    def sample_batch(self, batch_size):
        """Sample a batch of replay examples."""
        if len(self.buffer) < batch_size:
            return list(self.buffer)
        return random.sample(list(self.buffer), batch_size)

class ReplayTrainer:
    """
    Train with replay buffer mixing.
    Each batch = new_task_examples + replay_examples.
    """

    def __init__(self, model, replay_buffer, replay_ratio=0.3):
        self.model = model
        self.replay_buffer = replay_buffer
        self.replay_ratio = replay_ratio  # 30% of each batch is replay

    def train_epoch(self, new_task_dataloader, optimizer, tokenizer):
        """Train one epoch with replay mixing."""
        self.model.train()
        total_loss = 0
        n_batches = 0

        for batch in new_task_dataloader:
            batch_size = batch['input_ids'].size(0)
            replay_size = max(1, int(batch_size * self.replay_ratio))

            # Get replay examples
            replay_examples = self.replay_buffer.sample_batch(replay_size)

            if replay_examples:
                replay_texts = [ex['data']['text'] for ex in replay_examples]
                replay_encoded = tokenizer(
                    replay_texts, return_tensors="pt",
                    truncation=True, max_length=512, padding=True,
                )

                # Compute loss on new data
                new_input_ids = batch['input_ids'].to(
                    next(self.model.parameters()).device
                )
                new_outputs = self.model(input_ids=new_input_ids,
                                        labels=new_input_ids)
                new_loss = new_outputs.loss

                # Compute loss on replay data
                replay_input_ids = replay_encoded['input_ids'].to(
                    next(self.model.parameters()).device
                )
                replay_outputs = self.model(input_ids=replay_input_ids,
                                           labels=replay_input_ids)
                replay_loss = replay_outputs.loss

                # Combined loss
                loss = (1 - self.replay_ratio) * new_loss + \
                       self.replay_ratio * replay_loss
            else:
                new_input_ids = batch['input_ids'].to(
                    next(self.model.parameters()).device
                )
                outputs = self.model(input_ids=new_input_ids,
                                    labels=new_input_ids)
                loss = outputs.loss

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item()
            n_batches += 1

        return total_loss / n_batches

LoRA as Continual Learning

Parameter Isolation by Design

LoRA (Low-Rank Adaptation) naturally prevents catastrophic forgetting by freezing the base model and training only small adapter matrices. Each task gets its own LoRA adapter; the base model is never modified.

class LoRAContinualLearning:
    """
    LoRA for continual learning: each task gets a separate adapter.
    Base model weights are frozen and never modified.
    """

    def __init__(self, base_model, rank=16, alpha=32):
        self.base_model = base_model
        self.rank = rank
        self.alpha = alpha
        self.adapters = {}  # task_name -> LoRA weights

        # Freeze base model
        for param in self.base_model.parameters():
            param.requires_grad = False

    def create_adapter(self, task_name):
        """Create a new LoRA adapter for a task."""
        adapter = {}
        for name, module in self.base_model.named_modules():
            if isinstance(module, nn.Linear) and ('q_proj' in name or 'v_proj' in name):
                in_features = module.in_features
                out_features = module.out_features

                adapter[name] = {
                    'A': nn.Parameter(torch.randn(in_features, self.rank) * 0.01),
                    'B': nn.Parameter(torch.zeros(self.rank, out_features)),
                }

        self.adapters[task_name] = adapter
        return adapter

    def get_merged_output(self, module_name, x, task_name):
        """
        Compute output = base(x) + (alpha/rank) * x @ A @ B
        """
        base_module = dict(self.base_model.named_modules())[module_name]
        base_output = base_module(x)

        if task_name in self.adapters and module_name in self.adapters[task_name]:
            adapter = self.adapters[task_name][module_name]
            lora_output = x @ adapter['A'] @ adapter['B']
            return base_output + (self.alpha / self.rank) * lora_output

        return base_output

    def train_adapter(self, task_name, dataloader, epochs=3, lr=1e-4):
        """Train a LoRA adapter for a specific task."""
        adapter = self.create_adapter(task_name)

        # Only optimize adapter parameters
        adapter_params = []
        for module_params in adapter.values():
            adapter_params.extend([module_params['A'], module_params['B']])

        optimizer = torch.optim.AdamW(adapter_params, lr=lr)

        for epoch in range(epochs):
            total_loss = 0
            n_batches = 0
            for batch in dataloader:
                optimizer.zero_grad()
                # Forward pass with adapter
                # (simplified; real implementation hooks into model forward)
                loss = self._forward_with_adapter(batch, task_name)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
                n_batches += 1

            print(f"Task {task_name}, Epoch {epoch + 1}: "
                  f"loss={total_loss / n_batches:.4f}")

    def _forward_with_adapter(self, batch, task_name):
        """Forward pass using the specified adapter. Placeholder."""
        return torch.tensor(0.0, requires_grad=True)

    def evaluate_all_tasks(self, task_dataloaders):
        """
        Evaluate on all tasks by switching adapters.
        No forgetting possible because base model is frozen.
        """
        results = {}
        for task_name, dataloader in task_dataloaders.items():
            if task_name in self.adapters:
                accuracy = self._evaluate_with_adapter(dataloader, task_name)
            else:
                accuracy = self._evaluate_base(dataloader)
            results[task_name] = accuracy
        return results

    def _evaluate_with_adapter(self, dataloader, task_name):
        """Evaluate with a specific adapter."""
        return 0.0  # Placeholder

    def _evaluate_base(self, dataloader):
        """Evaluate with base model only."""
        return 0.0  # Placeholder

Continual Learning Methods Comparison (5 Sequential Tasks)

Method                Avg New Task Acc   Avg Old Task Retention   Forgetting   Extra Memory
Naive Fine-Tuning     82.5%              54.3%                    -28.2%       0
EWC (lambda=1000)     78.1%              68.5%                    -13.8%       +100% (Fisher)
Replay Buffer (10%)   80.3%              71.2%                    -10.1%       +10% (data)
EWC + Replay          77.8%              74.5%                    -7.0%        +110%
LoRA (per-task)       79.5%              82.5%*                   0%*          +2% per task
Full Retrain          82.5%              82.5%                    0%           +100% compute
Note: * LoRA achieves zero forgetting on old tasks because it does not modify base model weights. The tradeoff: LoRA adapters must be stored and switched at inference time. EWC + Replay provides the best balance without requiring adapter switching.
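The adapter-switching tradeoff from the note above can be made concrete with a single switchable layer. This is a minimal hypothetical sketch (the names `SwitchableLoRALinear`, `add_task`, and `set_task` are made up for illustration, not the PEFT library API): a frozen base `nn.Linear` plus one low-rank (A, B) pair per task, selected at inference time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableLoRALinear(nn.Module):
    """A frozen base Linear plus one low-rank adapter per task."""

    def __init__(self, in_features, out_features, rank=4, alpha=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # base is never updated
        self.base.bias.requires_grad_(False)
        self.scaling = alpha / rank
        self.rank = rank
        self.adapters = nn.ModuleDict()          # task name -> {A, B}
        self.active = None

    def add_task(self, name):
        self.adapters[name] = nn.ParameterDict({
            'A': nn.Parameter(torch.randn(self.base.in_features, self.rank) * 0.01),
            'B': nn.Parameter(torch.zeros(self.rank, self.base.out_features)),
        })

    def set_task(self, name):
        self.active = name

    def forward(self, x):
        out = self.base(x)
        if self.active is not None:
            ad = self.adapters[self.active]
            out = out + self.scaling * (x @ ad['A'] @ ad['B'])
        return out

torch.manual_seed(0)
layer = SwitchableLoRALinear(16, 16)
w0 = layer.base.weight.detach().clone()

# Train one adapter per "task"; only that task's (A, B) receives gradients.
x = torch.randn(32, 16)
for task in ['medical', 'legal']:
    layer.add_task(task)
    layer.set_task(task)
    target = torch.randn(32, 16)
    opt = torch.optim.Adam(layer.adapters[task].parameters(), lr=1e-2)
    for _ in range(50):
        opt.zero_grad()
        F.mse_loss(layer(x), target).backward()
        opt.step()

# Zero forgetting on the base: its weights are bit-identical after both tasks.
print("base unchanged:", torch.equal(layer.base.weight, w0))
layer.set_task('medical')  # switching back restores the task's behavior
```

The cost is visible too: the per-task adapters must be stored, and `set_task` must be called with the right task identity at inference time.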

Knowledge Distillation for Continual Learning

The Teacher-Student Approach

class DistillationContinualLearning:
    """
    Use the model before fine-tuning as a teacher
    to prevent forgetting during fine-tuning.

    Loss = alpha * new_task_loss + (1-alpha) * distillation_loss
    """

    def __init__(self, student_model, alpha=0.5, temperature=2.0):
        self.student = student_model
        # Take a snapshot of the model before fine-tuning
        self.teacher = deepcopy(student_model)
        for param in self.teacher.parameters():
            param.requires_grad = False
        self.teacher.eval()

        self.alpha = alpha
        self.temperature = temperature

    def distillation_loss(self, student_logits, teacher_logits):
        """
        KL divergence between teacher and student distributions.

        Uses temperature scaling to soften distributions,
        making the student learn more from the teacher's
        uncertainty patterns.
        """
        student_log_probs = torch.log_softmax(
            student_logits / self.temperature, dim=-1
        )
        teacher_probs = torch.softmax(
            teacher_logits / self.temperature, dim=-1
        )

        kl_loss = nn.KLDivLoss(reduction='batchmean')(
            student_log_probs,
            teacher_probs,
        )

        # Scale by T^2 (standard distillation trick)
        return kl_loss * (self.temperature ** 2)

    def train_step(self, batch, optimizer):
        """One training step with distillation."""
        input_ids = batch['input_ids'].to(
            next(self.student.parameters()).device
        )
        labels = batch.get('labels', input_ids)

        # Student forward
        student_outputs = self.student(input_ids=input_ids, labels=labels)
        new_task_loss = student_outputs.loss
        student_logits = student_outputs.logits

        # Teacher forward (no grad)
        with torch.no_grad():
            teacher_outputs = self.teacher(input_ids=input_ids)
            teacher_logits = teacher_outputs.logits

        # Combined loss
        distill_loss = self.distillation_loss(student_logits, teacher_logits)
        total_loss = (self.alpha * new_task_loss +
                      (1 - self.alpha) * distill_loss)

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.student.parameters(), 1.0)
        optimizer.step()

        return {
            'total_loss': total_loss.item(),
            'task_loss': new_task_loss.item(),
            'distill_loss': distill_loss.item(),
        }
Performance

Knowledge distillation doubles the memory requirement during training (teacher + student) and adds approximately 30% to forward pass time. For large models (70B+), this can be prohibitive. A practical alternative: compute teacher logits offline and cache them, eliminating the need to keep the teacher in memory during training. For a 100K-example fine-tuning dataset with 512-token sequences and a 128K vocabulary, a full-logit cache holds roughly 6.5 trillion values (about 13 TB at fp16) — still large, but feasible with disk-based access, and far smaller if only the top-k teacher logits per position are kept.
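The sketch below illustrates this cached-teacher setup with a top-k truncation of the teacher distribution, a common distillation approximation. Shapes are toy, a random tensor stands in for the teacher, and the function name `cached_distill_loss` is made up:

```python
import torch
import torch.nn.functional as F

vocab, seq_len, k = 1000, 16, 8

# Offline pass: run the teacher once, keep only top-k logits per position.
teacher_logits = torch.randn(seq_len, vocab)     # stand-in for teacher output
topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
# torch.save({'vals': topk_vals.half(), 'idx': topk_idx}, 'teacher_cache.pt')

def cached_distill_loss(student_logits, topk_vals, topk_idx, temperature=2.0):
    # Teacher mass restricted to its cached top-k tokens, renormalized.
    t_probs = F.softmax(topk_vals / temperature, dim=-1)
    # Student log-probs over the full vocab, gathered at the cached indices.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    s_logp_topk = s_logp.gather(-1, topk_idx)
    # Cross-entropy against the truncated teacher, scaled by T^2 as usual.
    return -(t_probs * s_logp_topk).sum(dim=-1).mean() * temperature ** 2

student_logits = torch.randn(seq_len, vocab, requires_grad=True)
loss = cached_distill_loss(student_logits, topk_vals, topk_idx)
loss.backward()   # gradients flow to the student; teacher is never in memory
```

At k=8 against a 128K vocabulary, the cache shrinks by a factor of vocab/k = 16,000: from trillions of stored values to a few hundred million.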

Detecting Forgetting in Production

import time

class ForgettingMonitor:
    """
    Monitor for catastrophic forgetting in production.
    Run periodic evaluations on held-out benchmarks.
    Alert if performance drops below threshold.
    """

    def __init__(self, model, benchmarks, alert_threshold=-0.05):
        self.model = model
        self.benchmarks = benchmarks
        self.alert_threshold = alert_threshold  # -5% relative drop
        self.baseline_scores = {}
        self.history = []

    def set_baseline(self):
        """Record baseline scores before any fine-tuning."""
        self.baseline_scores = {}
        for name, eval_fn in self.benchmarks.items():
            score = eval_fn(self.model)
            self.baseline_scores[name] = score

    def check(self):
        """
        Run evaluation and check for forgetting.
        Returns list of alerts for benchmarks that degraded.
        """
        current_scores = {}
        alerts = []

        for name, eval_fn in self.benchmarks.items():
            score = eval_fn(self.model)
            current_scores[name] = score

            if name in self.baseline_scores:
                baseline = self.baseline_scores[name]
                relative_change = (score - baseline) / max(baseline, 1e-8)

                if relative_change < self.alert_threshold:
                    alerts.append({
                        'benchmark': name,
                        'baseline': baseline,
                        'current': score,
                        'relative_change': relative_change,
                        'severity': 'critical' if relative_change < -0.10 else 'warning',
                    })

        self.history.append({
            'timestamp': time.time(),
            'scores': current_scores,
            'alerts': alerts,
        })

        return alerts

Mitigation Method Compute Overhead

Method                     Naive FT   EWC    Replay 10%   Distillation   LoRA   EWC + Replay
Training Time Multiplier   1.0        1.15   1.1          1.3            0.15   1.25

Catastrophic forgetting is the central obstacle to continuously updatable models. No single mitigation eliminates it completely: EWC constrains learning, replay requires data storage, distillation doubles memory, and LoRA requires adapter management. Production systems typically combine replay buffers (cheapest, most effective per unit cost) with LoRA (zero forgetting for clearly separable tasks). The open research question is whether a single model can truly learn continuously without any degradation — or whether the parameter interference problem is fundamental to current neural network architectures.