Why GRPO Over PPO
PPO (Proximal Policy Optimization) requires four models in memory simultaneously:
- Policy model (the model being trained): 140 GB for 70B FP16
- Reference model (frozen copy of the initial policy): 140 GB
- Reward model (scores outputs): 140 GB
- Value model (critic, predicts expected return): 140 GB
Total: 560 GB for a 70B model. Requires 8 H100 GPUs just for model weights.
GRPO eliminates the value model entirely by using within-group comparisons as the baseline:
- Policy model: 140 GB
- Reference model: 140 GB (can be quantized to 35 GB)
- Reward model: 140 GB (can be external API)
Total: 280-315 GB, nearly half the memory of PPO.
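The arithmetic behind those totals, as a quick sanity check (FP16 = 2 bytes per parameter; dividing the reference model by 4 for 4-bit quantization is where the 35 GB figure comes from):

```python
def fp16_gb(params_billions):
    """Memory for FP16 weights: 2 bytes per parameter, reported in GB."""
    return params_billions * 1e9 * 2 / 1e9

policy = fp16_gb(70)       # 140 GB
reference = fp16_gb(70)    # 140 GB
reward = fp16_gb(70)       # 140 GB
value = fp16_gb(70)        # 140 GB (PPO only)

ppo_total = policy + reference + reward + value  # 560 GB
grpo_min = policy + reference                    # 280 GB (reward via external API)
grpo_max = policy + reference / 4 + reward       # 315 GB (4-bit reference, local reward)
```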
The GRPO Algorithm
For a prompt $q$, generate $K$ completions $\{o_1, \dots, o_K\}$ from the current policy $\pi_{\theta_{\text{old}}}$. Compute rewards $\{r_1, \dots, r_K\}$. The GRPO loss:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{K} \sum_{i=1}^{K} \left[ \min\left( \rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \right) - \beta\, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right]$$

where $\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$ is the importance ratio, $A_i$ is the group-relative advantage, and $\beta$ is the KL penalty coefficient.
Step 1: Group Sampling
For each prompt, generate $K$ completions (typically $K = 8$ to $16$):
```python
def group_sample(model, prompt, K=8, max_tokens=2048, temperature=1.0):
    """Generate K completions for one prompt."""
    completions = []
    log_probs = []
    for _ in range(K):
        tokens, lp = model.generate_with_logprobs(
            prompt,
            max_new_tokens=max_tokens,
            temperature=temperature,
        )
        completions.append(tokens)
        log_probs.append(lp)  # sum of log P(token_t | tokens_0..t-1)
    return completions, log_probs
```
Step 2: Compute Rewards
Score each completion with the reward model:
```python
import torch

def compute_rewards(reward_model, prompt, completions):
    """Score each completion. Higher = better."""
    rewards = []
    for completion in completions:
        # ORM: score based on final answer correctness
        # PRM: score based on step-by-step reasoning quality
        r = reward_model.score(prompt, completion)
        rewards.append(r)
    return torch.tensor(rewards)
```
Step 3: Group-Relative Advantage
The key GRPO innovation: no value model needed. The advantage is computed relative to the group:
```python
def compute_group_advantage(rewards):
    """
    Compute advantage relative to the group mean/std.
    This replaces the PPO critic (value model).
    rewards: [K] tensor, reward for each completion in the group
    Returns: [K] tensor of normalized advantages
    """
    mean = rewards.mean()
    std = rewards.std()
    if std < 1e-8:
        # All rewards equal -> no learning signal for this group
        return torch.zeros_like(rewards)
    advantages = (rewards - mean) / (std + 1e-8)
    return advantages
```
In PPO, the advantage is $A_t = r_t - V(s_t)$, where $V(s_t)$ is the value model's prediction. In GRPO, the advantage is $A_i = \frac{r_i - \operatorname{mean}(r_{1..K})}{\operatorname{std}(r_{1..K})}$. The group mean replaces the value model: it is an unbiased estimator of the expected reward for the prompt. The group std normalizes the scale. This works because with $K$ in the 8-16 range, the group statistics are reliable enough for stable training.
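A concrete example in pure Python (`statistics.stdev` uses the same unbiased sample std as `torch.std`'s default): suppose 5 of 8 completions in a group are correct (reward 1) and 3 are wrong (reward 0).

```python
from statistics import mean, stdev

rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]  # 5 correct, 3 wrong
mu, sigma = mean(rewards), stdev(rewards)
advantages = [(r - mu) / sigma for r in rewards]

# Correct completions get a positive advantage, wrong ones a negative one,
# and the advantages average to zero: the group mean acts as the baseline.
```

Note the degenerate case: if all 8 completions are correct (or all wrong), the std is zero and the group contributes no gradient, which is why `compute_group_advantage` returns zeros there.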
Step 4: Policy Gradient with Clipping
The clipped surrogate loss prevents the policy from changing too much in one update:
```python
def grpo_loss(
    policy_model,
    ref_model,
    prompts,
    completions,
    advantages,
    old_log_probs,
    clip_eps=0.2,
    kl_coeff=0.01,
):
    """
    Compute GRPO loss for a batch of prompt-completion pairs.
    policy_model: current policy being optimized
    ref_model: frozen reference (initial SFT checkpoint)
    prompts: list of prompt token sequences
    completions: list of completion token sequences
    advantages: [batch] normalized advantages
    old_log_probs: [batch] log probs under the policy at sampling time
    """
    total_loss = torch.tensor(0.0, device="cuda")
    for i in range(len(prompts)):
        # Current log probability of the completion under the policy
        input_ids = torch.cat([prompts[i], completions[i]])
        current_log_prob = policy_model.compute_log_prob(
            input_ids, completion_start=len(prompts[i])
        )
        # Reference log probability (for the KL penalty)
        with torch.no_grad():
            ref_log_prob = ref_model.compute_log_prob(
                input_ids, completion_start=len(prompts[i])
            )
        # Importance ratio
        ratio = torch.exp(current_log_prob - old_log_probs[i])
        # Clipped surrogate
        clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        surrogate = torch.min(ratio * advantages[i], clipped_ratio * advantages[i])
        # KL divergence penalty (keeps the policy close to the reference)
        kl = current_log_prob - ref_log_prob
        # Negative because we maximize the surrogate
        total_loss += -surrogate + kl_coeff * kl
    return total_loss / len(prompts)
```
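To see what the clipping does in isolation, here is a scalar version of the `torch.clamp`/`torch.min` lines above, in plain Python:

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), as in PPO/GRPO."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# For a positive advantage, the gain is flat beyond ratio = 1.2: there is no
# incentive to push the policy further in a single update. For a negative
# advantage, the penalty at ratio > 1.2 is NOT clipped, so moves that
# increase the probability of bad completions are still fully discouraged.
```

The asymmetry is the point: `min` makes the objective pessimistic, so clipping only ever removes incentive, never removes penalty.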
Step 5: Complete Training Loop
```python
def train_grpo(
    policy_model,
    ref_model,
    reward_model,
    train_prompts,
    K=8,
    num_epochs=3,
    batch_size=4,
    lr=1e-6,
    clip_eps=0.2,
    kl_coeff=0.01,
):
    """Complete GRPO training loop."""
    optimizer = torch.optim.AdamW(policy_model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for batch_start in range(0, len(train_prompts), batch_size):
            batch_prompts = train_prompts[batch_start:batch_start + batch_size]
            all_completions = []
            all_advantages = []
            all_old_log_probs = []
            all_prompts_expanded = []
            all_rewards = []
            for prompt in batch_prompts:
                # Step 1: Generate K completions
                completions, log_probs = group_sample(
                    policy_model, prompt, K=K
                )
                # Step 2: Compute rewards
                rewards = compute_rewards(reward_model, prompt, completions)
                all_rewards.append(rewards)
                # Step 3: Compute group-relative advantages
                advantages = compute_group_advantage(rewards)
                # Store for training
                for j in range(K):
                    all_prompts_expanded.append(prompt)
                    all_completions.append(completions[j])
                    all_advantages.append(advantages[j])
                    all_old_log_probs.append(log_probs[j])
            # Step 4: Policy gradient update
            advantages_tensor = torch.stack(all_advantages)
            old_lp_tensor = torch.stack(all_old_log_probs)
            loss = grpo_loss(
                policy_model, ref_model,
                all_prompts_expanded, all_completions,
                advantages_tensor, old_lp_tensor,
                clip_eps=clip_eps, kl_coeff=kl_coeff,
            )
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy_model.parameters(), 1.0)
            optimizer.step()
            # Logging: mean reward over the whole batch, not just the last group
            mean_reward = torch.cat(all_rewards).mean().item()
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}, "
                  f"Mean Reward: {mean_reward:.4f}")
```
GRPO Training Hyperparameters (DeepSeek-R1 Style)
| Parameter | Value | Rationale |
|---|---|---|
| K (group size) | 8-16 | Larger K = more stable advantages, more compute |
| clip_eps | 0.2 | Standard PPO clipping, prevents large updates |
| kl_coeff (beta) | 0.01-0.05 | Higher = stay closer to reference, less exploration |
| Learning rate | 1e-6 to 5e-7 | Much lower than SFT (prevent catastrophic forgetting) |
| Max tokens per completion | 2048-8192 | Reasoning traces can be long |
| Temperature | 1.0 | Full stochasticity for diverse group samples |
| Gradient clipping | 1.0 | Prevents RL training instability |
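The table can be captured in a config object. The defaults below pick mid-range values from the table; the class itself is illustrative, not from any particular library:

```python
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    group_size: int = 8           # K: completions sampled per prompt
    clip_eps: float = 0.2         # PPO-style clipping range
    kl_coeff: float = 0.01        # beta: KL penalty toward the reference
    learning_rate: float = 1e-6   # far below typical SFT learning rates
    max_completion_tokens: int = 4096
    temperature: float = 1.0      # full stochasticity for diverse groups
    grad_clip_norm: float = 1.0   # gradient clipping for RL stability
```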
The DeepSeek-R1 Three-Stage Recipe
- Stage 1: SFT – Fine-tune the base model on high-quality reasoning traces. This gives the model the format of thinking.
- Stage 2: GRPO – Run GRPO with a math/code reward model. This teaches the model to reason correctly (not just format correctly).
- Stage 3: Rejection Sampling + SFT – Generate many solutions with the GRPO policy, keep the best (correct + clean), and fine-tune on those. This "distills" the RL policy into a cleaner model.
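Stage 3 can be sketched as a filter over sampled solutions. Here `is_correct` and `is_clean` are stand-in predicates (e.g. answer verification and a length/format check), not fixed APIs:

```python
def rejection_sample(solutions, is_correct, is_clean, max_per_prompt=1):
    """Keep at most max_per_prompt correct-and-clean solutions per prompt.

    solutions: list of (prompt, completion) pairs sampled from the GRPO policy.
    Returns a list of (prompt, completion) pairs for the final SFT stage.
    """
    kept, per_prompt = [], {}
    for prompt, completion in solutions:
        if not (is_correct(prompt, completion) and is_clean(completion)):
            continue  # reject wrong or messy solutions
        if per_prompt.get(prompt, 0) >= max_per_prompt:
            continue  # cap per-prompt copies so easy prompts don't dominate
        per_prompt[prompt] = per_prompt.get(prompt, 0) + 1
        kept.append((prompt, completion))
    return kept
```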