Ginger Labs

DAPO: A Stronger Alternative to GRPO

How Decoupled Clip and Dynamic Sampling Policy Optimization addresses GRPO's limitations with decoupled clipping, dynamic sampling, token-level gradients and soft length penalties.

1 Apr, 2026 | Shubham Yadav | RL Optimization | 16 min read
Tags: dapo, grpo, reinforcement-learning, llm, fine-tuning, policy-optimization, rl

Fine-tuning Large Language Models (LLMs) using reinforcement learning has evolved rapidly with methods like GRPO (Group Relative Policy Optimization) becoming widely adopted. However, as these models are increasingly applied to complex reasoning and alignment tasks, limitations in their stability, efficiency and learning signal quality have become more apparent. Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) emerges as a refined approach that addresses these shortcomings through four core improvements:

  • Decoupled clipping: separate lower and upper clip bounds on the policy ratio
  • Dynamic sampling: filtering out prompt groups that carry no learning signal
  • Token-level policy gradients: every token weighted equally in the loss, regardless of response length
  • Soft length penalty: controlling verbosity without hard cutoffs

Where GRPO Falls Short

DAPO is a reinforcement learning algorithm whose open-source implementation is built on the verl (Volcano Engine Reinforcement Learning) framework. It is designed as an advancement over GRPO, particularly in tasks such as reasoning, writing, and alignment, where reward signals tend to be noisy, subjective or ambiguous. While GRPO introduced a simpler alternative to traditional PPO-style methods by removing the need for a critic model, it struggles in these complex scenarios due to several key limitations:

  • Zero-gradient batches: when all sampled responses to a prompt receive the same reward, group normalization yields zero advantages, resulting in no effective learning
  • Symmetric clipping: uniform PPO-style clipping restricts both positive and negative updates equally, limiting the model's ability to strongly reinforce high-quality outputs
  • No filtering of uninformative samples: all prompts are treated as equally valuable, wasting computation on low-signal batches
  • Coarse loss weighting: the loss is averaged within each response before it is averaged across responses, so individual tokens in long responses are under-weighted

These issues are further amplified in subjective tasks, where reward distributions are less distinct. DAPO addresses each of these through its four core innovations.
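The zero-gradient case from the list above can be reproduced in a few lines. This is a minimal sketch of group-relative advantage normalization, assuming the common (reward - mean) / (std + eps) form. The function name, the `eps` value, and the use of the population standard deviation are illustrative choices, not details taken from any particular GRPO implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: correct and incorrect responses get opposite-signed advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Identical outcomes: every advantage is zero, so the group adds no gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

When every response in a group is correct (or every one is wrong), the numerator is zero for all of them, which is exactly the situation dynamic sampling is meant to avoid.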

How DAPO Improves on GRPO

Decoupled Clipping: Asymmetric Updates for Better Exploration

GRPO uses a single symmetric clipping range (ε ≈ 0.2) to stabilize policy updates, but this also restricts how much the model can explore different reasoning paths. DAPO decouples that range into separate lower and upper bounds, ε_low and ε_high, so the policy ratio is clipped to [1 - ε_low, 1 + ε_high]. The DAPO paper reports ε_low = 0.2 and ε_high = 0.28: the raised upper bound lets the model make stronger adjustments when high-reward responses are observed, while the unchanged lower bound keeps downward updates constrained to preserve stability.

The core issue with GRPO's symmetric constraint is that it treats positive and negative gradients equally. This dampens the impact of high-reward samples because even beneficial updates are clipped aggressively, limiting the model's ability to strongly reinforce correct reasoning paths. As a result, GRPO often exhibits slow improvement in tasks where clear reward separation is rare. DAPO counters this by explicitly breaking the symmetry — high-reward trajectories are amplified through relaxed clipping, while low-reward or unstable updates remain tightly controlled. This can be seen as a targeted relaxation of PPO-style constraints, tailored for LLM fine-tuning where exploration and credit assignment are more critical than strict policy conservatism.
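The effect of breaking the symmetry can be seen on a single token. The sketch below implements the PPO-style clipped surrogate with two separate bounds. The function name is hypothetical, and the defaults of 0.2 and 0.28 are the values reported in the DAPO paper, not universal settings.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped objective with separate lower and upper clip bounds.

    ratio     : pi_new(token) / pi_old(token)
    advantage : advantage estimate assigned to this token
    """
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    # Taking the minimum keeps the objective pessimistic, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)

# A good token whose probability rose by 50% under the new policy.
symmetric = clipped_surrogate(1.5, 1.0, eps_low=0.2, eps_high=0.2)
decoupled = clipped_surrogate(1.5, 1.0)

# A bad token whose probability was halved: the lower bound still applies.
suppressed = clipped_surrogate(0.5, -1.0)

print(symmetric, decoupled, suppressed)
```

With the symmetric range the positive update is capped at a ratio of 1.2. With the decoupled range the cap moves to 1.28, while the negative-advantage case is clipped at 0.8 in both settings.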

Dynamic Sampling: Filtering Out Uninformative Batches

Unlike GRPO, which processes all prompts uniformly, DAPO introduces a dynamic, information-aware sampling strategy. In GRPO, when all generated responses for a prompt receive similar rewards, the variance collapses, resulting in near-zero advantages and effectively producing zero-gradient batches. These batches contribute little to learning while still consuming computational resources.

DAPO adds a filtering step that keeps only prompt groups whose reward variance is high enough to provide a meaningful learning signal. In the DAPO paper this means discarding groups in which every response is correct or every response is incorrect, and continuing to sample until the batch is filled with informative groups. Skipping uninformative groups improves the signal-to-noise ratio during training. It also introduces an implicit curriculum learning effect: as the model improves, simpler prompts naturally become less informative and are filtered out, allowing training to concentrate on harder, more discriminative examples.

This variance-based filtering improves on uniform sampling because a group with nonzero reward variance yields nonzero advantages, and therefore a usable gradient for policy improvement, while a zero-variance group yields none. By discarding such groups, DAPO prioritizes the prompt groups that actually carry a learning signal. This is a form of prioritized sampling rather than importance sampling in the strict sense, since the retained groups are not reweighted. The result is a batch in which every group contributes to the update, which makes training more stable and each gradient step more productive.
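The filtering step can be sketched as a loop that keeps sampling prompt groups until enough informative ones have been collected. Everything here is illustrative: `sample_group` stands in for generating and scoring a group of responses, and the toy reward table is made up for the example.

```python
def is_informative(rewards):
    """A group is informative when its rewards are not all identical."""
    return max(rewards) > min(rewards)

def fill_batch(sample_group, prompts, batch_size):
    """Collect (prompt, rewards) pairs, skipping zero-variance groups."""
    kept = []
    for prompt in prompts:
        rewards = sample_group(prompt)  # hypothetical rollout-and-score step
        if is_informative(rewards):
            kept.append((prompt, rewards))
        if len(kept) == batch_size:
            break
    return kept

# Toy stand-in for rollouts: fixed rewards for four prompts of varying difficulty.
toy_rewards = {
    "easy": [1, 1, 1, 1],        # all correct: filtered out
    "medium": [1, 0, 1, 0],
    "hard": [0, 0, 0, 1],
    "impossible": [0, 0, 0, 0],  # all wrong: filtered out
}

batch = fill_batch(lambda p: toy_rewards[p], list(toy_rewards), batch_size=2)
print([prompt for prompt, _ in batch])
```

Prompts the model always solves or always fails are passed over, which is also where the implicit curriculum comes from: as accuracy on a prompt approaches 100%, its groups stop qualifying.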

Token-Level Gradients: Equal Weight for Every Token

In GRPO, each generated response receives a single reward, and the resulting advantage is shared by every token in that response. The loss is then averaged within each response before it is averaged across responses, so every response carries the same weight regardless of its length. This dilutes tokens in long responses: each one contributes less to the update than a token in a short response, so useful reasoning patterns in long outputs are under-reinforced, and errors or repetition in long outputs are under-penalized.

DAPO shifts to a token-level policy gradient formulation, in which the loss is averaged over all tokens in the group at once. Each token contributes equally to the learning signal no matter which response it belongs to. Instead of giving each output a fixed share of the loss, DAPO lets longer responses influence the update in proportion to their length. This is particularly important in long chain-of-thought reasoning tasks, where much of the training signal sits in long responses.

The reward itself remains sequence-level: every token in a response still shares the same advantage. What changes is the weighting of tokens in the loss, not the granularity of the reward. Finer-grained, process-level credit assignment, where the model is told which specific reasoning steps led to a better outcome, would require per-step rewards, which token-level averaging alone does not provide.
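The dilution of tokens in long responses can be made concrete by comparing two ways of averaging per-token losses. This is a minimal sketch with made-up loss values. The function names are hypothetical, and real implementations operate on masked tensors rather than nested lists.

```python
def sample_level_mean(token_losses):
    """Average within each response first, then across responses."""
    per_response = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_response) / len(per_response)

def token_level_mean(token_losses):
    """Average over every token in the group at once."""
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count

# One short response (2 tokens) and one long response (8 tokens).
group = [[1.0, 1.0], [0.0] * 8]

sample_avg = sample_level_mean(group)
token_avg = token_level_mean(group)

# Weight of a single token from the long response under each scheme.
long_token_weight_sample = 1 / (len(group) * len(group[1]))
long_token_weight_token = 1 / sum(len(seq) for seq in group)

print(sample_avg, token_avg)
print(long_token_weight_sample, long_token_weight_token)
```

Under sample-level averaging the two tokens of the short response account for half of the loss. Under token-level averaging all ten tokens carry the same weight, so a token in the long response counts for more than it did before.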

Soft Length Penalty: Controlling Verbosity Without Hard Cutoffs

Reinforcement-learning-based fine-tuning often drifts toward overly long and verbose outputs, as models implicitly learn that longer responses may correlate with higher rewards. In GRPO, this is typically handled through external reward shaping or heuristics, which can be brittle and task-dependent.

DAPO addresses this more systematically by integrating a soft length penalty directly into the reward function. Instead of applying hard cutoffs, DAPO gradually reduces the reward for responses that exceed a certain length threshold and, in extreme cases, fully cancels rewards for excessively long outputs. This ensures that verbosity is discouraged while still preserving useful reasoning content when longer responses are genuinely necessary.

The use of a gradual penalty avoids the pitfalls of hard constraints, which can truncate valid reasoning chains or introduce discontinuities in the optimization landscape. By allowing partial credit for moderately long responses while penalizing excessive verbosity, DAPO preserves meaningful gradients and avoids destabilizing updates — guiding the model towards concise yet complete reasoning.
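A gradual penalty of this kind can be written as a piecewise-linear function of response length. The sketch below assumes a linear ramp over a buffer zone just below the maximum length and a fixed penalty of -1 beyond it. The function names and the `max_len` and `buffer_len` values are illustrative, not recommended settings.

```python
def soft_length_penalty(length, max_len, buffer_len):
    """Return 0 for short responses, a linear ramp in the buffer, and -1 past max_len."""
    soft_start = max_len - buffer_len
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        # Ramps from 0 at soft_start down to -1 at max_len.
        return (soft_start - length) / buffer_len
    return -1.0

def shaped_reward(task_reward, length, max_len=1000, buffer_len=200):
    """Add the length penalty to the task reward (hypothetical limits)."""
    return task_reward + soft_length_penalty(length, max_len, buffer_len)

within_limit = shaped_reward(1.0, 700)   # no penalty
in_buffer = shaped_reward(1.0, 900)      # partial credit
over_limit = shaped_reward(1.0, 1200)    # reward fully cancelled

print(within_limit, in_buffer, over_limit)
```

Because the penalty changes continuously across the buffer, a response that runs slightly long loses only part of its reward instead of dropping straight to the fully penalized value.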

When to Use DAPO vs GRPO

Scenario | GRPO | DAPO
Math, coding, factual QA (clear correctness, consistent rewards) | Sufficient: strong signals, simple pipeline | Unnecessary complexity
Alignment, creative writing, dialogue (subjective, noisy rewards) | Struggles with zero-gradient batches and weak signals | Strong: dynamic sampling filters noise, decoupled clipping amplifies good outputs
Long-form generation | Verbosity risk: no built-in length control | Controlled: soft length penalty maintains information density
Multi-step reasoning | Per-response loss averaging dilutes tokens in long reasoning chains | Token-level loss gives every token equal weight
High reward variance | Uniform sampling wastes compute on uninformative batches | Efficient: variance-based filtering focuses on informative samples

In short: GRPO works well when reward signals are stable, low-noise, and easily distinguishable — such as in mathematics and coding tasks with deterministic correctness checks. DAPO provides a more robust framework when learning signals are sparse, noisy, or require nuanced interpretation — including alignment, creative writing, and complex reasoning tasks where the additional mechanisms for filtering, clipping, and credit assignment translate into meaningful gains.

Key Takeaways

  1. GRPO's core limitations are structural. Zero-gradient batches, symmetric clipping, and per-response loss averaging are inherent to its design, not edge cases.
  2. DAPO's four innovations address distinct failure modes. Decoupled clipping handles exploration, dynamic sampling handles signal quality, token-level gradients handle the under-weighting of long responses, and soft length penalties handle verbosity.
  3. Not every task needs DAPO. For objective tasks with clean rewards, GRPO's simplicity is an advantage. DAPO's complexity pays off primarily in subjective or noisy reward settings.
  4. Dynamic sampling creates an implicit curriculum. As the model improves, easy prompts are automatically filtered out, focusing training on progressively harder examples.
  5. Token-level loss keeps long responses from being diluted. Weighting every token equally matters most for long chain-of-thought outputs, although the reward itself is still assigned per response.
