Fine-tuning Large Language Models (LLMs) has become the default practice for adapting base models to real-world tasks such as reasoning, summarization, and alignment. Yet one of the most persistent and often underestimated obstacles during fine-tuning is the problem of exploding gradients. While traditionally associated with deep neural networks, this problem manifests in more subtle and complex ways in modern LLM fine-tuning pipelines such as SFT, DPO, and GRPO. Understanding and mitigating these instabilities is essential for building efficient and reliable training systems.
Understanding Exploding Gradients
Exploding gradients refer to a situation during training where the magnitude of gradients becomes excessively large, leading to unstable updates in model parameters. In traditional neural networks, this often manifests as sudden spikes in loss, numerical overflow or complete training divergence. However, in the context of fine-tuning Large Language Models, the issue appears in more subtle forms. Rather than purely numerical instability, exploding gradients frequently arise as disproportionately large or inconsistent updates driven by noisy, high-variance training signals — including conflicting gradients from multiple sampled outputs, poorly normalized rewards, and overly aggressive policy updates.
As a result, exploding gradients in LLM fine-tuning are not merely a numerical issue but also a structural one. They can lead to unstable training, sudden divergence, inefficient learning, and wasted computational resources. Understanding how different fine-tuning paradigms — SFT, DPO and GRPO — introduce their own variants of this issue is essential for designing training pipelines that are both stable and efficient.
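Because these instabilities often announce themselves as sudden jumps in the global gradient norm, a lightweight monitor is a useful first diagnostic. The sketch below is framework-agnostic and illustrative: the spike threshold and momentum values are assumptions, not standard settings.

```python
# Minimal gradient-norm monitor (illustrative sketch, framework-agnostic).
# A spike is flagged when the current global norm exceeds a multiple of a
# running average -- spike_factor and momentum here are assumed values.
import math

class GradNormMonitor:
    def __init__(self, spike_factor=5.0, momentum=0.9):
        self.spike_factor = spike_factor  # assumed threshold multiplier
        self.momentum = momentum          # smoothing for the running average
        self.avg_norm = None

    def update(self, grads):
        """grads: iterable of flat gradient lists; returns (norm, is_spike)."""
        sq = sum(g * g for grad in grads for g in grad)
        norm = math.sqrt(sq)
        if self.avg_norm is None:
            self.avg_norm = norm
            return norm, False
        is_spike = norm > self.spike_factor * self.avg_norm
        self.avg_norm = self.momentum * self.avg_norm + (1 - self.momentum) * norm
        return norm, is_spike

monitor = GradNormMonitor()
monitor.update([[0.1, 0.2]])                # warm-up step, small norm
norm, spike = monitor.update([[3.0, 4.0]])  # sudden jump flags a spike
```

In a real pipeline the same idea applies per step after the backward pass; the point is to detect the jump before it shows up as a loss spike.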
Gradient Instability in Supervised Fine-Tuning (SFT)
Supervised Fine-Tuning (SFT) is the most straightforward and widely used paradigm for adapting large language models, where the model learns to imitate labeled input–output pairs using a standard cross-entropy loss. Despite its simplicity, SFT is not inherently stable. One of the primary causes of instability is an ill-conditioned loss landscape, especially when training on long sequences or on datasets with highly variable output lengths. In such cases, rare or high-loss tokens can trigger sharp gradient spikes. Additionally, when the model becomes overly confident, producing very sharp logits, even small prediction errors can lead to disproportionately large gradient updates. This effect is often amplified by noisy or inconsistent datasets, further destabilizing training.
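The sharp-logits effect can be made concrete: the gradient of softmax cross-entropy with respect to the logits is softmax(z) minus the one-hot target, so a confidently wrong prediction drives that gradient toward its maximum magnitude. The numbers below are illustrative.

```python
# Sketch: the gradient of softmax cross-entropy w.r.t. the logits is
# softmax(z) - onehot(y). When the model is confidently wrong (sharp
# logits on the wrong class), this gradient approaches its maximum size.
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_grad_norm(logits, true_idx):
    p = softmax(logits)
    grad = [pi - (1.0 if i == true_idx else 0.0) for i, pi in enumerate(p)]
    return math.sqrt(sum(g * g for g in grad))

# True class is index 1; the model puts increasing mass on the wrong class 0.
mild  = ce_grad_norm([1.0, 0.0], true_idx=1)   # mildly wrong prediction
sharp = ce_grad_norm([8.0, 0.0], true_idx=1)   # sharply, confidently wrong
assert sharp > mild  # sharper wrong logits -> larger gradient
```

The same mechanism explains why noisy labels hurt: a mislabeled example forces exactly this confidently-wrong configuration.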
The problem becomes more pronounced in LoRA-based SFT. Due to the low-rank constraint, the optimizer is forced to navigate a narrow and sharp loss landscape, which increases the likelihood of gradient explosions at specific phases of training. This often manifests as a characteristic double descent curve: the loss initially decreases, then spikes mid-training before stabilizing again. This behavior reflects the transition from underfitting to overfitting under rank constraints, as observed in recent works like LoRA-GGPO, which track gradient and weight norms throughout training to better understand these dynamics.
Mitigations for SFT
- Label smoothing — prevents the model from becoming overly confident, keeping gradients bounded
- Learning rate warmup — ensures early training steps remain stable by avoiding large initial updates
- Gradient clipping (typical max norm: 1.0) — provides a direct safeguard against sudden spikes
- Gradient and weight norm monitoring — approaches like LoRA-GGPO continuously track norms to detect instability early
- Gradient-guided perturbations — injecting controlled noise into model weights during critical phases to smooth out the loss landscape, encouraging convergence toward flatter minima
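Two of the mitigations above, label smoothing and global-norm gradient clipping, can be sketched in a few lines of plain Python. The eps and max_norm values are the commonly cited defaults, but treat them as illustrative starting points.

```python
# Sketch of two SFT mitigations: label smoothing (soften the one-hot
# target so no class gets probability 1) and global-norm gradient
# clipping (rescale gradients when their norm exceeds max_norm).
import math

def smooth_targets(num_classes, true_idx, eps=0.1):
    """Distribute eps uniformly, keep 1 - eps extra mass on the true class."""
    t = [eps / num_classes] * num_classes
    t[true_idx] += 1.0 - eps
    return t

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector if its global norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads

targets = smooth_targets(4, true_idx=2)        # mass spread off the true class
clipped = clip_by_global_norm([3.0, 4.0])      # norm 5.0 rescaled down to 1.0
```

In practice these correspond to the `label_smoothing` argument of most cross-entropy implementations and to utilities like PyTorch's `clip_grad_norm_`.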
How DPO Amplifies Gradient Instability
Direct Preference Optimization (DPO) introduces a fundamentally different kind of instability compared to SFT. Rather than learning from ground-truth labeled outputs, DPO trains on preference pairs where the model is encouraged to assign higher probability to a chosen response over a rejected one. While this formulation removes the need for an explicit reward model and simplifies the pipeline, it embeds instability directly into the objective function. The core issue lies in the log probability ratio between the preferred and non-preferred responses. As the model improves, it naturally becomes more confident in its preferences, causing this ratio to grow rapidly. This, in turn, leads to gradients that scale disproportionately, not because of numerical overflow, but due to the over-amplification of preference differences.
As training progresses, this dynamic creates a feedback loop: higher confidence leads to larger gradients, which push the model toward even sharper probability distributions. The result is a tendency toward overconfident behavior, where the model assigns extreme probabilities to certain outputs while neglecting others. Unlike SFT, where instability is often tied to data properties or sequence length, DPO's instability is structural — it arises from how confidence is directly coupled with gradient magnitude in the loss formulation.
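The coupling between confidence and gradient scale can be seen directly in the DPO objective, which is -log(sigmoid(β · margin)) where the margin is the difference of log-probability ratios between the chosen and rejected responses. Its derivative with respect to the margin has magnitude β · sigmoid(-β · margin), so β directly scales the update size. The numbers below are illustrative, not from a real model.

```python
# Sketch of the DPO objective: loss = -log(sigmoid(beta * margin)),
# where margin = (log-prob ratio of chosen) - (log-prob ratio of rejected).
# d(loss)/d(margin) = -beta * sigmoid(-beta * margin), so beta directly
# scales gradient magnitude; at margin 0 the magnitude is exactly beta / 2.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(margin, beta):
    return -math.log(sigmoid(beta * margin))

def dpo_grad_wrt_margin(margin, beta):
    return -beta * sigmoid(-beta * margin)

# Early in training the margin is near zero; a larger beta means
# proportionally larger updates from the same preference signal.
assert abs(dpo_grad_wrt_margin(0.0, beta=0.5)) > abs(dpo_grad_wrt_margin(0.0, beta=0.1))
```

This is why β tuning appears first in the mitigation list: it is the single knob that rescales every preference gradient.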
Mitigations for DPO
- β (beta) parameter tuning (typical range: 0.1–0.5) — a temperature-like scaling factor that controls the sharpness of the preference objective. Smaller values enforce more conservative updates by dampening the log probability ratio, effectively acting as a regularization mechanism
- Conservative DPO (cDPO) — introduces label smoothing into preference learning, softening targets to reduce overfitting to noisy or inconsistent annotations
- Logit scaling — explicitly constrains the range of model outputs to prevent extreme probability assignments
- Clipped objectives — bounds the magnitude of parameter updates even as the model becomes more confident
Ultimately, the instability in DPO can be understood as a consequence of the model becoming "too certain too quickly." Successful DPO training is less about eliminating confidence and more about regulating it through temperature scaling, smoothing, and constrained optimization, so that the model learns strong preferences without sacrificing stability or generalization.
Signal Conflicts and Instability in GRPO
Group Relative Policy Optimization (GRPO) represents one of the most complex and instability-prone stages of LLM fine-tuning, as it brings reinforcement learning into the loop with group-based comparisons instead of explicit labels or pairwise preferences. While GRPO removes the need for a separate critic or reward model, it introduces new forms of instability that arise from the interaction of multiple sampled outputs (rollouts) for the same prompt. Unlike SFT or DPO, where instability is largely tied to gradient magnitude, GRPO's challenges are rooted in the quality and consistency of the learning signal itself.
Destructive Gradient Conflict
One of the most critical issues is destructive gradient conflict, as highlighted in approaches like DaGRPO. When multiple rollouts are highly similar, or even nearly identical, their gradients can point in opposing directions due to binary or coarse reward signals. This leads to updates that either cancel each other out or amplify noise, reducing sample efficiency and causing instability. The situation worsens for harder queries, where valid positive samples are rare. In such cases, the model may overfit to random correct responses or misleading signals, further destabilizing training.
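A minimal sketch shows why near-identical rollouts are dangerous. GRPO's group-relative advantage is typically computed as (reward - group mean) / (group std), and when the rollouts earn nearly identical rewards, that normalization blows tiny, essentially random reward differences up into full-sized advantages. The reward values and eps below are made up for illustration.

```python
# Sketch of GRPO's group-relative advantage, (r - mean) / (std + eps).
# When rollout rewards are nearly identical, the normalization amplifies
# tiny reward gaps into full-sized (and essentially random) advantages.
import statistics

def group_advantages(rewards, eps=1e-8):
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

diverse = group_advantages([1.0, 0.0, 0.5, 0.2])      # informative spread
similar = group_advantages([0.50, 0.50, 0.50, 0.51])  # near-identical rollouts
# A 0.01 reward gap in the similar group still yields an advantage of
# roughly 1.7 -- the same scale as the genuinely diverse group -- so
# noise is amplified into confident-looking updates.
```

Masking low-distinctiveness groups, as in DaGRPO, amounts to refusing to normalize when sigma is too small to be meaningful.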
Token-Level Conflicts
A related but deeper issue occurs at the token level, addressed by GTPO: the same token can appear in both positively and negatively rewarded outputs, causing contradictory updates where the model is simultaneously encouraged to increase and decrease its probability. Over time, this results in policy collapse, where correct tokens are penalized disproportionately, incorrect tokens gain probability, entropy increases, and the overall learning signal degrades.
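A toy example makes the token-level cancellation concrete: when the same token appears in one positively rewarded rollout and one negatively rewarded rollout, naive sequence-level credit assignment pushes its probability in both directions at once. The token strings and advantages below are made up.

```python
# Toy illustration of a token-level conflict: sequence-level advantages
# are spread over every token, so tokens shared between a rewarded and a
# penalized rollout receive contradictory updates that cancel out.
def token_updates(rollouts):
    """rollouts: list of (token_list, advantage). Returns net push per token."""
    net = {}
    for tokens, adv in rollouts:
        for t in tokens:
            net[t] = net.get(t, 0.0) + adv
    return net

net = token_updates([
    (["The", "answer", "is", "4"], +1.0),   # rewarded rollout
    (["The", "answer", "is", "5"], -1.0),   # penalized rollout
])
# Shared tokens ("The", "answer", "is") receive a net update of 0.0;
# only the distinguishing tokens "4" and "5" carry usable signal.
```

GTPO's skip-negative-updates rule can be read as a way to keep the shared-token contributions from turning into pure noise.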
Saturation-Phase Instability
GRPO also exhibits instability across training phases. Training typically progresses through three stages: slow initial learning, rapid improvement, and a final saturation phase. Beyond a certain point, often around 80% of an epoch, reward improvements plateau while gradients become increasingly weak and noisy. Continuing training in this phase not only wastes computation but also increases the risk of instability, as the model begins to learn from low-quality or noisy signals.
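A crude but serviceable saturation detector compares a moving average of recent rewards against the preceding window and stops once the gain flattens. The window size, threshold, and reward history below are illustrative choices, not values from the cited work.

```python
# Sketch of a simple saturation detector: stop once the improvement of a
# moving-average reward over the previous window falls below a threshold.
def should_stop(rewards, window=4, min_improvement=0.01):
    """rewards: per-step mean rewards. True once recent gains flatten out."""
    if len(rewards) < 2 * window:
        return False
    recent = sum(rewards[-window:]) / window
    earlier = sum(rewards[-2 * window:-window]) / window
    return (recent - earlier) < min_improvement

history = [0.10, 0.25, 0.45, 0.55, 0.60, 0.605, 0.608, 0.61,
           0.61, 0.61, 0.61, 0.61]
# Mid-training (first 8 steps) the reward is still climbing, so the
# detector keeps going; by the end the curve has flattened and it stops.
```

Predictive early stopping goes further by modeling the whole reward trajectory, but the trigger condition it learns is conceptually this same flattening.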
Mitigations for GRPO
- Sequence-level gradient rectification (DaGRPO) — masks low-distinctiveness samples, ensuring only meaningful differences contribute to updates
- Off-policy data augmentation (DaGRPO) — injects high-quality anchor samples when on-policy rollouts fail to provide useful signals
- Skipping negative gradient updates (GTPO) — prevents contradictory feedback on shared tokens by only applying positive updates
- Entropy-based filtering (GTPO) — selects only confident (low-entropy) outputs and explicitly penalizes high-entropy behavior to maintain structured learning
- Predictive early stopping — models reward trajectories as a function of training progress, model size, and initial performance to predict when further training will yield diminishing returns
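The entropy-based filtering idea above can be sketched as follows: compute the average per-token entropy of each rollout and keep only the confident (low-entropy) ones. The threshold and the toy probability distributions are assumptions for illustration.

```python
# Sketch of entropy-based filtering: keep only rollouts whose average
# per-token entropy is below a threshold, on the assumption that
# low-entropy (confident) outputs carry a cleaner learning signal.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_confident(rollouts, max_entropy=0.5):
    """rollouts: list of outputs, each a list of per-token distributions."""
    kept = []
    for dists in rollouts:
        avg_h = sum(entropy(d) for d in dists) / len(dists)
        if avg_h <= max_entropy:
            kept.append(dists)
    return kept

confident = [[0.95, 0.05], [0.9, 0.1]]   # sharp per-token distributions
uncertain = [[0.5, 0.5], [0.6, 0.4]]     # near-uniform distributions
kept = filter_confident([confident, uncertain])
# Only the confident rollout survives filtering.
```

GTPO additionally penalizes high-entropy behavior rather than merely dropping it, but the selection criterion is the same entropy statistic.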
Summary: Comparing Instabilities Across Paradigms
| Method | Primary Instability Source | Root Cause | Key Mitigations |
|---|---|---|---|
| SFT | Loss landscape, sharp logits | High-loss tokens and overconfident predictions in narrow loss landscapes | Label smoothing, warmup, gradient clipping, LoRA-GGPO |
| DPO | Confidence-gradient coupling | Log probability ratio grows as model becomes more confident | β tuning (0.1–0.5), cDPO label smoothing, logit scaling |
| GRPO | Signal conflict and degradation | Contradictory gradients from similar rollouts and noisy late-stage signals | DaGRPO, GTPO, entropy filtering, predictive early stopping |
Key Takeaways
- Exploding gradients in LLM fine-tuning are structural, not just numerical. The problem extends beyond gradient magnitude to include noisy, conflicting, and degrading learning signals.
- Each paradigm introduces its own variant of instability. SFT struggles with loss landscape sharpness, DPO with confidence-gradient coupling, and GRPO with signal quality and consistency.
- Stability is no longer just about clipping gradients. Modern approaches focus on resolving conflicts in the learning signal (DaGRPO, GTPO), regulating confidence (β tuning, cDPO), and knowing when to stop (predictive early stopping).
- Monitor before you mitigate. Tracking gradient norms, weight norms, reward variance, and entropy throughout training helps identify the specific form of instability before applying a fix.
- The right mitigation depends on the paradigm. There is no universal fix — choosing the appropriate strategy requires understanding whether the instability comes from the loss landscape, the objective function, or the reward signal.

