Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning
Jingchu Gai, Guanning Zeng, Huaqing Zhang, Aditi Raghunathan

TL;DR
This paper introduces differential smoothing, a principled method to mitigate diversity collapse in RL fine-tuning of large language models, leading to improved correctness and diversity across multiple tasks and models.
Contribution
It provides a formal analysis of diversity collapse in RL fine-tuning and proposes differential smoothing, a novel method that provably enhances both correctness and diversity.
Findings
Differential smoothing outperforms vanilla RL and entropy heuristics.
Up to 6.7% improvement on the AIME24 dataset.
Consistent gains across models from 1B to 7B parameters.
Abstract
It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method -- differential smoothing -- that provably improves both correctness and diversity,…
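The observation that a diversity-oriented reward modification only needs to touch the correct trajectories can be illustrated with a toy reward-shaping function. This is a minimal sketch under stated assumptions, not the paper's differential smoothing formula, which this summary does not give. The function name `shaped_rewards`, the coefficient `alpha`, and the use of a negative log-probability bonus as the diversity term are all hypothetical choices made for illustration.

```python
def shaped_rewards(correct, logps, alpha=0.1):
    """Toy reward shaping: a diversity bonus on correct trajectories only.

    correct: list of bools, whether each sampled trajectory is correct.
    logps:   list of floats, sequence log-probabilities of each trajectory
             under the policy (assumed to be available from the sampler).
    alpha:   hypothetical bonus strength; not a value from the paper.
    """
    if len(correct) != len(logps):
        raise ValueError("correct and logps must have the same length")

    rewards = []
    for is_correct, logp in zip(correct, logps):
        if is_correct:
            # Base reward 1 plus a bonus that grows as the trajectory
            # becomes less likely (logp more negative), so rarer correct
            # solutions are reinforced more strongly.
            rewards.append(1.0 - alpha * logp)
        else:
            # Incorrect trajectories are left untouched in this sketch.
            rewards.append(0.0)
    return rewards


# Two correct trajectories (one common, one rare) and one incorrect one.
example = shaped_rewards([True, True, False], [-1.0, -3.0, -0.5], alpha=0.1)
```

In this sketch the rarer correct trajectory receives the larger reward, while the incorrect trajectory's reward is unchanged, so the modification cannot raise the reward of wrong answers.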
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper provides a unifying theoretical framework for existing methods which add an entropy-modifying term to the RL objective.
2. The empirical results are strong, showing consistent improvement against previous methods.
3. Theorems clearly demonstrate the gap that previous methods did not address.
1. In Section 3.1 (lines 155 to 161), the paper presents a reward function that is the indicator function for positive examples sampled from $\pi_\text{base}$. It is unclear how applicable this reward function is in practice. Learned reward functions (e.g. from preference data as in RLHF [1]) do not match this framework. When using verifiable rewards as in GRPO [2], the verifier is run on the model outputs in each iteration rather than "sampled from the base policy, $\pi_\text{base}$" (Section 3
- Sharpening and the resulting lack of diversity is an important problem to study.
- The paper provides a theoretically justified method for addressing the sharpening issue of RL fine-tuning, and the solution is simple to adapt for existing RL algorithms.
- The paper provides solid empirical evidence across various reasoning benchmarks.
I'm not an expert on the reasoning datasets, but it seems that the pass rate changes between the original GRPO and the proposed DS-GRPO are consistent but also modest.
- Rigorous theoretical framework: The paper provides a formal and intuitive theoretical analysis of the sharpening effect.
- The method is tested on multiple reasoning benchmarks (MATH500, AIME24/25, OlympiadBench, AMC23, Countdown) and several model families (Qwen2.5, Qwen3, Ministral-8B).
- The paper introduces an improved entropy-based method as a comparison.
- The paper would benefit from a comparison of DS-GRPO with more recent RL methods such as CISPO [1]. This would clarify how the proposed approach performs relative to the current state of the art.
- The paper reports absolute diversity measures, but typically diversity collapse is assessed relative to the base model. Showing how diversity changes compared to the pre-training baseline (e.g. [1], Figure 12) would provide stronger evidence that DS-GRPO truly mitigates entropy collapse.
- Since the
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
