Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

Jingchu Gai; Guanning Zeng; Huaqing Zhang; Aditi Raghunathan

arXiv:2511.19942 · cs.LG · December 12, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces differential smoothing, a principled method to mitigate diversity collapse in RL fine-tuning of large language models, leading to improved correctness and diversity across multiple tasks and models.

Contribution

It provides a formal analysis of diversity collapse in RL fine-tuning and proposes differential smoothing, a novel method that provably enhances both correctness and diversity.

Findings

1. Differential smoothing outperforms vanilla RL and entropy heuristics.
2. Up to 6.7% improvement on the AIME24 dataset.
3. Consistent gains across models from 1B to 7B parameters.

Abstract

It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method -- differential smoothing -- that provably improves both correctness and diversity,…
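The observation that reward modifications only need to touch correct trajectories can be illustrated with a minimal sketch. This is not the paper's differential smoothing objective, which the truncated abstract does not specify. The function name `shape_rewards`, the coefficient `coef`, and the log-probability bonus are all assumptions chosen for illustration: correct trajectories receive a bonus that grows as their probability under the policy shrinks, while incorrect trajectories keep their original reward of zero.

```python
from typing import List


def shape_rewards(
    correct: List[bool],
    logprobs: List[float],
    coef: float = 0.1,
) -> List[float]:
    """Hypothetical reward shaping applied only to correct trajectories.

    correct:  whether each sampled trajectory passed the verifier.
    logprobs: log-probability of each trajectory under the current policy.
    coef:     assumed strength of the diversity bonus (not from the paper).
    """
    if len(correct) != len(logprobs):
        raise ValueError("correct and logprobs must have the same length")

    shaped = []
    for is_correct, logp in zip(correct, logprobs):
        if is_correct:
            # Base reward of 1, plus a bonus that is larger for rarer
            # (lower log-probability) correct trajectories.
            shaped.append(1.0 - coef * logp)
        else:
            # Incorrect trajectories are left untouched.
            shaped.append(0.0)
    return shaped
```

For example, `shape_rewards([True, False, True], [-1.0, -0.5, -3.0], coef=0.5)` gives the rarer correct trajectory a larger reward than the more likely one and leaves the incorrect trajectory at zero.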

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper provides a unifying theoretical framework for existing methods that add an entropy-modifying term to the RL objective.
2. The empirical results are strong, showing consistent improvement over previous methods.
3. The theorems clearly demonstrate the gap that previous methods did not address.

Weaknesses

1. In Section 3.1 (lines 155 to 161), the paper presents a reward function that is the indicator function for sampled positive examples from $\pi_\text{base}$. It is unclear how applicable this reward function is in practice. Learned reward functions (e.g. from preference data as in RLHF [1]) do not match this framework. When using verifiable rewards as in GRPO [2], the verifier is run on the model outputs in each iteration rather than "sampled from the base policy, $\pi_\text{base}$" (Section 3

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- The sharpening and lack of diversity studied here is an important problem.
- The paper provides a theoretically justified method for addressing the sharpening issue of RL fine-tuning, and the solution is simple to adapt for existing RL algorithms.
- The paper provides solid empirical evidence across various reasoning benchmarks.

Weaknesses

I'm not an expert on the reasoning datasets, but it seems that the pass rate changes between the original GRPO and the proposed DS-GRPO are consistent but also modest.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Rigorous theoretical framework: The paper provides a formal and intuitive theoretical analysis of the sharpening effect.
- The method is tested on multiple reasoning benchmarks (MATH500, AIME24/25, OlympiadBench, AMC23, Countdown) and several model families (Qwen2.5, Qwen3, Ministral-8B).
- The paper introduces an improved entropy-based method as a comparison.

Weaknesses

- The paper would benefit from a comparison of DS-GRPO with more recent RL methods such as CISPO [1]. This would clarify how the proposed approach performs relative to the current state of the art.
- The paper reports absolute diversity measures, but typically diversity collapse is assessed relative to the base model. Showing how diversity changes compared to the pre-training baseline (e.g. [1], Figure 12) would provide stronger evidence that DS-GRPO truly mitigates entropy collapse.
- Since the

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification