Are complicated loss functions necessary for teaching LLMs to reason?
Gabriele Carrino, Andrea Sassella, Nicolo Brunello, Federico Toschi, Mark James Carman

TL;DR
This paper investigates the necessity of complex loss components in training LLMs for reasoning, finding that simpler methods like REINFORCE with group advantage can outperform more complicated approaches like GRPO.
Contribution
The paper introduces RGRA, a simplified reinforcement learning approach that removes unnecessary components from GRPO, improving reasoning performance in LLMs.
Findings
Negative feedback is crucial for effective training.
PPO-style constraints, such as policy ratio clipping, are not necessary for reasoning improvement.
RGRA outperforms GRPO on mathematical benchmarks.
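The first finding can be illustrated with a small sketch. It assumes that "training solely on actions above a baseline" means zeroing out every negative advantage, which is one plausible reading and not necessarily the exact ablation run in the paper. The function names `centered_advantages` and `positive_only` are hypothetical.

```python
def centered_advantages(rewards):
    """Subtract the group-mean baseline from each rollout's reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def positive_only(advantages):
    """Assumed 'no negative feedback' variant: keep only above-baseline rollouts."""
    return [max(a, 0.0) for a in advantages]


# One prompt, four sampled answers, only the first one is correct.
advantages = centered_advantages([1.0, 0.0, 0.0, 0.0])
print(advantages)                 # the three wrong answers get a negative signal
print(positive_only(advantages))  # the wrong answers now contribute no gradient
```

Under this reading, the positive-only variant never pushes probability mass away from wrong answers; it can only reinforce the rollouts that happened to score above the baseline.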
Abstract
Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this domain by combining group-relative advantage estimation, PPO-style clipping, and KL regularization. However, its complexity raises the question of whether all components are necessary for fostering reasoning behaviors. We conduct a systematic analysis of GRPO and identify two key findings: (1) incorporating negative feedback is essential: training solely on actions above a baseline limits learning; and (2) PPO-style constraints, such as policy ratio clipping, are not required to improve mathematical reasoning or performance. Building on these insights, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that retains group-relative advantage…
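The components named in the abstract can be sketched per sample as follows. This is a minimal illustration and not the authors' implementation: the abstract is truncated before RGRA's exact loss is stated, so the REINFORCE-style term below is an assumption based on the method's name. The group-relative advantage uses mean and standard-deviation normalization as GRPO is commonly formulated, the KL term is omitted for brevity, and all function names and the `clip_eps=0.2` default are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style objective for one sample: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def reinforce_surrogate(logp_new, advantage):
    """REINFORCE-style objective for one sample: A * log pi (no ratio, no clip)."""
    return advantage * logp_new


# Two correct and two incorrect rollouts for one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(clipped_surrogate(logp_new=0.0, logp_old=-1.0, advantage=1.0))
print(reinforce_surrogate(logp_new=-0.5, advantage=2.0))
```

In the clipped variant, a large policy ratio stops increasing the objective once it leaves the interval [1 - clip_eps, 1 + clip_eps] in the direction favored by the advantage; the REINFORCE-style term has no such cap, which is the kind of constraint the abstract reports as unnecessary.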
Peer Reviews
Decision·Submitted to ICLR 2026
* Systematic studies like the one this paper conducts are generally important for the community, especially for understanding RL post-training for LLMs.
* The authors test on two different model families and take care in evaluating on a comprehensive set of benchmarks split across Chinese/English and math/other subject domains.
* The model scale and setting (<=1.5B parameter models with LoRA fine-tuning) is limited and it's unclear if their findings extrapolate to larger model scales and full fine-tuning.
* In particular, prior work [1] seems to show a different result that positive-only reinforcement can be competitive with GRPO/PPO provided verifiable rewards are used and poor prompts are filtered. The findings from Xiong et al. are from larger models (7B-70B), which supports the potential limitations of the model si
This paper studies an important question in RL post-training, namely which components of the loss function are required for models to perform well. Based on their findings, the authors propose RGRA for LLM post-training.
There are several major weaknesses with this paper. To begin, the framing of the paper is an ablation over the main components of the GRPO loss. However, there are several key components missing from this ablation:
- as far as I understand, the authors do not sweep over the hyperparameters of any of the baselines they run. Critically, for an ablation over components of GRPO, they do not sweep over the number of rollouts, nor over the amount of steps taken off policy by the algorithm (I am referr
N/A
* There is a severe lack of novelty in the paper. The proposed RGRA is essentially GRPO without importance sampling.
* The paper is poorly composed; the results are not well organized. Figure 1 occupies an entire page without any accompanying analysis in the caption. Tables 1 and 2 are also poorly formatted, lacking proper bolding and explanations for abbreviations. From this standpoint alone, the paper feels far from complete.
* There is almost no discussion regarding the differences between RG
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Reinforcement Learning in Robotics
