Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR
Yash Ingle, Jaival Chauhan, Ankit Yadav, Sudhakar Mishra

TL;DR
This paper introduces adaptive and confidence-weighted negative reinforcement techniques to improve large language model reasoning by dynamically balancing correction and diversity during training.
Contribution
It proposes novel adaptive scheduling and confidence-based penalty weighting methods for negative reinforcement in LLM training, enhancing reasoning performance.
Findings
Improved performance on MATH, AIME 2025, and AMC23 datasets.
Adaptive and confidence-weighted methods outperform fixed penalty approaches.
Formal analysis shows effective token-level update control.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a highly effective method for improving the reasoning abilities of Large Language Models (LLMs). Recent research shows that Negative Sample Reinforcement (NSR) -- which focuses on penalizing incorrect steps rather than simply rewarding correct ones -- can match or even exceed the performance of more complex frameworks like PPO and GRPO across the entire Pass@k spectrum. However, current NSR techniques usually apply a fixed penalty throughout the training process and treat every incorrect response with the same weight. To address these limitations, we propose two extensions to the NSR framework: Adaptive Negative Sample Reinforcement (A-NSR). Rather than using a fixed update rule, A-NSR uses time-dependent scheduling functions. In the initial training phases, the system focuses heavily on correcting errors to stabilize the model.…
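Illustrative sketch
The abstract is truncated here and gives neither the scheduling functions nor the confidence weighting. The Python below is therefore only a rough sketch of the two ideas it names, not the authors' method. Every name in it (`nsr_weight`, `confidence_weight`, `negative_sample_loss`, `w_start`, `w_end`) is hypothetical. The cosine decay, the use of mean token probability as "confidence", and the choice to penalize confident mistakes more strongly are all assumptions.

```python
import math


def nsr_weight(step, total_steps, w_start=1.0, w_end=0.2):
    """Hypothetical time-dependent schedule (assumed cosine decay).

    Returns a large penalty weight early in training and a smaller
    one later; w_start and w_end are illustrative values only.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def confidence_weight(token_logprobs):
    """Hypothetical confidence score: mean token probability of the response.

    Assumption: a wrong answer the model was confident in gets a
    larger weight than one it was unsure about.
    """
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)


def negative_sample_loss(token_logprobs, step, total_steps):
    """Scaled loss for one incorrect response.

    The base term is the mean token log-probability, so minimizing
    it pushes the likelihood of the wrong response down. The scale
    combines the schedule and the confidence weight.
    """
    scale = nsr_weight(step, total_steps) * confidence_weight(token_logprobs)
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return scale * mean_logprob


if __name__ == "__main__":
    wrong_response = [math.log(0.5), math.log(0.5)]
    for step in (0, 50, 100):
        print(step, negative_sample_loss(wrong_response, step, 100))
```

In an actual training loop the confidence weight would usually be treated as a constant (no gradient flowing through it) so that it only rescales the update. Whether the paper does this is not stated in the text above.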
