TL;DR
ResRL is a reinforcement learning method that decouples the semantic distributions of positive and negative responses, improving LLM reasoning while preserving response diversity.
Contribution
The paper proposes negative sample projection Residual Reinforcement Learning (ResRL), an approach that improves reasoning in LLMs while maintaining response diversity.
Findings
ResRL outperforms strong baselines across twelve benchmarks.
ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16.
ResRL effectively balances reasoning ability and diversity in LLM outputs.
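For readers unfamiliar with the Avg@16 metric cited above, here is a minimal sketch of how such a score is commonly computed: sample k responses per problem, take the fraction that are correct, and average over problems. The function name `avg_at_k` and the input layout are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
from statistics import mean


def avg_at_k(correct: list[list[bool]], k: int) -> float:
    """Average accuracy over k sampled responses per problem.

    `correct[i][j]` is True when the j-th sampled response to problem i
    was verified as correct. This layout is a hypothetical convention,
    not one taken from the paper.
    """
    per_problem = []
    for flags in correct:
        if len(flags) < k:
            raise ValueError("each problem needs at least k sampled responses")
        # Fraction of the first k samples that are correct for this problem.
        per_problem.append(sum(flags[:k]) / k)
    # Average the per-problem accuracies across the benchmark.
    return mean(per_problem)


# Two problems, two samples each: accuracies 0.5 and 1.0.
score = avg_at_k([[True, False], [True, True]], k=2)
```

With k=16 this would give an Avg@16 value for a benchmark, under the assumptions above.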
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects…
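The abstract's point about upweighting the penalty from negative samples can be illustrated with a generic REINFORCE-style surrogate loss that applies separate weights to positive and negative samples. This is only a sketch of the general idea: it is not the ResRL objective (which the truncated abstract does not fully specify), nor necessarily the exact NSR formulation. The names `weighted_surrogate_loss`, `pos_weight`, and `neg_weight` are hypothetical.

```python
def weighted_surrogate_loss(
    log_probs: list[float],
    rewards: list[int],
    pos_weight: float = 1.0,
    neg_weight: float = 1.0,
) -> float:
    """Generic surrogate loss with separate weights per reward sign.

    `log_probs[i]` is the sequence log-probability of response i and
    `rewards[i]` is +1 (verified correct) or -1 (incorrect). Minimizing
    the loss raises the likelihood of positive samples and lowers that
    of negative ones; `neg_weight > pos_weight` strengthens the penalty.
    """
    total = 0.0
    for lp, r in zip(log_probs, rewards):
        w = pos_weight if r > 0 else neg_weight
        total += -w * r * lp
    return total / len(log_probs)


# One correct and one incorrect response; the penalty term is doubled.
loss = weighted_surrogate_loss([-1.0, -2.0], [1, -1], neg_weight=2.0)
```

A uniform penalty of this kind acts on the whole negative response, including content it shares with correct responses, which is the side effect the abstract says ResRL is designed to avoid.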
