Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning

Zhaohui Yang; Yuxiao Ye; Shilei Jiang; Chen Hu; Linjing Li; Shihong Deng; Daxin Jiang

arXiv:2505.14403·cs.AI·September 16, 2025

TL;DR

This paper introduces BCPG-NSA, a novel offline RL framework that leverages negative samples in reasoning datasets to improve LLM reasoning performance, efficiency, and robustness.

Contribution

It proposes a new fine-grained policy optimization method that effectively utilizes negative samples through segmentation, correctness assessment, and augmentation.

Findings

01 Outperforms baselines on math and coding reasoning benchmarks

02 Achieves higher sample efficiency in training

03 Demonstrates robustness and scalability across multiple iterations

Abstract

Recent advances in reasoning language models have brought a paradigm shift from short to long chain-of-thought (CoT) patterns. Given the substantial computational cost of rollouts in long-CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet the main existing methods either discard negative samples entirely (RFT) or penalize all tokens equally (RL), and so fail to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judges, and 3) policy optimization with NSA designed to effectively…
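The three stages named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the blank-line step delimiter, the `Step` fields, the PRM threshold, and the specific weights (`penalty`, `softened`) are all assumptions chosen for illustration, and the actual NSA objective is not given in the text above. The sketch only shows the general idea of penalizing steps in a negative sample unequally, based on agreement between an LLM judge and a PRM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    llm_says_correct: bool  # hypothetical verdict from an LLM judge
    prm_score: float        # hypothetical process-reward-model score in [0, 1]


def segment(response: str) -> List[str]:
    """Stage 1 (assumed form): split a response into steps on blank lines."""
    return [s.strip() for s in response.split("\n\n") if s.strip()]


def consensus_correct(step: Step, prm_threshold: float = 0.5) -> bool:
    """Stage 2 (assumed form): a step counts as correct only if both judges agree."""
    return step.llm_says_correct and step.prm_score >= prm_threshold


def step_weights(
    steps: List[Step],
    is_negative: bool,
    penalty: float = -1.0,   # illustrative weight for steps judged wrong
    softened: float = -0.2,  # illustrative weight for correct steps in a negative sample
) -> List[float]:
    """Stage 3 (assumed form): per-step weights for a policy-gradient loss.

    Positive samples get a uniform weight of 1.0. In negative samples,
    steps that pass the consensus check receive a softened penalty
    instead of the full one.
    """
    if not is_negative:
        return [1.0 for _ in steps]
    return [softened if consensus_correct(s) else penalty for s in steps]
```

In a real training loop these per-step weights would be broadcast to the tokens of each step and multiplied into the token-level log-probability terms; how BCPG-NSA actually does this is not specified in the truncated abstract.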

Taxonomy

Topics: Natural Language Processing Techniques