The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Xinyu Zhu; Mengzhou Xia; Zhepei Wei; Wei-Lin Chen; Danqi Chen; Yu Meng

arXiv:2506.01347 · cs.CL · October 28, 2025

TL;DR

This paper reveals that training language models with only negative reinforcement—penalizing incorrect responses—can significantly enhance reasoning performance, often surpassing traditional reinforcement learning methods that reinforce correct responses.

Contribution

It uncovers the surprising effectiveness of negative reinforcement alone in improving LLM reasoning, and proposes a simple modified objective that boosts performance across multiple benchmarks.

Findings

01 Negative reinforcement improves performance across various Pass@$k$ metrics.

02 Reinforcing only correct responses can reduce diversity and degrade performance.

03 Negative sample reinforcement suppresses incorrect outputs and redistributes probability mass.
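
The third finding can be illustrated on a toy softmax distribution. The sketch below is not the paper's experiment: the three-way distribution, the learning rate, and the names `softmax` and `nsr_step` are assumptions chosen for illustration. It takes one gradient step that lowers the log-probability of a single "incorrect" option and shows that the other options gain probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsr_step(logits, wrong_index, lr=1.0):
    """One gradient-ascent step on -log p(wrong_index) w.r.t. the logits.

    Hypothetical helper for illustration. The gradient of -log p_j with
    respect to logit z_i is p_i - 1[i == j], so the penalized logit drops
    by lr * (1 - p_j) and every other logit rises by lr * p_i.
    """
    probs = softmax(logits)
    return [
        z + lr * (p - (1.0 if i == wrong_index else 0.0))
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [2.0, 1.0, 0.0]          # option 0 is treated as the incorrect output
before = softmax(logits)
after = softmax(nsr_step(logits, wrong_index=0))

print("before:", [round(p, 3) for p in before])
print("after: ", [round(p, 3) for p in after])
```

In this toy setting the penalized option loses probability and the freed mass flows to the remaining options, with each logit rising by an amount proportional to that option's current probability. Whether the same picture holds at the scale of a full language model is what the paper investigates.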

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B, Qwen3-4B and Llama-3.1-8B-Instruct on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective: it consistently improves performance over the base model across the entire Pass@$k$ spectrum ($k$ up to 256), often matching or surpassing PPO and…
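
The PSR/NSR decomposition described in the abstract can be sketched as a split of a REINFORCE-style loss. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a reward of +1 for correct and -1 for incorrect responses, uses one sequence-level log-probability per response, and treats log-probabilities as plain floats (in training they would be differentiable tensors). The function name `rlvr_loss_terms` is hypothetical.

```python
def rlvr_loss_terms(samples):
    """Split a REINFORCE-style loss into PSR and NSR terms.

    samples: list of (log_prob, is_correct) pairs, one per sampled response.
    Assumes reward +1 for correct and -1 for incorrect responses, so the
    full loss is -(1/n) * sum(reward_i * log_prob_i) = psr + nsr.
    """
    n = len(samples)
    # PSR: minimizing this term raises the log-likelihood of correct responses.
    psr = -sum(lp for lp, ok in samples if ok) / n
    # NSR: minimizing this term lowers the log-likelihood of incorrect responses.
    nsr = sum(lp for lp, ok in samples if not ok) / n
    return psr, nsr

# Hypothetical log-probabilities for four sampled responses to one prompt.
samples = [(-1.0, True), (-2.0, False), (-3.0, False), (-0.5, True)]
psr, nsr = rlvr_loss_terms(samples)
full = -sum((1.0 if ok else -1.0) * lp for lp, ok in samples) / len(samples)

print("PSR term:", psr)
print("NSR term:", nsr)
print("PSR + NSR equals full loss:", abs(psr + nsr - full) < 1e-12)
```

Training with only the `nsr` term corresponds to the negative-only setting the abstract describes, and training with only the `psr` term corresponds to reinforcing correct responses alone.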

Code & Models

Repositories

tianhongzxy/rlvr-decomposed
PyTorch · Official

Videos

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning · SlidesLive

Taxonomy

Topics: Digital Rights Management and Security · Artificial Intelligence in Law · Multi-Agent Systems and Negotiation

Methods: Entropy Regularization · Balanced Selection · Proximal Policy Optimization