Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Peter Chen; Xiaopeng Li; Ziniu Li; Wotao Yin; Xi Chen; Tianyi Lin

arXiv:2512.16912·cs.LG·January 27, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates the paradoxical effects of spurious rewards and entropy minimization in reinforcement learning with verifiable rewards, revealing how clipping bias and reward misalignment influence model confidence and reasoning performance.

Contribution

It uncovers the role of clipping bias in reducing policy entropy and proposes a reward-misalignment model to explain spurious rewards' benefits in RLVR.
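
As background for the clipping bias mentioned above, the block below gives the standard PPO-style clipped surrogate and the usual definition of policy entropy. It is a sketch in common textbook notation: the symbols $r_t$, $\hat{A}_t$, and $\epsilon$ are assumptions for illustration and may not match the paper's own formulation.

```latex
% Standard clipped surrogate (textbook form, assumed notation)
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
J^{\mathrm{clip}}(\theta) = \mathbb{E}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right]

% Policy entropy at state s
\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)
  = -\sum_{a} \pi_\theta(a \mid s)\,\log \pi_\theta(a \mid s)
```

Because the clip term cuts off updates once $r_t$ leaves $[1-\epsilon,\, 1+\epsilon]$, the effective update depends on how the ratio is truncated as well as on the advantage, which is the general mechanism a "clipping bias" refers to.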

Findings

01

Clipping bias reduces policy entropy, leading to more deterministic outputs.

02

Entropy minimization alone does not improve reasoning performance.

03

Spurious rewards can enhance performance through reward misalignment and model contamination.
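
Findings 01 and 02 can be illustrated with a toy example. The sketch below is not the paper's method: it uses temperature sharpening as a hypothetical stand-in for any update that lowers entropy, applied to a made-up distribution over three candidate answers in which the most likely answer is incorrect.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpen(probs, temperature):
    """Renormalize probs ** (1 / temperature).

    A temperature below 1 concentrates mass on the current mode,
    which lowers entropy regardless of whether the mode is correct.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical policy over three candidate answers (illustrative numbers).
# Index 0 is the correct answer, but index 1 (incorrect) is the mode.
policy = [0.3, 0.5, 0.2]
correct_index = 0

sharpened = sharpen(policy, temperature=0.5)

print("entropy before:", entropy(policy))
print("entropy after: ", entropy(sharpened))
print("P(correct) before:", policy[correct_index])
print("P(correct) after: ", sharpened[correct_index])
```

In this toy case entropy falls while the probability of the correct answer also falls, since the extra confidence flows to the incorrect mode. Lower entropy alone therefore says nothing about accuracy.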

Abstract

This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The authors carefully investigate open questions related to RLVR and present interesting insights. The combination of empirical and theoretical analysis is compelling.

Weaknesses

1. The presentation of the motivation for each analysis and key takeaways could be clearer. A few general suggestions would be to explicitly state the goal of each analysis at the beginning of each section, to lead with the empirical results that the theory aims to explain, and to use more descriptive figure captions that discuss the key conclusion from each figure. At a few points reading through the paper, it was not clear to me why an analysis was being conducted and what conclusion the aut

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper tackles an important and timely question in reinforcement learning for large language models.
- Rigorous theoretical analysis linking clipping and entropy, extending prior accounts.
- The reward-misalignment model offers a probabilistic explanation for the benefits of random rewards.
- The paper is well motivated, interesting, and clearly presented.

Weaknesses

- The evaluations are concentrated on MATH500, and ablations on hyperparameters (e.g., clipping ratio, group size) are missing.
- Some findings, while formalized, may only confirm intuitively expected behaviors (e.g., entropy minimization failing when incorrect trajectories are the peak of the distribution).

Reviewer 03 · Rating 4 · Confidence 2

Strengths

• Interesting theoretical framing: The paper offers a formal treatment linking clipping bias and policy entropy, contributing to a better conceptual understanding of RLVR optimization dynamics.
• Relevance to ongoing discourse: Given recent debates about “spurious reward” effects and entropy minimization in reasoning LLMs, this paper addresses a timely and relevant topic for the ICLR community.
• Clarity of theoretical results: The analytical sections (Theorems 3.3–4.2) are clearly presented and

Weaknesses

• Experimental design inconsistency: Although the paper claims “extensive experiments across multiple model families and sizes,” not all experiments are conducted uniformly; rather, different subsets of experiments use different models and data. This fragmented setup makes it difficult to assess the generality of the conclusions.
• Limited empirical depth: The experiments do not convincingly support the claim of “reconciling conflicting reports” in the literature. The evaluation scope remains na

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Reinforcement Learning in Robotics