Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting
Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu

TL;DR
This paper introduces S-GRPO, a noise-aware enhancement to group-relative policy optimization, significantly improving the robustness and performance of large reasoning models under noisy reward signals.
Contribution
We propose S-GRPO, a novel method that derives optimal advantage weights to stabilize training in the presence of reward noise, outperforming DR. GRPO on mathematical reasoning benchmarks.
Findings
S-GRPO achieves performance gains of +2.2% to +2.5% over DR. GRPO on mathematical reasoning benchmarks, depending on the model.
S-GRPO maintains stable learning under 20% synthetic reward noise.
Standard GRPO fails with high noise, while S-GRPO remains effective.
Abstract
Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO's effectiveness and robustness. On various models, S-GRPO significantly outperforms DR. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while…
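The abstract's claim that noise hurts most in unbalanced groups can be illustrated with a small simulation. The sketch below is not the paper's derivation and does not compute S-GRPO's optimal weights. It assumes binary rewards, the symmetric flip-noise model that the reviews below attribute to the paper, and a mean-centered group advantage without standard-deviation normalization. All function names are hypothetical.

```python
import random


def flip_rewards(rewards, p, rng):
    """Flip each binary reward with probability p (assumed symmetric noise)."""
    return [1 - r if rng.random() < p else r for r in rewards]


def centered_advantages(rewards):
    """Mean-centered group advantage r_i - mean(r) (assumed simplification)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def posterior_true_positive(q, p):
    """P(true reward = 1 | observed reward = 1) by Bayes' rule.

    q is the fraction of truly correct responses in the group,
    p is the symmetric flip probability.
    """
    return (1 - p) * q / ((1 - p) * q + p * (1 - q))


def mean_noisy_advantages(rewards, p, trials=20000, seed=0):
    """Monte Carlo estimate of the expected advantage under flipped rewards."""
    rng = random.Random(seed)
    totals = [0.0] * len(rewards)
    for _ in range(trials):
        noisy = flip_rewards(rewards, p, rng)
        for i, a in enumerate(centered_advantages(noisy)):
            totals[i] += a
    return [t / trials for t in totals]


if __name__ == "__main__":
    group = [1, 0, 0, 0, 0, 0, 0, 0]  # unbalanced: 1 correct out of 8
    p = 0.2
    print("clean advantage of the correct response:", centered_advantages(group)[0])
    print("mean noisy advantage:", mean_noisy_advantages(group, p)[0])
    print("P(truly correct | observed correct), q=1/8:", posterior_true_positive(1 / 8, p))
    print("P(truly correct | observed correct), q=1/2:", posterior_true_positive(1 / 2, p))
```

Under these assumptions the expected noisy advantage equals the clean advantage scaled by (1 - 2p), and with one truly correct response in eight and p = 0.2, an observed "correct" label is more likely to be a false positive than a true one. In a balanced group the same noise rate leaves observed positives mostly reliable, which is one way to read the abstract's remark about unbalanced groups.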
Peer Reviews
Decision: Submitted to ICLR 2026
1. Clear Problem Definition and In-depth Analysis: The paper provides a remarkably thorough analysis of the impact of the "think-answer mismatch" within the GRPO framework. Instead of merely discussing the general harm of noise, it precisely identifies, through mathematical derivation, that "group imbalance" is the key factor amplifying the effect of noise. This insight is both profound and enlightening. 2. Elegant and Theoretically Grounded Methodology: S-GRPO is not presented as an empirical "
1. Limitations of the Symmetric Noise Model: A core assumption of the proposed method is the presence of symmetric noise. However, in practical scenarios, the probabilities of false positives and false negatives may not be symmetric. 2. Dependence on the Noise Rate p and Insufficient Guidance for its Selection: The paper observes that the optimal value of p is related to the model's scale, which is a valuable finding. However, it does not offer clear guidance on how to efficiently estimate an ap
1. The paper clearly defines the Think-Answer Mismatch problem in GRPO and provides a convincing analysis of its impact on learning stability. 2. The proposed S-GRPO introduces a principled noise-aware reweighting method derived from a theoretical foundation. 3. Experimental results show consistent, reproducible improvements across several mathematical reasoning benchmarks.
1. The method depends on a manually set noise parameter p, which may require case-specific tuning. 2. The assumption of symmetric reward noise simplifies the training environment but may not accurately represent real-world mismatch patterns where errors are often asymmetric. 3. The robustness experiments are based on artificially injected synthetic noise levels (up to 20%), which likely exceed the noise typically observed in real reasoning datasets, raising questions about whether such high
1. This paper tries to address a significant problem: the issue of false positives where the "Thinking" process and the final "Answer" do not align. 2. The approach is novel to me. Instead of the common practice of trying to identify errors within the thinking process, this paper proposes to reweight the advantage values to mitigate the impact of this mismatch. 3. The paper is well-written, clear, and easy to understand.
(If my understanding is incorrect, please correct me) 1. **Concerns about generalization.** The derivation, starting from Equations 5 and 6 and continuing to Equation 11, seems entirely predicated on the assumption that the random variable $r$ (reward) follows a Bernoulli distribution. However, this assumption may not always hold in practice. For instance, one might use $\{-1, 1\}$ reward pairs or even continuous rewards. In such scenarios, how would the proposed method be formalized? Would th
Taxonomy
Topics: Topic Modeling · Logic, Reasoning, and Knowledge
