Think Twice: Branch-and-Rethink Reasoning Reward Model

Yizhu Jiao; Jiaqi Zeng; Julien Veron Vialard; Oleksii Kuchaiev; Jiawei Han; Olivier Delalleau

arXiv:2510.23596 · cs.CL · January 30, 2026

TL;DR

This paper introduces branch-and-rethink reward modeling (BR-RM), a two-turn process that first selects a small set of instance-critical evaluation dimensions and then re-evaluates hypotheses about them in a targeted second pass, improving results on three reward modeling benchmarks.

Contribution

The paper proposes a novel two-turn reward model that mimics think-twice reasoning, reducing judgment diffusion and improving sensitivity to errors in reward evaluation.

Findings

1. Achieves state-of-the-art results on three reward modeling benchmarks.
2. Reduces judgment diffusion by focusing on critical evaluation dimensions.
3. Enhances sensitivity to subtle errors through targeted rethinking.

Abstract

Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters…
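
The two turns described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the pairwise comparison setup, the prompt wording, the candidate dimension list beyond factuality and safety, the `[[1]]`/`[[2]]` verdict marker, and the names `branch_and_rethink`, `Judgment`, and `toy_generate` are all assumptions made for illustration. The model call is a plain callable so the flow can be run without any real model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical candidate pool; the abstract names only factuality and safety.
CANDIDATE_DIMENSIONS = ["factuality", "safety", "instruction following", "reasoning"]


@dataclass
class Judgment:
    dimensions: List[str]      # dimensions selected in turn 1
    turn1: str                 # branching output (dimensions and hypotheses)
    turn2: str                 # rethinking output (analysis and verdict)
    preferred: Optional[int]   # 1 or 2, or None if no verdict could be parsed


def branch_and_rethink(
    prompt: str,
    response_1: str,
    response_2: str,
    generate: Callable[[str], str],
    max_dims: int = 2,
) -> Judgment:
    context = f"Prompt: {prompt}\nResponse 1: {response_1}\nResponse 2: {response_2}\n"

    # Turn 1 (adaptive branching): ask for a few critical dimensions and hypotheses.
    turn1 = generate(
        "TURN 1\n" + context
        + f"Pick at most {max_dims} critical dimensions from {CANDIDATE_DIMENSIONS} "
        + "and state a short hypothesis for each."
    )
    lowered = turn1.lower()
    dimensions = [d for d in CANDIDATE_DIMENSIONS if d in lowered][:max_dims]

    # Turn 2 (branch-conditioned rethinking): re-read, conditioned on turn 1 only.
    turn2 = generate(
        "TURN 2\n" + context
        + f"Turn 1 notes: {turn1}\n"
        + f"Re-read both responses, checking only {dimensions}. "
        + "End with [[1]] or [[2]]."
    )
    # Assumed verdict format; a missing marker is reported as None.
    match = re.search(r"\[\[([12])\]\]", turn2)
    preferred = int(match.group(1)) if match else None
    return Judgment(dimensions, turn1, turn2, preferred)


def toy_generate(text: str) -> str:
    """Canned stand-in for a real model call, used only to exercise the flow."""
    if text.startswith("TURN 1"):
        return "Dimensions: factuality, safety. Hypothesis: Response 2 gets the year wrong."
    return "Response 2 states the wrong year; Response 1 is correct. Verdict: [[1]]"


result = branch_and_rethink("When did Apollo 11 land?", "July 1969.", "July 1968.", toy_generate)
print(result.dimensions, result.preferred)
```

In this sketch the second call sees the first turn's output, so the final judgment is conditioned on the selected dimensions rather than on every possible criterion at once.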

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper's core idea is reasonable. Diagnosing "judgment diffusion" and transferring the "think-twice" principle from solvers (LLMs) to judges (RMs) is a clever and logical contribution.
2. The paper is supported by strong SOTA results across three diverse benchmarks and is validated by exceptionally comprehensive ablation studies that justify each design component.
3. The paper is well-written and clearly structured. The core problem and the proposed solution are easy to understand.
4. T

Weaknesses

1. Cost: The BR-RM is a two-stage generative model. Compared to a scalar RM, which requires a single forward pass, this approach introduces substantial latency and complexity, especially during RLHF training where the RM is called millions of times. The paper doesn't quantify this two-turn cost, making its practical viability for large-scale application questionable.
2. The method relies heavily on a predefined "universal set of criteria" and "task-specific evaluation hierarchies". The perform

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The observations of focus dilution and shallow analysis are sound.
2. The benchmark performances are strong.

Weaknesses

1. Lack of insights. The paper lacks in-depth analysis and the ablations are not informative (only benchmark scores).
2. Too much inductive bias. Many important design choices are manually picked without much validation.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* The paper identifies judgment diffusion in reward models and motivates a focused second pass that aims to allocate test-time compute where risk is highest. The concept and naming are crisp and intuitive.
* The strict formatting penalty plus binary outcome reward is easy to implement and aligns with the evaluation objective. The paper also shows why finer-grained scoring or extra branch rewards underperform.
* BR-RM-Qwen-14B achieves 92.1 on RewardBench, 85.9 on RM Bench, and 74.7 on RMB, pr

Weaknesses

* The paper highlights best averages, but several baselines appear very recent and some cells are missing. It would help to provide complete, reproducible comparison tables and lock evaluations with identical prompting and sampling across all methods. The current Table 1 summary is helpful but not fully auditable from the text alone.
* The format penalty is large in magnitude, and the same terminal reward is assigned to both turns uniformly across tokens. This could incentivize shortest valid t
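
The reward design that Reviewer 03 discusses (a strict formatting penalty, a binary outcome reward, and one terminal reward shared uniformly by the tokens of both turns) can be sketched as follows. The numeric values and the names `FORMAT_PENALTY`, `terminal_reward`, and `per_token_rewards` are hypothetical; the reviews say only that the penalty is "large in magnitude" and do not give the exact values or the training code.

```python
from typing import List, Optional, Tuple

# Hypothetical values: the reviews do not state the actual magnitudes.
FORMAT_PENALTY = -100.0
CORRECT_REWARD = 1.0
INCORRECT_REWARD = 0.0


def terminal_reward(format_ok: bool, predicted: Optional[int], gold: int) -> float:
    """Binary outcome reward, overridden by a penalty when the output format is invalid."""
    if not format_ok or predicted is None:
        return FORMAT_PENALTY
    return CORRECT_REWARD if predicted == gold else INCORRECT_REWARD


def per_token_rewards(
    reward: float, turn1_len: int, turn2_len: int
) -> Tuple[List[float], List[float]]:
    """Assign the same terminal reward to every token of both turns."""
    return [reward] * turn1_len, [reward] * turn2_len


reward = terminal_reward(format_ok=True, predicted=1, gold=1)
turn1_rewards, turn2_rewards = per_token_rewards(reward, turn1_len=3, turn2_len=5)
print(turn1_rewards, turn2_rewards)
```

Because every token receives the same value in this sketch, the signal does not distinguish which turn, or which step within a turn, contributed to the outcome.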
