Efficient Reasoning via Reward Model

Yuhao Wang; Xiaopeng Li; Cheng Gong; Ziru Liu; Suiyun Zhang; Rui Liu; Xiangyu Zhao

arXiv:2511.09158·cs.AI·November 13, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a novel conciseness reward model and a new reward formulation to improve reasoning efficiency and accuracy in large language models, reducing verbosity and computational costs.

Contribution

It proposes a Conciseness Reward Model and a new reward function that enhance reasoning effectiveness and efficiency, addressing overthinking issues in large language models.

Findings

1. Achieves an 8.1% accuracy improvement on mathematical benchmarks.
2. Reduces response token length by 19.9%.
3. Generalizes well to Llama and Mistral models.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs), enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking, which substantially increases computational costs. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find they frequently suffer from two critical issues: length collapse and training collapse, resulting in sub-optimal performance. To address them, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of a reasoning path. Additionally, we introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper proposes a novel perspective: optimizing with conciseness rewards.
2. The paper is well written, with clear motivation and mathematical presentation.
3. The experiments and ablation settings are reasonable.

Weaknesses

The method is reasonable and simple, so my main question concerns the experimental results. The reproduced baseline raises doubts about the effectiveness of the proposed method. The reproduced baseline in Table 2 differs significantly from the cited paper, especially since the paper that proposed Cos conducted the same experiments on Llama-3.1 8B. The token counts show that the baseline reproduced by the authors suffers from length collapse, whereas the original Cos paper did not. This raises doubts a

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper is easy to follow and well structured.
2. The proposed and trained reward model provides a new way to reward conciseness.

Weaknesses

My main concerns lie in the **evaluation design**.

1. The primary results focus on *Pass@K*, which is not a standard metric for evaluating current Large Reasoning Models (LRMs). While Pass@K is meaningful, a paper aiming to balance **accuracy and reasoning efficiency** should also report *Average@K* for fairer comparison with prior work.
2. The evaluated models are not representative of today’s **reasoning or thinking models**. The reported token reduction (from ~500 to ~300) appears modest,
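To make the metric distinction in this review concrete, the following is a minimal sketch of the two quantities. It assumes the commonly used unbiased pass@k estimator and takes Average@K to mean the fraction of correct responses among the K samples; the paper's exact evaluation protocol is not given on this page, and the function names `pass_at_k` and `average_at_k` are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled responses, c: number of correct ones,
    k: evaluation budget. Returns the probability that at least one
    of k responses drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect responses exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_at_k(correct_flags: list[bool]) -> float:
    """Assumed definition of Average@K: mean accuracy over the K samples."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(correct_flags) / len(correct_flags)


# Hypothetical example: 4 sampled responses, only the first is correct.
flags = [True, False, False, False]
p_at_4 = pass_at_k(n=len(flags), c=sum(flags), k=4)
avg_at_4 = average_at_k(flags)
```

With one correct response out of four, pass@4 is 1.0 while Average@4 is 0.25, which illustrates why a model can look strong under Pass@K while most of its individual samples are wrong.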

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. **Clear problem identification.** The paper explicitly defines two major failure modes in current reward-based reasoning optimization: *length collapse* and *training collapse*. These phenomena are real, empirically verified, and practically important for RLHF/RLVR research.
2. **Elegant reward formulation.** The proposed Conciseness Reward Function (CRF) rewards brevity only when the reasoning is correct, effectively preventing reward hacking and over-penalization of valid long r
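The gating idea described in this review (brevity is rewarded only when the answer is correct) can be sketched as follows. This is not the paper's CRF, whose exact formula is not shown on this page; the function name `gated_conciseness_reward`, the additive form, the `weight` parameter, and the assumption that the conciseness score lies in [0, 1] are all hypothetical choices made for illustration.

```python
def gated_conciseness_reward(
    is_correct: bool,
    conciseness: float,
    weight: float = 0.5,
) -> float:
    """Illustrative correctness-gated reward (not the paper's formula).

    is_correct:  outcome of a verifiable answer check.
    conciseness: score assumed to lie in [0, 1], e.g. produced by a
                 conciseness reward model; higher means more concise.
    weight:      hypothetical scale of the conciseness bonus.
    """
    if not 0.0 <= conciseness <= 1.0:
        raise ValueError("conciseness score is assumed to lie in [0, 1]")
    if not is_correct:
        # Incorrect answers earn nothing, however short they are,
        # so the policy cannot raise its reward by merely truncating.
        return 0.0
    # Correct answers earn a base reward plus a bounded brevity bonus.
    return 1.0 + weight * conciseness


# A correct but verbose response still beats an incorrect concise one.
verbose_correct = gated_conciseness_reward(True, 0.1)
concise_wrong = gated_conciseness_reward(False, 1.0)
```

Under these assumptions the bonus only reorders correct responses among themselves, which is one way to avoid penalizing a valid long derivation below an incorrect short one.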

Weaknesses

1. **Over-reliance on the Qwen family models.** All main experiments, including baselines, reward models, and CRM training, are built upon Qwen2.5 and Qwen-Math checkpoints. Since Qwen’s pre-training corpus contains **similar reasoning data**, there is a strong possibility of **data contamination** or leakage. This compromises the fairness and generalizability of results, especially when all baselines share the same Qwen backbone.
2. **Limited model diversity.** Although

Reviewer 04 · Rating 2 · Confidence 5

Strengths

1. Proposes a novel pipeline that incorporates a compact Conciseness Reward Model (CRM) to enable more efficient LLM reasoning.
2. Provides comprehensive empirical evaluations across multiple LLM backbones, including Qwen, LLaMA, and Mistral.

Weaknesses

1. The proposed idea of distilling a compact CRM from a large and powerful LLM is intuitive but lacks rigorous analysis. It remains unclear how the large model reliably generates accurate conciseness scores for reasoning paths. Moreover, if training efficiency is not a concern, it would be valuable to justify why the authors do not directly use Qwen-72B-Instruct to compute conciseness scores.
2. The baseline comparison is incomplete. Several representative post-training variants, such as DAPO,

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare