Libra: Assessing and Improving Reward Model by Learning to Think

Meng Zhou; Bei Li; Jiahao Liu; Xiaowen Shi; Yang Bai; Rongxiang Weng; Jingang Wang; Xunliang Cai

arXiv:2507.21645·cs.CL·July 30, 2025


TL;DR

This paper introduces Libra, a framework that pairs a new reasoning-oriented benchmark with a learning-to-think approach to improve reward models for complex reasoning, targeting the reliance of current RL training on annotated reference answers and constrained output formats.

Contribution

It presents Libra Bench, a benchmark for evaluating reward models on challenging reasoning problems, and Libra-RM, a generative reward model trained with learning-to-think methods.

Findings

01

Libra-RM achieves state-of-the-art results on reasoning benchmarks.

02

Libra Bench effectively evaluates reasoning performance.

03

Libra-RM shows potential to improve reasoning with unlabeled data.

Abstract

Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios, and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) the dependence on finely annotated reference answers to attain rewards; and 2) the requirement for a constrained output format. These limitations fundamentally hinder further RL data scaling and sustained enhancement of model reasoning performance. To address these limitations, we propose a comprehensive framework for evaluating and improving the performance of reward models in complex reasoning scenarios. We first present a reasoning-oriented benchmark (Libra Bench), systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models,…
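
To make the two limitations named in the abstract concrete, here is a minimal sketch of a rule-based, reference-based reward. It is an illustration only, not the paper's method: the function names `extract_boxed` and `rule_based_reward`, the `\boxed{...}` answer convention, and the exact-string comparison are all assumptions chosen for the example.

```python
import re
from typing import Optional

# Assumed answer convention: the final answer appears as \boxed{...}
# (no nested braces handled in this sketch).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last boxed answer in the response, if any."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 only when the boxed answer equals the reference string."""
    answer = extract_boxed(response)
    if answer is None:
        # Limitation 2: a correct answer in any other format earns nothing.
        return 0.0
    # Limitation 1: the comparison is impossible without an annotated reference.
    return 1.0 if answer == reference.strip() else 0.0


if __name__ == "__main__":
    print(rule_based_reward(r"So the result is \boxed{42}.", "42"))
    print(rule_based_reward("So the result is 42.", "42"))
```

In this sketch the second call receives the same correct answer but scores zero because it is not boxed, and neither call could be scored at all for a problem that has no reference answer.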

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper is relatively well motivated: it investigates whether reasoning-style generative RMs are better judges on reasoning tasks than existing RMs or LLM-as-judge approaches. The reported performance gains of the RMs on Libra Bench are very significant, indicating the effectiveness of the proposed method.

Weaknesses

The biggest and potentially fatal issue of the current work is data contamination. In particular, both the benchmark (Libra Bench) and the RM (Libra-RM) rely on rollouts generated by models in the R1 model family (DeepSeek-R1, R1-distilled SLMs). As such, it is very difficult to gauge how much of the performance gain that Libra-RM has over competing methods comes from the data distribution. Additionally, the various design choices in the Libra-RM experiments are not ablated. For exam

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Proposes a clear and comprehensive framework integrating a new reasoning-oriented benchmark (Libra Bench) and generative reward models (Libra-RM). The benchmark is well designed, with challenging math problems and advanced reasoning models, and effectively tests RM correctness.
2. The method is well motivated and clearly presented. Experimental settings are thorough, showing strong state-of-the-art results and meaningful correlation with downstream reasoning performance.

Weaknesses

1. The work does not appear highly original to me, as similar directions have been explored in prior studies such as [1] and [2]. It would be helpful if the authors could discuss these related works in more detail or include a comparison in the experiments.
2. The training data is mainly distilled from several reasoning models, so the generalization to unseen models or broader domains remains uncertain. Providing experiments or analysis on cross-model transferability would strengthen the paper.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* Comprehensive baseline methods are presented in the experiments, including evaluating 15 different reward models on the proposed dataset and 6 different reward models for downstream applications.
* More than 200 questions with responses from 5 different DeepSeek and Qwen variants are collected and annotated in the corpus.

Weaknesses

* The experimental results in Table 3 appear incomplete. A statistical significance test is missing, which is crucial for demonstrating that the reported improvements are meaningful. Some performance gains, such as those between Libra-RM-32B-MATH, DeepSeek-R1, and Qwen3-32B in Table 2, are relatively marginal. Without significance testing, it is difficult to assess the contribution of the proposed method.
* The experimental setup for the unverifiable reasoning scenario (Section 6) is insufficien

Code & Models

Datasets

meituan/Libra-Bench
dataset· 48 dl
