Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

Changyuan Tian, Zhicong Lu, Shuang Qian, Nayu Liu, Peiguang Li, Li Jin, Leiyi Hu, Zhizhao Zeng, Sirui Wang, Ke Zeng, and Zhi Guo

arXiv:2511.10303 · cs.CL · November 14, 2025

TL;DR

This paper introduces a perplexity-aware reinforcement learning approach that improves the critique accuracy of large language models on multi-step math reasoning by correcting their bias toward judging lower-perplexity solutions as correct.

Contribution

The paper identifies and quantifies an imbalanced evaluation preference in LLMs and proposes a novel perplexity-aware reinforcement learning method to rectify this bias, improving critiquing performance.

Findings

01 LLMs tend to judge lower-perplexity solutions as correct.

02 The proposed method effectively rectifies the bias, improving critiquing accuracy.

03 Experimental results validate the approach on OPS and other benchmarks.
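For readers unfamiliar with the quantity behind the first finding, the following is a minimal sketch of the standard perplexity definition: the exponential of the negative mean per-token log-probability. It is not the paper's measurement procedure. The function name `perplexity` and the two lists of log-probabilities are hypothetical and chosen only for illustration.

```python
import math


def perplexity(token_logprobs):
    """Return exp(-mean(log p)) for a sequence of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities for two candidate solutions.
# These numbers are made up; a real setup would read them from a language model.
fluent_solution = [math.log(0.5)] * 4      # every token assigned probability 0.5
surprising_solution = [math.log(0.1)] * 4  # every token assigned probability 0.1

ppl_fluent = perplexity(fluent_solution)          # approximately 2
ppl_surprising = perplexity(surprising_solution)  # approximately 10

# The solution the model finds more predictable has the lower perplexity.
print(ppl_fluent < ppl_surprising)
```

Under this definition, a critic biased in the way the first finding describes would be more inclined to accept the first solution than the second, independent of whether either is actually correct.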

Abstract

To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason -- imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs' critiquing…

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Multimodal Machine Learning Applications