CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning
Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Weiwen Liu, Haoxuan Li, Yong Yu, Weinan Zhang, Mengyue Yang

TL;DR
This paper introduces CoLD, a framework that reduces length bias in process reward models for mathematical reasoning, leading to more accurate and concise multi-step reasoning in large language models.
Contribution
CoLD employs counterfactual reasoning, explicit length penalties, and a learned bias estimator to mitigate length bias in reward models for mathematical problem solving.
Findings
CoLD improves step selection accuracy in mathematical reasoning tasks.
It encourages more concise and logically valid reasoning outputs.
The approach enhances downstream reinforcement learning performance and generalizes across domains.
Abstract
Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed…
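To make the first component concrete, here is a minimal sketch of what an explicit length-penalty adjustment could look like when choosing among candidate reasoning steps. The linear form of the penalty, the whitespace token count, the coefficient `alpha`, and the function names are all assumptions made for illustration; the abstract does not specify how CoLD computes its adjustment.

```python
from typing import List, Tuple


def length_penalized_score(raw_score: float, num_tokens: int, alpha: float) -> float:
    """Subtract a penalty proportional to step length from a raw PRM score.

    The linear penalty is an illustrative assumption, not the paper's formula.
    """
    return raw_score - alpha * num_tokens


def pick_step(candidates: List[Tuple[str, float]], alpha: float) -> str:
    """Return the candidate step text with the highest length-adjusted score.

    Each candidate is (step_text, raw_prm_score). Length is approximated by
    whitespace-separated tokens, which is a simplification.
    """
    best_text, _ = max(
        candidates,
        key=lambda c: length_penalized_score(c[1], len(c[0].split()), alpha),
    )
    return best_text


# Two steps with the same conclusion; the verbose one gets a slightly
# higher raw score, mimicking the length bias described above.
candidates = [
    ("x = 2", 0.80),
    ("We can clearly see after careful thought that x = 2", 0.82),
]

unpenalized_choice = pick_step(candidates, alpha=0.0)
penalized_choice = pick_step(candidates, alpha=0.01)
```

With `alpha=0.0` the verbose step is selected because of its higher raw score, while `alpha=0.01` shifts the choice to the concise step. A fixed penalty like this is only the simplest of the three components; the learned bias estimator and joint training described in the abstract are not represented here.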
