How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi

TL;DR
This paper examines how access to reasoning chains influences large language models' ability to judge answer correctness, revealing that reasoning quality and fluency significantly impact judgment accuracy.
Contribution
It provides a systematic analysis of how reasoning chains affect LLM-based judgment, highlighting the challenges in distinguishing genuine reasoning from superficial fluency.
Findings
Weak judges are easily misled by fluent but incorrect reasoning.
Strong judges can partially leverage reasoning but are still misled by high-quality reasoning.
Both fluency and factuality of reasoning chains are critical signals for judgment.
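One way to make the findings above measurable is to compare a judge's verdicts under two conditions: seeing the answer alone, and seeing the answer together with the generator's reasoning chain. The sketch below is a minimal illustration, not the authors' evaluation code. The `Judgment` record and the two metric functions are hypothetical names introduced here. The false-accept rate (the share of incorrect answers the judge accepts) is one plausible way to quantify "misled by fluent but incorrect reasoning".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Judgment:
    """One judged item (hypothetical record, not from the paper)."""
    answer_is_correct: bool  # gold label for the generator's answer
    judge_accepts: bool      # judge's verdict on that answer


def accuracy(judgments: Sequence[Judgment]) -> float:
    """Fraction of items where the judge's verdict matches the gold label."""
    if not judgments:
        return 0.0
    hits = sum(j.answer_is_correct == j.judge_accepts for j in judgments)
    return hits / len(judgments)


def false_accept_rate(judgments: Sequence[Judgment]) -> float:
    """Fraction of incorrect answers that the judge nevertheless accepts."""
    wrong = [j for j in judgments if not j.answer_is_correct]
    if not wrong:
        return 0.0
    return sum(j.judge_accepts for j in wrong) / len(wrong)
```

Computing both metrics once for the answer-only condition and once for the answer-plus-reasoning condition, on the same items, would show whether exposing the reasoning raises the false-accept rate for a given judge.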
Abstract
Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information when assessing answer correctness. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning…
