Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase

TL;DR
This paper systematically compares Large Reasoning Models (LRMs) and non-reasoning LLMs, showing LRMs excel in judgment accuracy and robustness but still suffer from biases, which can be mitigated by the proposed PlanJudge strategy.
Contribution
This is the first systematic study of LRMs as judges, documenting both their advantages over non-reasoning LLMs and their remaining biases. It also introduces PlanJudge, a lightweight prompting strategy that mitigates those biases.
Findings
LRMs outperform non-reasoning LLMs in judgment accuracy, particularly on reasoning-intensive tasks.
LRMs follow evaluation instructions more faithfully and are more robust against adversarial attacks that target judgment tasks.
PlanJudge reduces biases in LLM-based judgments without sacrificing accuracy.
Abstract
This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior evaluation instruction-following capabilities; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong evaluation biases. To mitigate this bias vulnerability, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in LLM-as-a-Judge while preserving overall judgment accuracy.
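The plan-then-judge idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the two-call split (the plan is drafted before the responses are shown), the `Verdict: A/B` output convention, and the names `PLAN_PROMPT`, `JUDGE_PROMPT`, `parse_verdict`, and `plan_then_judge` are all hypothetical. The `generate` argument stands in for any function that sends a prompt to a model and returns its text.

```python
from typing import Callable, Dict, Optional

# Hypothetical prompt templates; the paper's actual wording is not given here.
PLAN_PROMPT = (
    "You will later compare two responses to the instruction below.\n"
    "First, write a short evaluation plan listing the criteria you will "
    "check. Do not give a verdict yet.\n\n"
    "Instruction:\n{instruction}\n"
)

JUDGE_PROMPT = (
    "Follow the evaluation plan to compare the two responses.\n\n"
    "Evaluation plan:\n{plan}\n\n"
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "End with a final line of the form 'Verdict: A' or 'Verdict: B'."
)


def parse_verdict(text: str) -> Optional[str]:
    """Return 'A' or 'B' from the last 'Verdict:' line, or None if absent."""
    for line in reversed(text.strip().splitlines()):
        cleaned = line.strip().upper()
        if cleaned.startswith("VERDICT:"):
            choice = cleaned.split(":", 1)[1].strip()
            if choice in ("A", "B"):
                return choice
    return None


def plan_then_judge(
    generate: Callable[[str], str],
    instruction: str,
    response_a: str,
    response_b: str,
) -> Dict[str, Optional[str]]:
    """Ask for an explicit evaluation plan, then judge using that plan."""
    # Step 1: the model drafts its evaluation plan before judging.
    plan = generate(PLAN_PROMPT.format(instruction=instruction))
    # Step 2: the model executes the judgment, conditioned on its own plan.
    judgment = generate(
        JUDGE_PROMPT.format(
            plan=plan,
            instruction=instruction,
            response_a=response_a,
            response_b=response_b,
        )
    )
    return {"plan": plan, "judgment": judgment, "verdict": parse_verdict(judgment)}
```

Passing the model call in as a plain function keeps the sketch independent of any particular API client. Whether PlanJudge uses one prompt or two is not stated in the abstract, so the two-call structure here is only one plausible reading.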
