Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, Tianlu Wang

TL;DR
This paper introduces EvalPlanner, a novel method that enhances LLM-based evaluation by generating and executing unconstrained evaluation plans, leading to improved judgment accuracy and state-of-the-art performance on multiple benchmarks.
Contribution
The paper proposes EvalPlanner, a preference optimization algorithm that separates planning from reasoning, enabling more effective and flexible evaluation of LLM responses.
Findings
Achieves a state-of-the-art score of 93.9 on RewardBench.
Outperforms previous models on RM-Bench, JudgeBench, and FollowBenchEval.
Demonstrates the importance of planning in LLM-based evaluation models.
Abstract
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed…
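The plan, execution, judgment ordering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `judge` function, the `Judgment` container, the prompt wording, the use of three separate model calls, and the choice to condition the plan on the instruction alone are all assumptions made for illustration. The `llm` argument is a hypothetical callable that maps a prompt string to a completion string, and `stub_llm` is a canned stand-in so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judgment:
    plan: str        # free-form evaluation plan (no fixed template)
    execution: str   # reasoning produced by following the plan
    verdict: str     # "A" or "B"


def judge(instruction: str, response_a: str, response_b: str,
          llm: Callable[[str], str]) -> Judgment:
    """Hypothetical three-stage judge: plan first, then execute, then decide."""
    # Stage 1: draft an unconstrained plan (assumption: from the instruction only).
    plan = llm(f"PLAN\nInstruction: {instruction}\n"
               "Write a plan for evaluating responses to this instruction.")
    # Stage 2: execute the plan against both candidate responses.
    execution = llm(f"EXECUTE\nPlan: {plan}\n"
                    f"Response A: {response_a}\nResponse B: {response_b}\n"
                    "Follow the plan step by step.")
    # Stage 3: reduce the executed reasoning to a final verdict.
    raw = llm(f"VERDICT\nReasoning: {execution}\nAnswer with A or B.")
    verdict = raw.strip().upper()[:1]
    if verdict not in ("A", "B"):
        raise ValueError(f"unparseable verdict: {raw!r}")
    return Judgment(plan=plan, execution=execution, verdict=verdict)


def stub_llm(prompt: str) -> str:
    """Canned stand-in for a real model, keyed on the stage tag in the prompt."""
    if prompt.startswith("PLAN"):
        return "1. Check correctness. 2. Check completeness."
    if prompt.startswith("EXECUTE"):
        return "Response A is correct; Response B contains an arithmetic error."
    return " a "  # deliberately messy to exercise verdict parsing
```

Calling `judge("What is 2 + 2?", "4", "5", stub_llm)` returns a `Judgment` whose `plan`, `execution`, and `verdict` fields hold the three stages separately, which is the separation of planning from reasoning that the abstract emphasizes. Keeping the plan as a free-form string, rather than a fixed list of criteria or verification questions, mirrors the "unconstrained" property.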
Taxonomy
Topics: Legal Education and Practice Innovations · Artificial Intelligence in Law
