RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
Zhenwei Tang, Zhaoyan Liu, Rasa Hosseinzadeh, Tongzi Wu, Keyvan Golestan, Jesse C. Cresswell

TL;DR
RankJudge is a benchmark generator for evaluating LLM-based judges on multi-turn, document-grounded conversations. It builds conversation pairs in which one conversation has a single flaw injected into one turn, so that judges can be assessed and ranked on their evaluation quality.
Contribution
RankJudge introduces a method for generating benchmarks of paired multi-turn conversations with injected flaws, supporting detailed evaluation of LLM judges across multiple domains.
Findings
RankJudge enables stable judge rankings under partial observability.
It effectively isolates failure categories to individual conversation turns.
Dynamic curation reduces label noise, improving evaluation accuracy.
Abstract
As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled…
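The paired construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Turn` type, the `make_pair` function, the `inject_flaw` callable, and the choice to alter only assistant turns are all hypothetical names and assumptions introduced here for illustration. The sketch shows only the structural idea that the two conversations in a pair differ in exactly one turn.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class Turn:
    """One conversation turn (hypothetical structure, not from the paper)."""
    role: str  # "user" or "assistant"
    text: str


Conversation = Tuple[Turn, ...]


def make_pair(
    clean: Conversation,
    inject_flaw: Callable[[str], str],
    rng: random.Random,
) -> Tuple[Conversation, Conversation, int]:
    """Return (clean, flawed, flawed_turn_index).

    Exactly one turn is altered. Restricting the flaw to assistant turns
    is an assumption of this sketch, as is the inject_flaw placeholder,
    which stands in for whatever flaw-injection procedure is actually used.
    """
    candidates = [i for i, turn in enumerate(clean) if turn.role == "assistant"]
    if not candidates:
        raise ValueError("conversation has no assistant turn to alter")
    idx = rng.choice(candidates)
    flawed = list(clean)
    flawed[idx] = replace(clean[idx], text=inject_flaw(clean[idx].text))
    return clean, tuple(flawed), idx


if __name__ == "__main__":
    conversation = (
        Turn("user", "What does the document say about refunds?"),
        Turn("assistant", "Refunds are available within 30 days."),
        Turn("user", "And after that?"),
        Turn("assistant", "After 30 days, only store credit is offered."),
    )
    # Toy flaw: append an unsupported claim to the chosen turn.
    _, flawed, idx = make_pair(
        conversation,
        lambda text: text + " Shipping is also always free.",
        random.Random(0),
    )
    print(f"flawed turn index: {idx}")
    print(flawed[idx].text)
```

Because the pair differs in a single known turn, the index returned by `make_pair` records where the flaw sits, which is one way the turn-level isolation listed under Findings could be represented.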
