RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

Zhenwei Tang; Zhaoyan Liu; Rasa Hosseinzadeh; Tongzi Wu; Keyvan Golestan; Jesse C. Cresswell

arXiv:2605.21748·cs.CL·May 22, 2026

TL;DR

RankJudge is a benchmark generator for evaluating LLM-based judges on multi-turn, document-grounded conversations. It builds pairs of conversations in which one has a single flaw injected into one turn, so that judges can be assessed and ranked on how well they evaluate such conversations.

Contribution

RankJudge introduces a method for creating benchmarks of paired multi-turn conversations with injected flaws, supporting detailed evaluation of LLM judges across multiple domains.

Findings

01. RankJudge enables stable judge rankings under partial observability.

02. It effectively isolates failure categories to individual conversation turns.

03. Dynamic curation reduces label noise, improving evaluation accuracy.

Abstract

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled…
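The paired construction described in the abstract can be illustrated with a small Python sketch. This is not the RankJudge implementation: the names `Turn`, `inject_flaw`, `differing_turns`, and `toy_judge` are hypothetical, the example document and conversation are made up, and the labeling rule used here (the unmodified conversation is the preferred one) is an assumption, since the abstract is truncated before it describes labeling.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    """One user/assistant exchange (hypothetical structure)."""
    user: str
    assistant: str


Conversation = Tuple[Turn, ...]


def inject_flaw(clean: Conversation, turn_index: int, flawed_reply: str) -> Conversation:
    """Return a copy of `clean` with the assistant reply at `turn_index` replaced."""
    if not 0 <= turn_index < len(clean):
        raise IndexError("turn_index out of range")
    turns = list(clean)
    turns[turn_index] = replace(turns[turn_index], assistant=flawed_reply)
    return tuple(turns)


def differing_turns(a: Conversation, b: Conversation) -> List[int]:
    """Indices of turns where the two conversations differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def toy_judge(document: str, first: Conversation, second: Conversation) -> int:
    """Stand-in for an LLM judge: prefers the conversation with more replies
    found verbatim in the document. Returns 0 for `first`, 1 for `second`."""
    def grounded(conv: Conversation) -> int:
        return sum(t.assistant in document for t in conv)
    return 0 if grounded(first) >= grounded(second) else 1


# Made-up reference document and conversation, for illustration only.
document = "The warranty lasts 2 years. Returns are accepted within 30 days."
clean = (
    Turn("How long is the warranty?", "The warranty lasts 2 years."),
    Turn("Can I return it?", "Returns are accepted within 30 days."),
)
flawed = inject_flaw(clean, 1, "Returns are accepted within 90 days.")

# Assumed label: the unmodified conversation (index 0 here) is preferred.
judge_is_correct = toy_judge(document, clean, flawed) == 0
```

Because the two conversations differ in exactly one turn, `differing_turns(clean, flawed)` identifies where the flaw sits, which is what makes it possible to attribute a judge's miss to a specific turn.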

Code & Models

Repositories

layer6ai-labs/RankJudge
github

Datasets

Layer6/RankJudge
dataset· 170 dl
