TL;DR
This paper introduces diagnostic tools for assessing the reliability of LLM-based judges in NLG evaluation: a transitivity analysis that exposes per-input inconsistencies, and conformal prediction sets that measure reliability per instance.
Contribution
It presents a transitivity analysis and a split conformal prediction methodology to evaluate and improve the reliability of LLM judges, supported by theoretical coverage guarantees and evidence of cross-judge agreement.
Findings
Widespread per-input inconsistency is masked by low aggregate violation rates.
Prediction set width correlates with document difficulty and shows cross-judge agreement.
Relevance is judged most reliably, while fluency and consistency are less reliable.
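The first finding rests on a transitivity check, which can be illustrated with a minimal sketch. It assumes the judge's output for one document has been reduced to strict pairwise preferences between candidate summaries with no ties. The function names, the `prefs` dictionary layout, and the example data are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def count_directed_3cycles(prefers, items):
    """Count triples whose pairwise preferences form a directed cycle.

    prefers maps an ordered pair (x, y) to True when the judge prefers
    x over y. Only one orientation of each pair needs to be stored.
    Assumption: preferences are strict (no ties).
    """
    def beats(x, y):
        if (x, y) in prefers:
            return prefers[(x, y)]
        return not prefers[(y, x)]

    cycles = 0
    for a, b, c in combinations(items, 3):
        # A triple is intransitive if a > b > c > a or a < b < c < a.
        if beats(a, b) and beats(b, c) and beats(c, a):
            cycles += 1
        elif beats(b, a) and beats(c, b) and beats(a, c):
            cycles += 1
    return cycles

def violation_rate(prefers, items):
    """Fraction of all triples that are cyclic."""
    n = len(items)
    triples = n * (n - 1) * (n - 2) // 6
    return count_directed_3cycles(prefers, items) / triples if triples else 0.0

# Hypothetical judge preferences for one document with four summaries.
prefs = {
    ("s1", "s2"): True,
    ("s2", "s3"): True,
    ("s1", "s3"): False,  # s3 beats s1, closing the cycle s1 > s2 > s3 > s1
    ("s1", "s4"): True,
    ("s2", "s4"): True,
    ("s3", "s4"): True,
}
summaries = ["s1", "s2", "s3", "s4"]
n_cycles = count_directed_3cycles(prefs, summaries)
rate = violation_rate(prefs, summaries)
print(n_cycles, rate)
```

In this toy case one of the four triples is cyclic. A document can therefore count as inconsistent (at least one 3-cycle) even when the aggregate rate over all triples stays low, which is the masking effect the finding describes.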
Abstract
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates (-), with - of documents exhibiting at least one directed 3-cycle; and split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed coverage, with set width serving as a per-instance reliability indicator (, , , pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement (-), demonstrating it captures document-level difficulty rather than judge-specific noise.…
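The split conformal construction mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes the nonconformity score is the absolute difference between the judge's score and a human reference score on a held-out calibration split. The calibration data, the `alpha` value, and all function names are hypothetical.

```python
import math

LIKERT = [1, 2, 3, 4, 5]

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        # Too few calibration points: only the full label set is valid.
        return float("inf")
    return sorted(scores)[k - 1]

def prediction_set(judge_score, q):
    """All Likert labels within q of the judge's score."""
    return [y for y in LIKERT if abs(y - judge_score) <= q]

# Hypothetical calibration split: judge scores and human reference scores.
cal_judge = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2, 3, 4, 5, 3, 2, 4, 3, 5, 4]
cal_human = [4, 3, 4, 2, 5, 3, 5, 2, 4, 2, 3, 3, 5, 3, 2, 4, 4, 5, 4]

# Assumed nonconformity score: absolute error of the judge.
residuals = [abs(j - h) for j, h in zip(cal_judge, cal_human)]
q = conformal_quantile(residuals, alpha=0.1)

# Prediction set for a new summary that the judge scored 4.
new_set = prediction_set(4, q)
print(q, new_set, len(new_set))
```

Under exchangeability of calibration and test examples, sets built this way contain the reference score with probability at least 1 - alpha. With a single global threshold as above, set width varies only through clipping at the ends of the 1-5 scale, so a per-instance width signal like the one the abstract describes would need a score that adapts to each input, for example one derived from the judge's distribution over labels.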
