Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

Junhyuk Choi; Sohhyung Park; Chanhee Cho; Hyeonchu Park; Bugeun Kim

arXiv:2602.00521 · cs.AI · February 3, 2026

TL;DR

This paper introduces a two-phase diagnostic framework, based on Item Response Theory (IRT), for evaluating the reliability of LLM-as-a-Judge. It examines two dimensions, intrinsic consistency under prompt variations and alignment with human judgments, and yields interpretable signals for systematic diagnosis.

Contribution

The paper presents an IRT-based framework for diagnosing the reliability of LLM-as-a-Judge along two dimensions: stability under prompt variations and alignment with human judgments.

Findings

01. IRT-GRM yields interpretable diagnostic signals

02. Framework effectively assesses stability under prompt variations

03. Identifies causes of unreliability in LLM judgments

Abstract

While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for systematically diagnosing judgments. These…
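
For readers unfamiliar with the Graded Response Model named in the abstract, the following is a minimal sketch of the standard (Samejima) GRM for one item with ordered score categories. It is not taken from the paper: the function name, the parameter values, and the mapping of scores to categories are illustrative assumptions, and the paper's exact parameterization is not shown in this excerpt. In the standard model, the probability of scoring at or above category k is a logistic function of a(theta - b_k), and each category probability is the difference between adjacent cumulative probabilities.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the standard graded response model.

    theta      -- latent trait value (hypothetical example: quality of a response)
    a          -- discrimination parameter (assumed positive)
    thresholds -- strictly increasing thresholds b_1 < ... < b_{K-1}

    Returns a list of K probabilities, one per ordered score category.
    """
    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Cumulative probabilities P(X >= k), padded with 1 for the lowest
    # category and 0 beyond the highest category.
    cumulative = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]

    # P(X = k) is the difference between adjacent cumulative probabilities.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Illustrative values only: a five-category rating scale.
probs = grm_category_probs(theta=0.5, a=1.2, thresholds=[-1.0, 0.0, 1.0, 2.0])
for category, p in enumerate(probs, start=1):
    print(f"P(score = {category}) = {p:.3f}")
```

Because the thresholds are increasing, the cumulative probabilities decrease and every category probability is positive; the probabilities sum to one by construction.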

Taxonomy

Topics: Psychometric Methodologies and Testing · Explainable Artificial Intelligence (XAI) · Reliability and Agreement in Measurement