Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski; Stephan Kleber

arXiv:2603.22214·cs.CR·March 24, 2026

Open Access

TL;DR

This paper evaluates the reliability of large language models acting as automated judges for assessing other LLMs, demonstrating high correlation with human judgments across various models and prompts.

Contribution

It systematically investigates the effectiveness of 37 LLMs, multiple prompts, and fine-tuned models as automated judges, providing insights into their reliability and agreement with human assessments.

Findings

01 High correlation of LLM judges with human assessments when using suitable prompts.

02 GPT-4o and certain open-source models show strong performance as judges.

03 Automated judging scales effectively across diverse evaluation categories.

Abstract

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. Automating the analysis in this way scales up the complex evaluation of the victim models' free-form text outputs, because judgments are faster and more consistent than those of human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Because the technique is comparatively new, the reliability of LLMs as judges and their agreement with human judgment have not yet been thoroughly investigated. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5…
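The abstract's definition of a judge (one model plus one judge prompt carrying the criteria) and the question of agreement with human judgment can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Judge` class, the `toy_model` stand-in, the PASS/FAIL reply format, the sample data, and the choice of Cohen's kappa as the agreement measure. None of it is taken from the paper, which may use different prompts, verdict formats, and metrics.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Judge:
    """Hypothetical judge: one model combined with one judge prompt."""
    model: Callable[[str], str]   # assumed interface: prompt text in, reply text out
    prompt_template: str          # holds the criteria and an "{output}" placeholder

    def verdict(self, victim_output: str) -> bool:
        # Fill the victim model's output into the judge prompt and ask the model.
        reply = self.model(self.prompt_template.format(output=victim_output))
        # Assumed reply format: the model answers with PASS or FAIL.
        return reply.strip().upper().startswith("PASS")


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two lists of binary verdicts."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("verdict lists must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both raters gave the same constant verdict throughout
    return (observed - expected) / (1 - expected)


def toy_model(prompt: str) -> str:
    # Stand-in for a real LLM call: flags any prompt containing "UNSAFE".
    return "FAIL" if "UNSAFE" in prompt else "PASS"


judge = Judge(
    model=toy_model,
    prompt_template=(
        "Criteria: the answer must not contain harmful instructions.\n"
        "Answer: {output}\n"
        "Reply with PASS or FAIL."
    ),
)

# Made-up victim outputs and made-up human verdicts, for illustration only.
victim_outputs = ["hello", "UNSAFE text", "fine", "UNSAFE again"]
human_verdicts = [True, False, True, True]

judge_verdicts = [judge.verdict(o) for o in victim_outputs]
kappa = cohen_kappa(judge_verdicts, human_verdicts)
print(judge_verdicts, kappa)
```

Swapping `toy_model` for a real model call or changing `prompt_template` yields a different judge under this definition, so comparing several models and prompts amounts to recomputing the agreement score for each combination.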

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning