An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability

Yusuke Yamauchi; Taro Yano; Masafumi Oyamada

arXiv:2506.13639 · cs.CL · June 17, 2025 · 2 citations

TL;DR

This study investigates how different design choices in using large language models as evaluators affect the reliability and alignment of their judgments with human preferences, emphasizing evaluation criteria, sampling methods, and reasoning strategies.

Contribution

The paper provides an empirical analysis of the factors that influence the reliability of LLM-based evaluation, highlighting the importance of evaluation criteria and sampling strategies.

Findings

01. Evaluation criteria are crucial for reliability.

02. Non-deterministic sampling improves alignment with human judgments.

03. Chain-of-Thought reasoning offers minimal gains when clear evaluation criteria are present.

Abstract

As large language models (LLMs) continue to advance, reliable evaluation methods are essential, particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGENBench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.
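
The contrast between deterministic and non-deterministic judging can be illustrated with a minimal sketch. It assumes a judge that returns an integer score per response and that several sampled scores are combined into one. The abstract does not specify the paper's aggregation or correlation procedure, so the helper names (`aggregate_mean`, `aggregate_majority`, `pearson`) and the choice of Pearson correlation as the alignment measure are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def aggregate_mean(scores):
    """Average several sampled judge scores into one value (assumed scheme)."""
    return mean(scores)


def aggregate_majority(scores):
    """Return the most frequent sampled score; ties go to the lowest score."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(s for s, c in counts.items() if c == top)


def pearson(xs, ys):
    """Pearson correlation between judge scores and human scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    sampled = [[4, 5, 4], [2, 2, 3], [5, 5, 5]]  # several samples per response
    human = [4, 2, 5]                            # one human score per response
    judged = [aggregate_mean(s) for s in sampled]
    print(pearson(judged, human))
```

Averaging keeps information about how the sampled scores are spread, while majority voting yields a single score on the original scale; which of the two, if either, matches the paper's setup is not stated here.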

Taxonomy

Topics: Legal Education and Practice Innovations · Law, Economics, and Judicial Systems · Artificial Intelligence in Law