Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
Jane Paik Kim

TL;DR
This paper proposes a formal, statistically grounded framework for augmenting human evaluation with LLM judges, optimizing the balance between human and AI ratings for reliable assessment.
Contribution
It introduces a two-stage sampling design and a doubly robust estimator to determine optimal sample sizes, improving evaluation efficiency and reliability.
Findings
Proposes a two-stage sampling design for evaluation.
Uses a doubly robust estimator to handle missing data.
Provides guidance on allocating human vs. LLM ratings.
Abstract
Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for…
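To make the doubly robust idea concrete, here is a minimal sketch of a generic augmented inverse-probability-weighted (AIPW) estimate of the mean human rating. It is not taken from the paper and may differ from the estimator the paper defines. It assumes that an LLM rating is available for every item, that humans rate a random subset selected with a known probability, and that a straight-line working model links LLM ratings to human ratings. The function name `doubly_robust_mean` and all of its parameters are hypothetical.

```python
import numpy as np


def doubly_robust_mean(llm_scores, human_scores, labeled, sampling_prob):
    """Generic AIPW-style estimate of the mean human rating (illustrative only).

    llm_scores    : LLM rating for every item (length n)
    human_scores  : human ratings; entries where `labeled` is False are ignored
    labeled       : boolean mask, True where a human rating was collected
    sampling_prob : known probability that an item was sent to human review
                    (scalar or length-n array); assumed known by design
    """
    x = np.asarray(llm_scores, dtype=float)
    r = np.asarray(labeled, dtype=bool)
    # Zero out unlabeled entries so NaN placeholders cannot leak into the sum.
    y = np.where(r, np.asarray(human_scores, dtype=float), 0.0)
    pi = np.broadcast_to(np.asarray(sampling_prob, dtype=float), x.shape)

    # Working outcome model: least-squares line fitted on the labeled subset.
    slope, intercept = np.polyfit(x[r], y[r], deg=1)
    pred = intercept + slope * x

    # Model prediction for all items plus an inverse-probability-weighted
    # residual correction that uses only the human-rated items.
    correction = r / pi * (y - pred)
    return float(np.mean(pred + correction))


if __name__ == "__main__":
    # Simulated data, purely for illustration.
    rng = np.random.default_rng(0)
    n, p = 2000, 0.1
    llm = rng.normal(3.0, 1.0, size=n)
    human = 0.5 + 0.8 * llm + rng.normal(0.0, 0.5, size=n)
    picked = rng.random(n) < p
    observed = np.where(picked, human, np.nan)
    print(doubly_robust_mean(llm, observed, picked, p))
```

In this sketch the estimate stays consistent if either the working model or the sampling probability is correct. Because the sampling probability is fixed by the study design here, a poor fit between LLM and human ratings costs precision rather than introducing bias.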
