Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
HyunJoon Jung, William Na

TL;DR
This paper evaluates the reliability of LLM-based agent judges for conversational AI, showing that quality scores and unique issue discoveries scale differently with panel size and that persona conditioning shapes the resulting evaluations.
Contribution
It shows that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation, and identifies a sublinear power-law relationship between panel size and unique issue discoveries.
Findings
Quality scores improve logarithmically with panel size.
Unique issue discoveries follow a sublinear power law with diminishing returns.
Persona conditioning influences the diversity and effectiveness of evaluations.
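The two scaling forms named in the findings can be compared with a minimal sketch. The coefficients `a`, `b`, `c`, and `beta` below are illustrative assumptions, not values reported in the paper; the point is only that both curves show diminishing returns per added judge, while the logarithmic curve's marginal gain decays faster.

```python
import math

# All coefficients are hypothetical, chosen only to illustrate curve shapes.
def log_score(n, a=3.0, b=0.4):
    """Logarithmic score curve: a + b * ln(n)."""
    return a + b * math.log(n)

def power_law_discoveries(n, c=5.0, beta=0.6):
    """Sublinear power law: c * n**beta with 0 < beta < 1."""
    return c * n ** beta

def marginal_gain(f, n):
    """Gain from adding one more judge to a panel of size n."""
    return f(n + 1) - f(n)

sizes = [1, 2, 4, 8, 16]
score_gains = [marginal_gain(log_score, n) for n in sizes]
discovery_gains = [marginal_gain(power_law_discoveries, n) for n in sizes]

# Fraction of the first judge's marginal gain that remains at n = 16.
score_retained = score_gains[-1] / score_gains[0]
discovery_retained = discovery_gains[-1] / discovery_gains[0]

for n, s, d in zip(sizes, score_gains, discovery_gains):
    print(f"n={n:2d}  score gain={s:.3f}  discovery gain={d:.3f}")
```

Under these assumed parameters, the score curve retains a smaller fraction of its initial marginal gain at larger panel sizes than the discovery curve does, which is the qualitative pattern the findings describe.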
Abstract
LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law. Both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power-law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble…
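The species-accumulation analogy can be made concrete with a small sketch. It assumes, hypothetically, that each judge independently detects issue i with probability p_i and that the p_i fall off as a power law in issue rank; neither the independence assumption nor the parameter values (`num_issues`, `p_max`, `alpha`) come from the paper. Under that model, the expected number of unique issues found by a panel of n judges is the sum over issues of 1 - (1 - p_i)^n.

```python
# Hypothetical model: heavy-tailed per-judge detection probabilities.
def detection_probs(num_issues=200, p_max=0.9, alpha=1.0):
    """Detection probability for the issue at each rank: p_max / rank**alpha."""
    return [p_max / (rank ** alpha) for rank in range(1, num_issues + 1)]

def expected_unique(n, probs):
    """Expected number of distinct issues found by n independent judges."""
    return sum(1.0 - (1.0 - p) ** n for p in probs)

probs = detection_probs()
panel_sizes = [1, 2, 4, 8, 16, 32]
coverage = [expected_unique(n, probs) for n in panel_sizes]

# Chance that a 3-judge panel finds the most common vs. the rarest issue.
common_found = 1.0 - (1.0 - probs[0]) ** 3
rare_found = 1.0 - (1.0 - probs[-1]) ** 3

for n, cov in zip(panel_sizes, coverage):
    print(f"panel size {n:2d}: expected unique issues = {cov:.1f}")
```

In this toy model, a small panel almost always catches the most frequently detected issue, while the rarest issues remain unlikely to surface until the panel grows, so coverage keeps rising slowly with panel size.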
