Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung; Faeze Brahman; Yejin Choi

arXiv:2407.18370 · cs.LG · July 29, 2024 · 3 cites

TL;DR

This paper introduces a framework for LLM-based evaluation with provable guarantees of human agreement. It combines selective evaluation, confidence estimation, and cascaded judge models so that the judge's verdicts agree with humans at a user-specified level.

Contribution

The paper proposes a novel selective evaluation framework with confidence estimation and cascaded models, providing provable guarantees of human agreement in LLM assessments.

Findings

01

Guarantees agreement with human judgments at a user-specified level.

02

Uses cheaper models as initial judges and escalates to stronger models only when necessary.

03

Achieves over 80% human agreement with cost-effective models.

Abstract

We present a principled approach to provide LLM-based evaluation with a rigorous guarantee of human agreement. We first propose that a reliable evaluation method should not uncritically rely on model preferences for pairwise evaluation, but rather assess the confidence of judge models and selectively decide when to trust their judgement. We then show that under this selective evaluation framework, human agreement can be provably guaranteed -- such that the model evaluation aligns with that of humans to a user-specified agreement level. As part of our framework, we also introduce Simulated Annotators, a novel confidence estimation method that significantly improves judge calibration and thus enables high coverage of evaluated instances. Finally, we propose Cascaded Selective Evaluation, where we use cheaper models as initial judges and escalate to stronger models only when necessary --…
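
The trust-or-escalate control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Tier` and `cascaded_selective_eval`, the toy length-based judges, and the threshold values are all hypothetical. In particular, the sketch assumes each judge already returns a confidence score and that each threshold has been calibrated elsewhere; the calibration step that yields the guarantee, and the Simulated Annotators confidence estimator, are not shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A judge takes two candidate responses and returns
# (preferred label "a" or "b", confidence in [0, 1]).
Judge = Callable[[str, str], Tuple[str, float]]


@dataclass
class Tier:
    name: str
    judge: Judge
    threshold: float  # assumed to be calibrated elsewhere (not shown here)


def cascaded_selective_eval(
    a: str, b: str, tiers: Sequence[Tier]
) -> Tuple[Optional[str], Optional[str]]:
    """Try judges from cheapest to strongest; trust the first confident one.

    Returns (label, tier_name), or (None, None) when every judge abstains.
    """
    for tier in tiers:
        label, confidence = tier.judge(a, b)
        if confidence >= tier.threshold:
            return label, tier.name  # trust this judge
        # otherwise escalate to the next, stronger judge
    return None, None  # abstain: no judge was confident enough


# Toy stand-ins for real LLM judges (illustrative only): both prefer the
# longer response; the cheap judge is only confident when the gap is large.
def cheap_judge(a: str, b: str) -> Tuple[str, float]:
    gap = abs(len(a) - len(b))
    return ("a" if len(a) >= len(b) else "b", min(1.0, 0.5 + gap / 20))


def strong_judge(a: str, b: str) -> Tuple[str, float]:
    return ("a" if len(a) >= len(b) else "b", 0.95)


tiers = [
    Tier("cheap", cheap_judge, threshold=0.9),
    Tier("strong", strong_judge, threshold=0.9),
]
```

With these toy judges, a pair with a large length gap is decided by the cheap tier, a near-tie is escalated to the strong tier, and raising the strong tier's threshold above its confidence makes the cascade abstain. Abstained instances are the ones left unevaluated, which is why better-calibrated confidence translates into higher coverage.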
