Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin, Gongjun Xu, Moulinath Banerjee, Yuekai Sun

TL;DR
This paper introduces Bridge, a statistical framework that aligns LLM evaluations with human judgments, improving agreement and revealing systematic differences between them.
Contribution
Bridge provides a novel, unified method to model and correct discrepancies between human and LLM evaluations, enhancing the reliability of LLM-based judging.
Findings
Bridge improves agreement with human ratings for six LLM judges on two benchmarks (BigGen Bench and Chatbot Arena).
It exposes systematic gaps between human and LLM judgments.
The framework comes with an efficient fitting algorithm and asymptotic guarantees for statistical inference.
Abstract
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy,…
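To make the modeling idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a simplified continuous-score setting in which the LLM score equals a latent human score plus a linear function of covariates plus noise. The function names (fit_linear_gap, correct_llm_scores), the covariates (an intercept and a standardized response length), and the simulated data are all hypothetical and chosen only for illustration. Ordinary least squares stands in here for the paper's fitting algorithm, which this summary does not spell out.

```python
import numpy as np


def fit_linear_gap(llm_scores, human_scores, covariates):
    """Least-squares estimate of gamma in: llm - human = covariates @ gamma + noise.

    Illustrative stand-in only; the paper's actual estimator may differ.
    """
    gap = np.asarray(llm_scores, dtype=float) - np.asarray(human_scores, dtype=float)
    X = np.asarray(covariates, dtype=float)
    gamma, *_ = np.linalg.lstsq(X, gap, rcond=None)
    return gamma


def correct_llm_scores(llm_scores, covariates, gamma):
    """Subtract the estimated systematic deviation from the raw LLM scores."""
    X = np.asarray(covariates, dtype=float)
    return np.asarray(llm_scores, dtype=float) - X @ gamma


# Simulated data (hypothetical): one latent human score per prompt-response pair.
rng = np.random.default_rng(0)
n = 2000
latent_human = rng.normal(size=n)

# Hypothetical covariates: an intercept and a standardized response length.
covariates = np.column_stack([np.ones(n), rng.normal(size=n)])
true_gamma = np.array([0.5, 0.8])  # assumed LLM bias: offset plus length preference

human = latent_human + 0.1 * rng.normal(size=n)
llm = latent_human + covariates @ true_gamma + 0.1 * rng.normal(size=n)

# Fit on a small human-labeled subset, then correct every LLM score.
labeled = slice(0, 500)
gamma_hat = fit_linear_gap(llm[labeled], human[labeled], covariates[labeled])
corrected = correct_llm_scores(llm, covariates, gamma_hat)

# Compare raw and corrected scores against human scores on the held-out pairs.
held_out = slice(500, n)
raw_error = np.mean(np.abs(llm[held_out] - human[held_out]))
corrected_error = np.mean(np.abs(corrected[held_out] - human[held_out]))
```

In this toy setup the fitted coefficients have a direct reading: the first is a constant offset between LLM and human scores, and the second is how much the LLM over-rewards longer responses. That mirrors the abstract's two stated uses, refining LLM ratings and characterizing systematic discrepancies, under the simplifying assumptions stated above.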
