Verdict: A Library for Scaling Judge-Time Compute
Nimit Kalra, Leonard Tang

TL;DR
Verdict is an open-source library that improves the accuracy, reliability, and interpretability of LLM-based judges by composing modular reasoning units and increasing inference-time compute. It achieves competitive performance on tasks such as content moderation, fact-checking, and hallucination detection.
Contribution
We introduce Verdict, a modular framework that scales judge-time compute to improve LLM judge quality across multiple evaluation tasks.
Findings
Verdict judges achieve performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models.
Verdict improves the reliability and interpretability of automated evaluations.
Verdict is effective across tasks such as content moderation, fact-checking, and hallucination detection.
Abstract
The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units (such as verification, debate, and aggregation) and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieve performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Our framework establishes a foundation for scalable, interpretable, and reliable LLM-based evaluation systems for both researchers and…
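The idea of composing reasoning units can be illustrated with a minimal sketch. The code below is not Verdict's API: every name in it (`Judge`, `Verifier`, `judge_then_verify`, `majority_vote`, `ensemble`, and the stand-in judges) is hypothetical, and the judges are plain Python functions rather than LLM calls. It assumes binary verdicts, a verifier that flips a verdict it rejects, and majority voting as the aggregation step.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical types for illustration only; none of these names come from Verdict.
Judge = Callable[[str], Tuple[bool, str]]     # text -> (verdict, explanation)
Verifier = Callable[[str, bool, str], bool]   # text, verdict, explanation -> accept?


def judge_then_verify(judge: Judge, verifier: Verifier, text: str) -> bool:
    """Run one judge, then let a verifier accept its verdict or flip it.

    Flipping on rejection is a simplifying assumption that only makes
    sense for binary verdicts.
    """
    verdict, explanation = judge(text)
    return verdict if verifier(text, verdict, explanation) else not verdict


def majority_vote(votes: List[bool]) -> bool:
    """Aggregate boolean verdicts; ties resolve to False."""
    counts = Counter(votes)
    return counts[True] > counts[False]


def ensemble(judges: List[Judge], verifier: Verifier, text: str) -> bool:
    """Adding judges spends more judge-time compute on a single decision."""
    return majority_vote([judge_then_verify(j, verifier, text) for j in judges])


# Stand-in judges; in practice each would wrap a model call.
def keyword_judge(text: str) -> Tuple[bool, str]:
    flagged = "attack" in text.lower()
    return flagged, ("contains 'attack'" if flagged else "no keyword match")


def length_judge(text: str) -> Tuple[bool, str]:
    flagged = len(text) > 40
    return flagged, f"length={len(text)}"


def accept_all(text: str, verdict: bool, explanation: str) -> bool:
    """Trivial verifier that never overrides the judge."""
    return True
```

Calling `ensemble([keyword_judge, keyword_judge, length_judge], accept_all, "Plan the attack")` runs three judge-then-verify units and returns their majority verdict. Swapping `accept_all` for a stricter verifier, or adding more judges, changes the amount of compute spent per decision without changing the pipeline's shape.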
Taxonomy
Topics: Dispute Resolution and Class Actions · Artificial Intelligence in Law
Methods: Lib
