Autorubric: Unifying Rubric-based LLM Evaluation
Delip Rao, Chris Callison-Burch

TL;DR
Autorubric is an open-source framework that brings scattered rubric-based LLM evaluation techniques (ensemble judging, bias mitigation, few-shot calibration, and psychometric reliability metrics) together under opinionated defaults, and validates them on three benchmarks: RiceChem, ResearcherBench, and CHARM-100.
Contribution
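The paper introduces an open-source toolkit for rubric-based LLM evaluation. It supports analytic rubrics with binary, ordinal, and nominal criteria, single-judge and ensemble evaluation, few-shot calibration, bias mitigations, and psychometric reliability metrics, all with opinionated defaults. It also contributes CHARM-100, a new chatbot evaluation dataset with ground truth labels.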
Findings
Achieved 80% accuracy on RiceChem with 5-shot calibration.
Reached 87% binary accuracy on the new CHARM-100 dataset, with moderate-to-substantial agreement against ground truth labels.
Improved peer review scores from 0.47 to 0.85 using Autorubric explanations.
Abstract
Techniques for reliable rubric-based LLM evaluation -- ensemble judging, bias mitigation, few-shot calibration -- are scattered across papers with inconsistent terminology and partial implementations. We introduce Autorubric, an open-source framework that unifies these rubric-based LLM evaluation lessons with opinionated defaults: analytic rubrics with binary, ordinal, and nominal criteria; single-judge and ensemble evaluation; few-shot calibration; bias mitigations; and psychometric reliability metrics. We validate on three benchmarks: RiceChem (college chemistry grading, 80% accuracy with 5-shot calibration), ResearcherBench (deep research evaluation, 931 criteria, cross-judge agreement analysis), and CHARM-100, a new chatbot evaluation dataset combining all three criterion types with ground truth labels (87% binary accuracy, moderate-to-substantial agreement). Beyond measurement,…
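To make two of the techniques named in the abstract concrete, here is a minimal sketch of ensemble judging by majority vote over binary criteria, followed by Cohen's kappa as one common agreement statistic. This is not Autorubric's API: the function names, the MET/UNMET labels, and the verdict data are all hypothetical, and the paper may aggregate judges or measure reliability differently.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict; ties go to the verdict seen first."""
    return Counter(verdicts).most_common(1)[0][0]


def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts: one row per binary criterion, one column per judge.
judge_votes = [
    ["MET", "MET", "UNMET"],
    ["UNMET", "UNMET", "UNMET"],
    ["MET", "UNMET", "MET"],
    ["MET", "MET", "MET"],
    ["UNMET", "MET", "UNMET"],
    ["UNMET", "UNMET", "MET"],
]
# Hypothetical ground truth labels for the same six criteria.
human_labels = ["MET", "UNMET", "MET", "UNMET", "UNMET", "UNMET"]

ensemble = [majority_vote(row) for row in judge_votes]
accuracy = sum(e == h for e, h in zip(ensemble, human_labels)) / len(human_labels)
kappa = cohens_kappa(ensemble, human_labels)

print(f"accuracy={accuracy:.3f} kappa={kappa:.3f}")  # accuracy=0.833 kappa=0.667
```

Raw accuracy and kappa answer different questions: accuracy counts matches, while kappa discounts the matches two raters would reach by chance given their label frequencies. That is why a reliability metric is usually reported alongside accuracy when labels are imbalanced.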
