Autorubric: Unifying Rubric-based LLM Evaluation

Delip Rao; Chris Callison-Burch

arXiv:2603.00077 · cs.CL · April 7, 2026

TL;DR

Autorubric is an open-source framework that unifies techniques for rubric-based LLM evaluation (ensemble judging, bias mitigation, few-shot calibration, and psychometric reliability metrics) under opinionated defaults, validated on three benchmarks: RiceChem, ResearcherBench, and CHARM-100.

Contribution

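The paper introduces an open-source toolkit for rubric-based LLM evaluation that supports analytic rubrics with binary, ordinal, and nominal criteria, single-judge and ensemble evaluation, few-shot calibration, and bias mitigations, all with opinionated defaults. It also contributes CHARM-100, a new chatbot evaluation dataset with ground truth labels.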

Findings

01

Achieved 80% accuracy on RiceChem with 5-shot calibration.

02

Reached 87% binary accuracy on the CHARM-100 dataset, with moderate-to-substantial κ.

03

Improved peer review scores from 0.47 to 0.85 using Autorubric explanations.

Abstract

Techniques for reliable rubric-based LLM evaluation -- ensemble judging, bias mitigation, few-shot calibration -- are scattered across papers with inconsistent terminology and partial implementations. We introduce Autorubric, an open-source framework that unifies these rubric-based LLM evaluation lessons with opinionated defaults: analytic rubrics with binary, ordinal, and nominal criteria; single-judge and ensemble evaluation; few-shot calibration; bias mitigations; and psychometric reliability metrics. We validate on three benchmarks: RiceChem (college chemistry grading, 80% accuracy with 5-shot calibration), ResearcherBench (deep research evaluation, 931 criteria, cross-judge agreement analysis), and CHARM-100, a new chatbot evaluation dataset combining all three criterion types with ground truth labels (87% binary accuracy, moderate-to-substantial κ). Beyond measurement,…
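The abstract refers to ensemble judging and to agreement statistics such as κ without showing how they are computed. The sketch below is a generic illustration in plain Python and is not Autorubric's API: the function names `majority_vote` and `cohen_kappa`, the toy votes, and the gold labels are all hypothetical. It assumes a simple majority-vote ensemble over one binary criterion and uses Cohen's κ as the chance-corrected agreement measure; the paper may aggregate judges or measure reliability differently.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among judge votes.

    Ties go to the label seen first (Counter.most_common ordering).
    """
    return Counter(votes).most_common(1)[0][0]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equal length")
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: three judges score six responses on one binary criterion.
judge_votes = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]]
gold = [1, 0, 1, 1, 1, 0]

ensemble = [majority_vote(v) for v in judge_votes]
accuracy = sum(e == g for e, g in zip(ensemble, gold)) / len(gold)
kappa = cohen_kappa(ensemble, gold)
print(f"accuracy={accuracy:.2f} kappa={kappa:.2f}")
```

Reporting κ next to raw accuracy matters because accuracy alone can look high when one label dominates; κ discounts the agreement expected by chance.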
