RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following

Tianjun Pan; Xuan Lin; Wenyan Yang; Qianyu He; Shisong Chen; Licai Qi; Wanqing Xu; Hongwei Feng; Bo Xu; Yanghua Xiao

arXiv:2603.25133 · cs.AI · March 27, 2026

Open Access

TL;DR

RubricEval introduces a comprehensive benchmark for assessing the accuracy of rubric-based evaluations of instruction-following in large language models, highlighting current limitations and avenues for improving judge reliability.

Contribution

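RubricEval is the first rubric-level meta-evaluation benchmark for instruction following. It covers diverse instructions and responses across multiple categories and model sources, and it shows both where LLM judges fall short and what improves their accuracy.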

Findings

01 GPT-4o achieves 55.97% on the Hard subset

02 Rubric-level evaluation outperforms checklist-level evaluation

03 Explicit reasoning enhances judgment accuracy

Abstract

Rubric-based evaluation has become a prevailing paradigm for evaluating instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, calling for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level, failing to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we introduce RubricEval. Our benchmark features: (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, along with Easy/Hard subsets that better differentiate judge performance. Our experiments reveal that rubric-level judging remains far from solved: even GPT-4o, a widely adopted…
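
To make the distinction between response-level and rubric-level meta-evaluation concrete, here is a minimal Python sketch. It is not the paper's code: the `Instance` fields, the toy rubrics, and the rule that a response passes only when every rubric is met are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One (response, rubric) pair with a gold label and a judge verdict.

    Hypothetical schema for illustration; not the benchmark's actual format.
    """
    response_id: str
    rubric: str
    gold: bool   # human label: does the response satisfy this rubric?
    judge: bool  # LLM judge's verdict on the same rubric


def rubric_level_accuracy(instances):
    """Fraction of individual rubric verdicts where the judge matches gold."""
    if not instances:
        raise ValueError("no instances")
    return sum(i.judge == i.gold for i in instances) / len(instances)


def response_level_accuracy(instances):
    """Fraction of responses whose overall pass/fail matches gold.

    Assumption: a response passes only if all of its rubrics are satisfied.
    """
    if not instances:
        raise ValueError("no instances")
    by_response = {}
    for i in instances:
        by_response.setdefault(i.response_id, []).append(i)
    hits = sum(
        all(i.judge for i in group) == all(i.gold for i in group)
        for group in by_response.values()
    )
    return hits / len(by_response)


# Toy data (invented): the judge gets both rubrics of r1 wrong,
# yet still reaches the same overall "fail" verdict as the gold labels.
toy = [
    Instance("r1", "uses exactly three bullet points", gold=True, judge=False),
    Instance("r1", "written in a formal tone", gold=False, judge=True),
    Instance("r2", "under 100 words", gold=True, judge=True),
    Instance("r2", "mentions a deadline", gold=True, judge=True),
]

print(rubric_level_accuracy(toy))    # 0.5
print(response_level_accuracy(toy))  # 1.0
```

On this toy data the judge looks perfect at the response level while being wrong on half of the individual rubrics, which is the kind of gap a rubric-level benchmark is meant to surface.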

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning