RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
Tianjun Pan, Xuan Lin, Wenyan Yang, Qianyu He, Shisong Chen, Licai Qi, Wanqing Xu, Hongwei Feng, Bo Xu, Yanghua Xiao

TL;DR
RubricEval introduces a comprehensive benchmark for assessing the accuracy of rubric-based evaluations of instruction-following in large language models, highlighting current limitations and avenues for improving judge reliability.
Contribution
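It presents the first rubric-level meta-evaluation benchmark for instruction following, with 3,486 quality-controlled instances spanning multiple instruction categories and model sources, and shows that rubric-level judging by LLMs remains far from solved.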
Findings
GPT-4o achieves 55.97% on the Hard subset
Rubric-level evaluation outperforms checklist-level evaluation
Explicit reasoning enhances judgment accuracy
Abstract
Rubric-based evaluation has become a prevailing paradigm for evaluating instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, calling for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level, failing to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we introduce RubricEval. Our benchmark features: (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, along with Easy/Hard subsets that better differentiate judge performance. Our experiments reveal that rubric-level judging remains far from solved: even GPT-4o, a widely adopted…
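The difference between response-level and rubric-level meta-evaluation can be illustrated with a small sketch. This is not the paper's protocol: the record fields, the toy data, and the rule that a response passes only when every rubric item is satisfied are all assumptions made for illustration. It shows how a judge can agree with humans on the overall verdict while still misjudging individual rubric items.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricJudgment:
    """One (response, rubric item) pair. Field names are hypothetical."""
    response_id: str
    rubric: str
    human_label: bool  # human annotation: is the rubric item satisfied?
    judge_label: bool  # LLM judge verdict for the same item


def rubric_level_accuracy(items):
    """Fraction of individual rubric items where judge and human agree."""
    if not items:
        raise ValueError("no judgments given")
    return sum(i.human_label == i.judge_label for i in items) / len(items)


def response_level_accuracy(items):
    """Agreement on the per-response verdict.

    Assumption: a response passes only if all of its rubric items are satisfied.
    """
    if not items:
        raise ValueError("no judgments given")
    grouped = defaultdict(list)
    for i in items:
        grouped[i.response_id].append(i)
    agree = 0
    for group in grouped.values():
        human_pass = all(i.human_label for i in group)
        judge_pass = all(i.judge_label for i in group)
        agree += human_pass == judge_pass
    return agree / len(grouped)


# Toy data: the judge gets both rubric items of r1 wrong, yet reaches the
# same overall verdict (fail) as the human annotators.
EXAMPLE = [
    RubricJudgment("r1", "uses exactly three bullet points", True, False),
    RubricJudgment("r1", "stays under 100 words", False, True),
    RubricJudgment("r2", "written in formal tone", True, True),
    RubricJudgment("r2", "mentions the deadline", True, True),
]

print(rubric_level_accuracy(EXAMPLE))    # 0.5
print(response_level_accuracy(EXAMPLE))  # 1.0
```

In this toy case the response-level score is perfect while half of the rubric-level judgments are wrong, which is the kind of error a response-level benchmark cannot surface.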
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning
