RubricBench: Aligning Model-Generated Rubrics with Human Standards
Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, Chen Ma

TL;DR
RubricBench is a benchmark for assessing how reliable rubric-based evaluation of LLMs is. It highlights significant gaps between human-written and model-generated rubrics in complex assessment scenarios.
Contribution
The paper introduces RubricBench, a curated benchmark of 1,147 pairwise comparisons, each a deliberately hard sample paired with expert-annotated, atomic rubrics, for assessing rubric-based evaluation of language models.
Findings
Models lag behind humans in rubric specification.
State-of-the-art models struggle with nuanced evaluation criteria.
RubricBench reveals significant performance gaps in current evaluation methods.
Abstract
As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a…
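To make the setup in the abstract concrete, here is a minimal sketch of how a pairwise comparison might be judged with atomic rubrics. It is an illustration under stated assumptions, not the paper's implementation. The `PairwiseSample` schema, the binary-check representation of a rubric, the fraction-satisfied scoring rule, and the tie-breaking choice are all hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: an atomic rubric is modeled as a binary check on a response.
# This is an illustrative simplification, not RubricBench's actual format.
Check = Callable[[str], bool]


@dataclass
class PairwiseSample:
    """Hypothetical schema for one pairwise comparison."""
    instruction: str
    response_a: str
    response_b: str
    rubrics: List[Check]      # atomic criteria derived from the instruction
    human_preference: str     # "A" or "B"


def rubric_score(response: str, rubrics: List[Check]) -> float:
    """Return the fraction of atomic rubrics the response satisfies."""
    if not rubrics:
        return 0.0
    return sum(1 for check in rubrics if check(response)) / len(rubrics)


def predict_preference(sample: PairwiseSample) -> str:
    """Prefer the response that satisfies more rubrics (ties go to A, arbitrarily)."""
    score_a = rubric_score(sample.response_a, sample.rubrics)
    score_b = rubric_score(sample.response_b, sample.rubrics)
    return "A" if score_a >= score_b else "B"


def pairwise_accuracy(samples: List[PairwiseSample]) -> float:
    """Return the share of samples where the rubric verdict matches the human label."""
    if not samples:
        return 0.0
    correct = sum(predict_preference(s) == s.human_preference for s in samples)
    return correct / len(samples)


# Toy sample: response A is longer and looks thorough (a surface bias),
# but it violates the one-sentence constraint stated in the instruction.
toy = PairwiseSample(
    instruction="Reply in exactly one sentence and mention the word 'rubric'.",
    response_a="A rubric lists criteria. It is useful. It helps graders stay fair.",
    response_b="A rubric lists the criteria a response must meet.",
    rubrics=[
        lambda r: "rubric" in r.lower(),   # mentions the required word
        lambda r: r.count(".") == 1,       # crude proxy for "exactly one sentence"
    ],
    human_preference="B",
)

print(predict_preference(toy))
print(pairwise_accuracy([toy]))
```

In this sketch the quality of the verdict depends entirely on the quality of the rubric list, which is the component the benchmark compares between human annotators and models.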
Taxonomy
Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Topic Modeling
