RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
Zhengyang Qi, Charles Dickens, Derek Pham, Amanda Dsouza, Armin Parchami, Frederic Sala, Paroma Varma

TL;DR
RIFT introduces a systematic taxonomy for diagnosing failure modes in rubric-based evaluation of language models, enabling better understanding and improvement of rubric design.
Contribution
The paper develops a comprehensive failure mode taxonomy for rubrics, grounded in diverse data sources, and proposes automated metrics aligned with human annotations.
Findings
Identified eight failure modes across three categories in rubric design.
Annotators reached 87% agreement and a Cohen's kappa of 0.64 when applying the taxonomy.
Automated metrics match human failure annotations with an F1 of up to 0.925.
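To make the three statistics above concrete, here is a minimal sketch of how raw agreement, Cohen's kappa, and F1 are typically computed from binary labels. The label lists and function names are hypothetical toy values chosen for illustration; they are not the paper's data, and the paper's exact evaluation protocol may differ.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels independently according to their own label frequencies.
    p_expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def f1_score(gold, pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = "rubric exhibits the failure mode", 0 = "it does not".
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
annotator_b = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
metric_flags = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]  # assumed output of an automated metric

print(f"agreement: {percent_agreement(annotator_a, annotator_b):.3f}")
print(f"kappa:     {cohens_kappa(annotator_a, annotator_b):.3f}")
print(f"F1:        {f1_score(annotator_a, metric_flags):.3f}")
```

Kappa is lower than raw agreement because it discounts the matches two annotators would produce by chance, which is why the two figures are usually reported together.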
Abstract
Rubric-based evaluation is widely used in LLM benchmarks and training pipelines for open-ended, less verifiable tasks. While prior work has demonstrated the effectiveness of rubrics using downstream signals such as reinforcement learning outcomes, there remains no principled way to diagnose how a rubric itself fails from such aggregated or downstream signals alone. To address this gap, we introduce RIFT: RubrIc Failure mode Taxonomy, a taxonomy for systematically characterizing failure modes in rubric composition and design. RIFT consists of eight failure modes organized into three high-level categories: Reliability Failures, Content Validity Failures, and Consequential Validity Failures. RIFT is developed using grounded theory by iteratively annotating rubrics drawn from five diverse data sources spanning general instruction following, code generation, creative writing, and…
