Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling

Kate Sanders, Nathaniel Weir, Sapana Chaudhary, Kaj Bostrom, and Huzefa Rangwala

arXiv:2602.06795·cs.CL·February 9, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a data-driven method to create detailed reasoning error taxonomies, called rubrics, which improve error detection and reward modeling for LLMs in technical domains, reducing reliance on costly gold labels.

Contribution

It presents a novel approach to automatically generate granular reasoning rubrics that enhance error detection and reward modeling in domain-specific LLM applications.

Findings

01 · Error classification with rubrics outperforms baselines in technical domains

02 · Rewards based on rubrics improve task accuracy by +45%

03 · Method reduces gold label requirements to 20% of traditional needs

Abstract

An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or "rubrics", demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards have the potential to improve models' task…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

- Novel application of automatic error taxonomy extraction to reasoning trace evaluation
- Multiple domains tested (coding, math, chemistry)
- Potential to reduce annotation costs in specialized domains

Weaknesses

- Rubric generation requires Claude 3.5 Sonnet (closed-source); no experiments with open-source alternatives
- NuminaMath only evaluated on 100/350 validation problems
- How do we know generated rubrics are comprehensive and not redundant?

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This paper addresses a key limitation of LLMs: their difficulty in reliably identifying errors in complex reasoning traces, especially in expert domains (like coding or math) and on problems without simple verifiable answers.
2. The method is shown to be effective. When LLM judges were augmented with the automatically generated rubrics, their ability to correctly identify incorrect reasoning traces (Specificity) improved dramatically, for example, from 12.2% to 63.4% on SWE-Bench and 16.1% to
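Specificity, as the reviewer uses the term above, is the share of truly incorrect traces that the judge also marks as incorrect. The following is a minimal sketch of that calculation; the function name and the toy labels are assumptions made for illustration and are not taken from the paper.

```python
def specificity(is_correct, judged_correct):
    """Fraction of truly incorrect traces that the judge also marks incorrect.

    is_correct:     gold labels, True if the trace is actually correct
    judged_correct: judge verdicts, True if the judge accepted the trace
    """
    # Keep only the traces whose gold label is "incorrect".
    incorrect = [pred for gold, pred in zip(is_correct, judged_correct) if not gold]
    if not incorrect:
        raise ValueError("no incorrect traces to evaluate")
    # A trace is "caught" when the judge rejected it.
    caught = sum(1 for pred in incorrect if not pred)
    return caught / len(incorrect)


# Toy labels, invented for illustration only (not data from the paper).
gold = [True, False, False, True, False, False]
pred = [True, True, False, True, False, True]
print(specificity(gold, pred))  # 2 of the 4 incorrect traces are caught: 0.5
```

A judge that accepts almost every trace scores near zero on this metric even if its overall accuracy looks reasonable, which is why the review reports it separately.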

Weaknesses

1. A significant limitation highlighted in the appendix is the classifier's poor performance on the training set itself. The authors note that the low specificity scores indicate that the classifier is unable to re-identify these errors it was trained on without the ground truth answers being provided for guidance.
2. The ablation studies show that a larger rubric is not always better. For the coding domain, the smallest rubric size ($n=25$) actually outperformed most of the larger rubrics.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper targets a critical bottleneck in training more capable reasoning models: the difficulty and cost of creating reliable reward signals. The proposed solution is intuitive. Instead of asking an LLM to abstractly "grade" a complex trace, the authors use an approach that equips the LLM with a concrete "failure checklist" (the rubric). This reframes an abstract evaluation problem into a more constrained and verifiable classification task.
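The "failure checklist" idea can be sketched as a loop that checks a trace against each rubric entry and turns the result into a reward. Everything below is a hypothetical illustration: the rubric entries, the function names, the keyword-matching stand-in for an LLM judge call, and the binary reward are assumptions, not the authors' rubric or reward formulation.

```python
from typing import Callable, Dict, List

# Hypothetical rubric: error-category name -> short description.
RUBRIC: Dict[str, str] = {
    "sign_error": "A sign is flipped during algebraic manipulation.",
    "unit_mismatch": "Quantities with incompatible units are combined.",
    "unsupported_claim": "A step asserts a result without justification.",
}

# A classifier answers: does this trace exhibit this error category?
Classifier = Callable[[str, str, str], bool]


def keyword_classifier(trace: str, name: str, description: str) -> bool:
    """Toy stand-in for an LLM judge call: matches the category name as text."""
    return name.replace("_", " ") in trace.lower()


def flag_errors(trace: str, classify: Classifier,
                rubric: Dict[str, str] = RUBRIC) -> List[str]:
    """Return the rubric categories that the classifier flags for this trace."""
    return [name for name, desc in rubric.items() if classify(trace, name, desc)]


def rubric_reward(trace: str, classify: Classifier,
                  rubric: Dict[str, str] = RUBRIC) -> float:
    """Assumed binary reward: 1.0 if no category is flagged, otherwise 0.0."""
    return 0.0 if flag_errors(trace, classify, rubric) else 1.0


bad_trace = "Step 2 contains a sign error when moving x across the equality."
good_trace = "Each step follows from the previous one with consistent units."
print(flag_errors(bad_trace, keyword_classifier))
print(rubric_reward(bad_trace, keyword_classifier))
print(rubric_reward(good_trace, keyword_classifier))
```

In practice `keyword_classifier` would be replaced by a call to a judge model; passing the classifier in as a parameter keeps the checklist logic independent of that choice.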

Weaknesses

The method’s cost-effectiveness remains insufficiently demonstrated. While Figure 6 indicates that the per-step runtime is comparable to the baseline, the baseline itself, which relies on a single LLM-judge call, already constitutes a major computational bottleneck in RL. A more detailed analysis of token usage and compute overhead between the two-pass rubric judge and the one-pass baseline would be needed to substantiate the claimed efficiency advantage. The method shows limited generalizability bey

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · AI-based Problem Solving and Planning