Robust Reward Modeling via Causal Rubrics
Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup

TL;DR
Crome introduces a causal framework with targeted augmentations to improve reward models' robustness against superficial cues, leading to better alignment of LLMs with true quality metrics.
Contribution
The paper presents Crome, a novel causal reward modeling framework that uses synthetic augmentations to disentangle causal and spurious factors, reducing reward hacking.
Findings
Crome outperforms standard baselines on RewardBench by up to 5.4% accuracy.
Achieves improvements of up to 13.2% and 7.2% in specific RewardBench categories.
Demonstrates consistent gains across various benchmarks and inference settings.
Abstract
Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch on to superficial or spurious attributes, such as response length or formatting, mistaking these cues learned from correlations in training data for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce Crome (Causally Robust Reward Modeling), a novel framework grounded in an explicit causal model designed to mitigate reward hacking. Crome employs the following synthetic targeted augmentations during training: (1) Causal Augmentations, which are pairs that differ along specific causal attributes, to enforce sensitivity along each causal attribute individually, and (2)…
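As a rough illustration of item (1) above, the sketch below scores a causal augmentation pair with a pairwise Bradley-Terry loss. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify the training objective, so the Bradley-Terry form is assumed here, and `toy_reward`, the `length` and `factuality` fields, and all numeric values are hypothetical stand-ins.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(reward_chosen - reward_rejected)).

    Assumption: a standard Bradley-Terry preference objective; the abstract
    does not state which loss Crome actually uses.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def toy_reward(response: dict) -> float:
    """Hypothetical reward model with a spurious length bias (illustration only)."""
    return 0.01 * response["length"] + 1.0 * response["factuality"]


# Hypothetical causal augmentation pair: the two responses differ only along
# one causal attribute (factuality); the spurious attribute (length) is fixed.
original = {"length": 50, "factuality": 1.0}
degraded = {"length": 50, "factuality": 0.0}

# The original response is treated as preferred over its degraded counterpart.
causal_pair_loss = bradley_terry_loss(toy_reward(original), toy_reward(degraded))
```

Because length is held constant within the pair, the loss can only be lowered by widening the reward gap along the causal attribute, which is the sensitivity the abstract describes for causal augmentations.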
Peer Reviews
Decision: ICLR 2026 Poster
The paper tackles the important problem of improving preference reward models’ robustness and sensitivity towards spurious correlations. Their synthetic data generation approach addresses both how aware reward models are of real/important attributes in the completions and how invariant they are towards spurious attributes. They performed a large set of experiments to showcase the effectiveness of their approach, such as evaluating their model on RewardBench and reWordBench, which is more advers
- Clarity: The paper’s writing could be improved and certain experimental design choices could be explained more clearly. Additionally, the paper’s main text very often discusses results from tables in the appendix, which makes it harder to read. Some examples where clarity could be improved are: It is never specified what reward model is being trained, at what size (same for the baselines).
- Analysis of the data: The paper is missing some details on the synthetic data that you generate, such
This paper's experiments are carefully designed and focused on an active area of work, with reasonable baselines and good results on popular benchmarks. Their work defines a clear methodology for using synthetic data generation methods to create augmentations to preference data, creating more robust and effective reward models. They also evaluate their results on best-of-n rankings for popular benchmarks, which has been shown to be even better correlated with downstream performance after performing onl
Some of the mathematical notation feels unnecessary, and I feel like it obfuscates the (reasonable) points being made at times. E.g. much of the mathematical description in sections 3 and 4 could instead be turned into natural prose, which would be more readily understandable to people less familiar with reward modeling, etc. The points your paper is making are good, but it can be hard to fully parse the paper at times.
The core idea is elegant and generally clearly explained. The ideas behind Crome and the specific setup – augmenting data while focusing on causal/spurious attributes, appear rather novel and well motivated. The experiments are extensive and present strong empirical evidence that the approach is more effective than the baselines selected, across a range of settings and diverse benchmarks. The paper is quite thorough in using extant base models and benchmarks to evaluate whether the approach
The main paper does not include a single example. The formal notation is appreciated, but the intuition would be a lot easier to grasp with an accompanying example. (E.g., I got hung up on the phrase “flipping the question” in one of the figures. Awkward nomenclature would matter a lot less if the reader saw a simple illustration of the idea.) Crome relies heavily on the oracle LM. The central assumption that language models can generate reliable causal rubrics is worth evaluating empiricall
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
