Interpreting Language Reward Models via Contrastive Explanations
Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

TL;DR
This paper introduces a contrastive explanation framework for the reward models (RMs) used to align language models. It aims to improve interpretability and trust by analyzing how RMs respond when specific evaluation attributes of a response are modified.
Contribution
It proposes a novel contrastive explanation method for RMs, enabling detailed analysis of their local behavior and attribute sensitivity, improving transparency.
Findings
Effective in generating high-quality contrastive explanations
Reveals global sensitivity of RMs to evaluation attributes
Automatically extracts representative examples for RM behavior comparison
Abstract
Reward models (RMs) are a crucial component in the alignment of large language models' (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM's local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded.…
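The abstract's core idea, comparing the RM's original preference against its preference on perturbed comparisons, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration and not the authors' implementation: `toy_rm` is a stand-in reward function that simply counts words, `Perturbation` and `classify_perturbations` are hypothetical names, and the perturbed responses are hard-coded, whereas the paper generates them to modify specified attributes. Following common XAI usage, a perturbation that flips the RM's preference is labelled a counterfactual and one that leaves it unchanged a semifactual; the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical RM interface: (prompt, response) -> scalar reward.
RewardFn = Callable[[str, str], float]


@dataclass
class Perturbation:
    attribute: str  # high-level attribute the edit targets (illustrative)
    target: str     # which response was edited: "chosen" or "rejected"
    response: str   # the perturbed response text


def prefers_first(rm: RewardFn, prompt: str, a: str, b: str) -> bool:
    """True if the RM scores response a strictly above response b."""
    return rm(prompt, a) > rm(prompt, b)


def classify_perturbations(
    rm: RewardFn,
    prompt: str,
    chosen: str,
    rejected: str,
    perturbations: List[Perturbation],
) -> Dict[str, List[Perturbation]]:
    """Split perturbations by whether they flip the RM's original preference."""
    original = prefers_first(rm, prompt, chosen, rejected)
    result: Dict[str, List[Perturbation]] = {"counterfactual": [], "semifactual": []}
    for p in perturbations:
        # Swap in the perturbed text for the response it targets.
        a = p.response if p.target == "chosen" else chosen
        b = p.response if p.target == "rejected" else rejected
        flipped = prefers_first(rm, prompt, a, b) != original
        result["counterfactual" if flipped else "semifactual"].append(p)
    return result


def toy_rm(prompt: str, response: str) -> float:
    """Stand-in RM (assumption): rewards longer responses by word count."""
    return float(len(response.split()))


prompt = "What is the capital of France?"
chosen = "Paris is the capital of France."
rejected = "Paris."
perturbations = [
    Perturbation("verbosity", "rejected",
                 "Paris is the capital city of the country known as France."),
    Perturbation("politeness", "chosen",
                 "Paris is the capital of France, thank you."),
]

result = classify_perturbations(toy_rm, prompt, chosen, rejected, perturbations)
for kind, items in result.items():
    print(kind, [p.attribute for p in items])
```

With a real RM, `toy_rm` would be replaced by a call that scores a prompt and response pair. Counting flips per attribute across many comparisons is one plausible way to approximate the attribute-level sensitivity mentioned in the findings.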
Peer Reviews
Decision: ICLR 2025 Poster
1. A novel prompting-based counterfactual and semifactual example generation method for explaining reward models
2. Experiments do show some merits of the proposed method, such as generally improved CF success rate (CF coverage).
1. The work's main motivation, "More transparent RMs would enable improved trust in the alignment of LLMs," is mostly intuitive and needs more explanations. Even if reward models are totally transparent, how they translate to more transparency in the trained policy model is still not clear, as the policy LLM is still a black box model. Math question training data, as an extreme case, often comes with a more rigid threshold of logical consistency in its data and does not translate to improved rea
I am not very familiar with the XAI literature or its application to reward models, but I have not seen this particular approach before and believe the methodology is novel (but see weaknesses for some requests on related work). I think researchers who work with reward models will generally be interested in ways of explaining the reward model predictions, and so I believe the topic and proposal in this paper meets the significance threshold for a major conference (but see weaknesses re: chosen R
Related work: At lines 179-181, a few related works are mentioned that could alternatively do perturbations or CFs for reward model explanations. These are not expanded upon in the text, and they do not seem related to the baselines at lines 277 - 284. Is this because they do not apply at all, or how do the baselines relate to past work? This casts some doubt on how the present work fits into the literature. Conceptual: It seems the finite attribute list limits the kinds of explanations that t
1. The proposed technique is clean and simple, draws from existing literature on SF/CFs in XAI, and importantly is general enough to work for any RM (L.132-134), making it a useful tool for analysis. This is particularly the case when you might not have access to the training data of some black-box RMs and can yet obtain qualitative insights on their performance.
2. The main contribution is the analysis technique. Section 3 makes a good case that the proposed prompting technique is an effectiv
1. While the experiments generally back up the claims made, I think there is scope to more rigorously test the robustness of the proposed technique. See Q2, 3, and 7 below for specific concerns. Another concern is that the generation method for attributes looks at examples from the test set. This is fine, but it grounds the perturbations to 'expected' ways and doesn't elicit any 'surprising' biases from RMs. This is easily remedied of course, simply by sampling attributes along different ways, b
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
