RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals

David Reber; Sean Richardson; Todd Nief; Cristina Garbacea; Victor Veitch

arXiv:2410.11348·cs.CL·May 21, 2025

TL;DR

This paper introduces RATE, a method that uses large language models to measure the causal influence of response attributes on reward models, addressing biases from imperfect counterfactual rewrites.

Contribution

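RATE adjusts for the imperfections of LLM-generated rewrites by rewriting each response twice, so that the estimated effect of an attribute on the reward reflects the attribute itself rather than side effects of the rewriting process.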

Findings

01 RATE effectively measures attribute sensitivity in reward models.

02 The method reduces bias caused by imperfect counterfactual rewrites.

03 Empirical results validate the accuracy of RATE in various scenarios.

Abstract

Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are actually rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses to produce imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We…
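
The abstract is cut off before it spells out the estimator, so the following is only a minimal sketch of the rewrite-twice idea under stated assumptions, not the authors' implementation. It assumes a hypothetical `rewrite(text, target)` function that sets a binary attribute to `target`, and a hypothetical `reward(text)` scorer; neither name comes from the paper or its repository. The toy reward and rewriter are invented so that the rewriter has a side effect (it also fixes the typo "teh"), which is the kind of imperfection the abstract describes.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical signatures for illustration only.
Reward = Callable[[str], float]
Rewrite = Callable[[str, int], str]  # rewrite(text, target): set attribute to 0 or 1


def naive_effect(treated: Sequence[str], reward: Reward, rewrite: Rewrite) -> float:
    """Single rewrite: compare each original with its attribute-off rewrite."""
    return mean(reward(x) - reward(rewrite(x, 0)) for x in treated)


def double_rewrite_effect(treated: Sequence[str], reward: Reward, rewrite: Rewrite) -> float:
    """Rewrite twice: both sides of the comparison have passed through the rewriter."""
    diffs = []
    for x in treated:
        off = rewrite(x, 0)         # attribute removed, plus any rewriter side effects
        on_again = rewrite(off, 1)  # attribute restored, same side effects present
        diffs.append(reward(on_again) - reward(off))
    return mean(diffs)


# Toy setup (assumption): the attribute is a trailing "!", worth +1.0 reward;
# each "teh" typo costs 0.5. The rewriter toggles "!" but also fixes typos.
def toy_reward(text: str) -> float:
    return (1.0 if text.endswith("!") else 0.0) - 0.5 * text.count("teh")


def toy_rewrite(text: str, target: int) -> str:
    cleaned = text.replace("teh", "the").rstrip("!")
    return cleaned + "!" if target == 1 else cleaned


samples = ["teh cat sat!", "I like teh dog!"]
print(naive_effect(samples, toy_reward, toy_rewrite))           # 0.5
print(double_rewrite_effect(samples, toy_reward, toy_rewrite))  # 1.0
```

In this toy case the attribute is worth 1.0 by construction. The single-rewrite comparison reports 0.5 because the typo fix is mixed into the difference, while the double-rewrite comparison recovers 1.0 because the side effect appears on both sides and cancels.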

Code & Models

Repositories

toddnief/rate (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Data Quality and Management · Topic Modeling

Methods: Counterfactual Explanations