reward-lens: A Mechanistic Interpretability Library for Reward Models

Mohammed Suhail B Nadaf

arXiv:2604.26130·cs.LG·April 30, 2026

TL;DR

reward-lens is an open-source library that ports mechanistic interpretability tools to reward models. It provides tools to analyze individual reward models, compare them with one another, and expose where observational and causal interpretability methods disagree.

Contribution

The library ports established interpretability tools to reward models under a unified framework built on the reward head's weight vector $w_{r}$, and adds five theory-grounded extensions.

Findings

01 Linear attribution does not predict causal patching effects.

02 The framework exposes disagreements between observational and causal interpretability.

03 Validated on two production reward models with ~695 RewardBench pairs.
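
One general way a linear attribution can fail to predict a patching effect is an indirect path: a component writes little along the reward direction itself, but a downstream component reads its output nonlinearly. The toy below is a hypothetical two-component model made up for illustration. It is not the paper's experiment, and the paper may attribute the disagreement to other causes. The names `forward`, `W_R`, and all numbers are assumptions.

```python
# Toy illustration (hypothetical model, not the paper's setup):
# component A writes to dimension 0; component B reads dimension 0
# through a ReLU and writes to dimension 1. Reward = W_R . (a + b).

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def relu(x):
    return max(0.0, x)

W_R = [0.1, 1.0]  # assumed reward-head weights for this toy

def forward(x, a_override=None):
    """Run the toy model; optionally patch in a replacement output for A."""
    a = [x, 0.0] if a_override is None else list(a_override)
    b = [0.0, 3.0 * relu(a[0])]          # B depends on A's output
    residual = [a[0] + b[0], a[1] + b[1]]
    return dot(residual, W_R), a, b

clean_reward, a_clean, _ = forward(2.0)
_, a_corrupt, _ = forward(-1.0)

# Observational view: A's direct contribution along the reward direction.
direct_attr_a = dot(a_clean, W_R)

# Causal view: replace A's output with its value from the corrupted run
# and measure how much the reward drops.
patched_reward, _, _ = forward(2.0, a_override=a_corrupt)
patch_effect_a = clean_reward - patched_reward

print(f"direct attribution of A: {direct_attr_a:.2f}")
print(f"patching effect of A:    {patch_effect_a:.2f}")
```

In this toy, A's direct attribution is small while patching A removes almost all of the reward, because most of A's influence flows through B.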

Abstract

Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, activation patching, sparse autoencoders -- was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector $w_{r}$ is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis,…
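
The central observation, that $w_{r}$ serves as the readout axis, can be sketched in a few lines. The example below is a minimal pure-Python illustration under simplifying assumptions: the residual stream is a plain sum of component outputs, there is no final normalisation layer, and the head is a single linear map. The function `reward_lens`, the component names, and all numbers are hypothetical and are not the library's API.

```python
# Minimal sketch of a lens and direct attribution for a scalar reward head.
# Assumptions: additive residual stream, no final layer norm, linear head.
# Names and values are hypothetical, not taken from the reward-lens library.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def add(u, v):
    return [x + y for x, y in zip(u, v)]

def reward_lens(embedding, component_outputs, w_r, bias=0.0):
    """Project the running residual stream onto w_r after each component.

    Returns (readings, attributions):
      readings[i]   reward read off the residual stream after i components
      attributions  each component's own output projected onto w_r
    """
    residual = list(embedding)
    readings = [dot(residual, w_r) + bias]
    attributions = {"embedding": dot(embedding, w_r)}
    for name, out in component_outputs:
        residual = add(residual, out)
        readings.append(dot(residual, w_r) + bias)
        attributions[name] = dot(out, w_r)
    return readings, attributions

w_r = [1.0, -2.0, 0.5]
embedding = [0.5, 0.0, 1.0]
components = [
    ("attn_0", [1.0, 0.5, 0.0]),
    ("mlp_0", [0.0, -1.0, 2.0]),
]

readings, attributions = reward_lens(embedding, components, w_r)
print(readings)
print(attributions)
```

Under these assumptions the per-component attributions sum exactly to the final reading. With a normalisation layer before the head, that decomposition holds only approximately.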
