reward-lens: A Mechanistic Interpretability Library for Reward Models
Mohammed Suhail B Nadaf

TL;DR
reward-lens is an open-source library that ports mechanistic interpretability tools to reward models, so that their internals can be analyzed and compared in terms of the scalar reward head rather than a vocabulary unembedding.
Contribution
It adapts the existing toolkit (logit lens, direct logit attribution, activation patching, sparse autoencoders) to reward models, organizes every tool around one axis, the reward head's weight vector, and adds five theory-grounded extensions.
Findings
Linear attribution does not predict causal patching effects.
The framework exposes disagreements between observational (attribution-based) and causal (patching-based) interpretability methods.
Validated on two production reward models with ~695 RewardBench pairs.
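The first finding can be illustrated with a toy case. The sketch below is not taken from the paper or from the reward-lens API; it is a hand-built two-component example (all names and values are hypothetical) showing one well-known way linear attribution and patching can disagree: a component whose direct write is orthogonal to the reward direction still changes the reward through a downstream nonlinear component.

```python
import numpy as np

# Hypothetical reward direction: the reward reads only the first coordinate.
w_reward = np.array([1.0, 0.0])

# Hypothetical downstream component B: reads A's output, applies a ReLU.
M = np.array([[0.0, 1.0],
              [0.0, 0.0]])

def component_b(a):
    """B's write to the residual stream, as a nonlinear function of A's write."""
    return np.maximum(M @ a, 0.0)

def reward(a):
    """Scalar reward: project the summed residual stream onto w_reward."""
    return float(w_reward @ (a + component_b(a)))

a_clean = np.array([0.0, 1.0])     # A's output on the clean input
a_corrupt = np.array([0.0, -1.0])  # A's output on the corrupted input

# Observational view: A's direct contribution along the reward direction.
linear_attribution = float(w_reward @ a_clean)

# Causal view: swap in A's corrupted output and re-run everything downstream.
patch_effect = reward(a_corrupt) - reward(a_clean)

print(f"linear attribution of A: {linear_attribution}")
print(f"patching effect of A:    {patch_effect}")
```

Here A receives zero linear attribution, yet patching it moves the reward by a full unit, because the attribution only sees A's direct path while patching also captures the indirect path through B.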
Abstract
Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, activation patching, sparse autoencoders -- was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis,…
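The central observation, that the reward head's weight vector can replace the vocabulary unembedding as the projection axis, can be sketched in a few lines. This is a minimal illustration on random stand-in activations, not the library's implementation: the function names, shapes, and values are assumptions, and the final normalization layer that real models apply before the head is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 4

# Hypothetical stand-ins for activations at the final token position.
embed = rng.normal(size=d_model)                    # embedding write
layer_outputs = rng.normal(size=(n_layers, d_model))  # each layer's write
w_reward = rng.normal(size=d_model)                 # reward head weight vector
b_reward = 0.1                                      # reward head bias

# Residual stream after each layer: embedding plus the running sum of writes.
residual = embed + np.cumsum(layer_outputs, axis=0)

def reward_lens(residual, w, b):
    """Apply the reward head to every intermediate residual state."""
    return residual @ w + b

def component_attribution(embed, layer_outputs, w):
    """Project each component's write onto the reward direction."""
    return np.concatenate([[embed @ w], layer_outputs @ w])

lens = reward_lens(residual, w_reward, b_reward)
attr = component_attribution(embed, layer_outputs, w_reward)

for i, r in enumerate(lens):
    print(f"layer {i}: lens reward = {r:+.3f}, layer write = {attr[i + 1]:+.3f}")
```

Because the head is linear, the per-component attributions plus the bias sum exactly to the final lens value in this simplified setting; with a nonlinear final normalization that decomposition only holds approximately.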
