SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Wei Shi; Ziyuan Xie; Sihang Li; Xiang Wang

arXiv:2507.00665·cs.CL·February 2, 2026

TL;DR

SAFER introduces a novel method using sparse autoencoders to interpret, audit, and improve safety alignment in reward models for large language models, enabling targeted safety modifications without affecting overall performance.

Contribution

The paper presents SAFER, a framework that uses sparse autoencoders to uncover interpretable safety-related features in reward models, facilitating targeted safety improvements.

Findings

01

SAFER can precisely degrade safety alignment with minimal data.

02

SAFER can enhance safety alignment without harming general performance.

03

The approach enables interpretability and targeted safety interventions in reward models.

Abstract

Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present Sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data…
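
To make the abstract's "activation differences between chosen and rejected responses" concrete, here is a minimal sketch of one way such a salience score could be computed. It is not the paper's implementation: the ReLU encoder form, the mean-difference score, and all names (`sae_encode`, `feature_salience`, `top_k_features`, `W_enc`, `b_enc`) are assumptions made for illustration, and the toy vectors stand in for real reward model activations.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sae_encode(x: Vector, W_enc: Matrix, b_enc: Vector) -> Vector:
    """Assumed SAE encoder: feature_j = ReLU(sum_i x[i] * W_enc[i][j] + b_enc[j])."""
    return [
        max(0.0, sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j])
        for j in range(len(b_enc))
    ]


def mean_features(batch: Matrix, W_enc: Matrix, b_enc: Vector) -> Vector:
    """Average SAE feature activations over a batch of activation vectors."""
    encoded = [sae_encode(x, W_enc, b_enc) for x in batch]
    return [sum(row[j] for row in encoded) / len(encoded) for j in range(len(b_enc))]


def feature_salience(chosen: Matrix, rejected: Matrix,
                     W_enc: Matrix, b_enc: Vector) -> Vector:
    """Hypothetical salience: mean activation on chosen minus mean on rejected."""
    mean_chosen = mean_features(chosen, W_enc, b_enc)
    mean_rejected = mean_features(rejected, W_enc, b_enc)
    return [c - r for c, r in zip(mean_chosen, mean_rejected)]


def top_k_features(salience: Vector, k: int) -> List[int]:
    """Indices of the k features with the largest absolute salience."""
    order = sorted(range(len(salience)), key=lambda j: abs(salience[j]), reverse=True)
    return order[:k]


# Toy data: 2-dimensional "activations" and a 3-feature dictionary.
W_enc = [[1.0, 0.0, 1.0],
         [0.0, 1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
chosen = [[2.0, 0.0], [4.0, 0.0]]
rejected = [[0.0, 1.0], [0.0, 3.0]]

salience = feature_salience(chosen, rejected, W_enc, b_enc)
candidates = top_k_features(salience, k=2)
```

In this sketch, features with a large positive score fire more on chosen responses and features with a large negative score fire more on rejected ones; per the reviews below, the paper then passes candidate features through a further GPT-4o relevance rating step that this example does not model.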

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The proposed method (SAFER) is a precise and targeted intervention. Figure 1 and Table 2 show that SAFER's poisoning method causes a sharp decline in safety scores while leaving chat performance almost untouched.
2. SAFER can also work as a filtering method. In the denoising experiments, it shows an improvement in RM safety from 94.86 to 96.46 on the 3B model by removing just 4% of data.
3. Ablation experiments are sufficient; having a separate discussion on token-level and sequence-level is

Weaknesses

1. A significant methodological weakness is the reliance on the proprietary, black-box GPT-4o to filter the features. The framework first uses a contrastive score to find candidate features, but then "use GPT-4o to interpret and assign safety relevance ratings" and only retains features with a perfect score of 5. Although the authors validate human-GPT-4o alignment (Figure 5), this validation covers only their specific task and does not address the dependency.
2. The effect of removing denoising data does not

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Innovatively using the classical mechanistic interpretability tool, the SAE, to study the reward model provides a completely new, more fundamental perspective for understanding the black box of RLHF, which is highly enlightening.
2. This paper does not stop at explanation but also conducts intervention experiments. In these experiments, data poisoning significantly reduces safety performance while leaving general conversational ability almost unaffected. This demonstrates that SAFER indeed captures sp

Weaknesses

1. Only two smaller reward models (1B and 3B Llama-3.2-RM) were used, making it difficult to determine whether the proposed method generalizes to other model architectures and model sizes.
2. SAFER demonstrates a strong correlation between certain features and safety decisions. However, this remains at the observational level. The paper did not conduct causal intervention experiments, such as directly modifying the activation values of specific safety features through feature s

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Mechanistic insight into reward models. Applies SAEs to reveal safety-relevant latent features, addressing a major interpretability gap in RLHF.
2. Targeted control of safety alignment. Demonstrates that feature-guided poisoning can selectively degrade safety without harming general chat performance.
3. Solid empirical setup. Uses both LLaMA-3.2-1B and 3B reward models, with careful ablations on layer choice, dictionary size, and sparsity.
4. Readable and reproducible. The methodology, hyperp

Weaknesses

1. Limited novelty. The paper mainly applies existing SAE methods to reward models; the core algorithmic contribution is modest.
2. Reliance on synthetic safety evaluation. “Safety” features and dataset manipulations are defined through model-generated or GPT-4o-labeled judgments rather than verified human annotation.
3. No causal interpretability. SAFER identifies correlational feature activations but does not test causal steering (e.g., changing activations to modify reward outcomes).
4. Narro

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Recommender Systems and Techniques