Graph-Regularized Sparse Autoencoders for LLM Safety Steering

Jehyeok Yeon; Federico Cinus; Yifan Wu; Luca Luceri

arXiv:2512.06655 · cs.LG · May 18, 2026

TL;DR

This paper introduces Graph-Regularized Sparse Autoencoders (GSAE), a novel method for steering large language models towards safety by learning distributed activation directions that improve refusal of harmful requests.

Contribution

GSAE incorporates graph-based regularization into sparse autoencoders to enhance safety steering in LLMs, outperforming existing methods and generalizing across multiple models.

Findings

01. GSAE increases harmful-request refusal rates significantly.

02. GSAE maintains benign-task performance.

03. GSAE generalizes across different LLM architectures.

Abstract

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves…
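
The abstract describes smoothing SAE decoder vectors over a neuron co-activation graph. A minimal NumPy sketch of what such a Laplacian smoothness penalty could look like follows. The graph construction (absolute correlation with top-k neighbours), the ReLU encoder, the loss weights, and every function name here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def coactivation_laplacian(acts, top_k=4):
    """Build an (assumed) co-activation graph from activations.

    acts: array of shape (n_samples, n_neurons).
    Returns the symmetric adjacency A and the Laplacian L = D - A.
    """
    corr = np.abs(np.corrcoef(acts, rowvar=False))
    np.fill_diagonal(corr, 0.0)  # no self-loops
    n = corr.shape[0]
    adj = np.zeros_like(corr)
    # Keep the top_k most strongly co-activating neighbours of each neuron.
    idx = np.argsort(-corr, axis=1)[:, :top_k]
    rows = np.repeat(np.arange(n), top_k)
    adj[rows, idx.ravel()] = corr[rows, idx.ravel()]
    adj = np.maximum(adj, adj.T)  # symmetrise
    lap = np.diag(adj.sum(axis=1)) - adj
    return adj, lap


def laplacian_penalty(decoder, lap):
    """Smoothness of decoder vectors over the graph: tr(W^T L W).

    decoder: array of shape (n_neurons, n_features); each column is one
    dictionary direction, treated as a signal over the neuron graph.
    """
    return float(np.trace(decoder.T @ lap @ decoder))


def gsae_style_loss(x, encoder, decoder, lap, l1=1e-3, lam_graph=1e-2):
    """Reconstruction + sparsity + graph smoothness (hypothetical weights)."""
    z = np.maximum(x @ encoder, 0.0)  # ReLU codes, (n_samples, n_features)
    recon = z @ decoder.T
    mse = np.mean((x - recon) ** 2)
    sparsity = np.mean(np.abs(z))
    return float(mse + l1 * sparsity + lam_graph * laplacian_penalty(decoder, lap))
```

The penalty equals half the adjacency-weighted sum of squared differences between the decoder rows of connected neurons, so it is zero when a direction is constant across the graph and grows as strongly co-activating neurons receive very different weights. Here `lam_graph` plays the role of the graph-regularization weight that the reviews refer to as a hyperparameter.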

Peer Reviews

Decision: ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Novel approach: The application of graph Laplacian regularization to SAEs for safety is creative. The insight that safety is distributed rather than monosemantic is well-articulated and supported by recent literature.
- Comprehensive empirical evaluation: The authors test across multiple model families (LLaMA-3, Mistral, Qwen, Phi) and evaluate against diverse jailbreak attacks (GCG, AutoDAN, TAP). The evaluation includes both safety and utility benchmarks.

Weaknesses

**Major Weaknesses**

1. Insufficient Evidence for Claims

The paper's central hypothesis, that safety requires distributed representations, lacks direct empirical support:

* Figure 4 shows overlapping spectral projections but doesn't definitively prove that distributed representations are necessary for safety.
* The evidence comes from analogy to temporal/refusal behavior studies, not direct investigation of safety concepts.
* An experiment comparing GSAE against methods that explicitly enforce

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Formulating safety representation learning as a graph-signal smoothness problem, by integrating graph Laplacian regularization into the SAE framework, is conceptually clear and technically novel.
2. Provides broad empirical validation across multiple models and attack types, comparing with several strong baselines.

Weaknesses

1. Does not explicitly model or measure cumulative drift caused by multi-layer steering, leaving the potential interaction between layers unaddressed.
2. The three-stage feature selection pipeline is largely heuristic and requires multiple sub-trainings and hyperparameter tuning, which may hinder reproducibility.
3. The Safety–Utility evaluation remains coarse-grained, lacking fine-grained analysis of false refusals or real dialogue impact.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The proposed method appears to be unsupervised (though it may use labels for tuning the thresholds $t_\text{hi}, t_\text{lo}$).
- The experimental results are strong and support the claims well.

Weaknesses

- The paper lacks clarity in several aspects:
  - The pooled representation is denoted as $\bar{h}^{(l)}$ or $H$, but is referred to as $x$ in L256–258.
  - In L289–296, could the authors clarify how $s_i^\text{lap}$ and $s_i^\text{infl}$ are computed?
  - In L310–315, thresholds $t_\text{hi}$ and $t_\text{lo}$ are said to be selected via a “systematic sensitivity analysis.” Is this analysis conducted using training or test labels?
  - In L311, what is the function $g$ used to compute $p_\text{h

Reviewer 04 · Rating 4 · Confidence 3

Strengths

1. **Well-Motivated Problem.** The paper is built on the strong and timely hypothesis that abstract concepts like safety are fundamentally distributed. This provides a principled explanation for why standard SAEs, which are optimized for monosemanticity, may be ill-suited for this particular control task.
2. **Principled Methodology (GSAE).** This paper introduces a new application of a graph Laplacian smoothness prior to the SAE's decoder weights ($W^{(d)}$). This adaptation of a technique from

Weaknesses

1. **Compositional Novelty.** The paper's core contribution is a novel integration of existing techniques (Sparse Autoencoders, Laplacian regularization, and gating) rather than the invention of a fundamentally new algorithm. While this composition is highly effective and achieves state-of-the-art results, the methodological leap itself could be viewed as incremental.
2. **Hyperparameter Complexity.** The full framework introduces a large number of new hyperparameters, including $\lambda_{graph}

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks