TL;DR
This paper introduces Graph-Regularized Sparse Autoencoders (GSAE), a novel method for steering large language models towards safety by learning distributed activation directions that improve refusal of harmful requests.
Contribution
GSAE incorporates graph-based regularization into sparse autoencoders to enhance safety steering in LLMs, outperforming existing methods and generalizing across multiple models.
Findings
GSAE increases harmful-request refusal rates on JailbreakBench, HarmBench, and XSTest.
GSAE maintains benign-task performance.
GSAE generalizes across different LLM architectures.
Abstract
Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves…
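The graph-smoothness idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the edge weights (absolute Pearson correlation), the unnormalized Laplacian, the `threshold` parameter, and the function names `coactivation_laplacian`, `laplacian_penalty`, and `pairwise_penalty` are all assumptions made for illustration. The sketch builds a neuron co-activation graph from activations and evaluates the standard Laplacian smoothness penalty $\mathrm{tr}(W^\top L W)$ on a decoder matrix $W$, which equals half the edge-weighted sum of squared differences between the decoder rows of connected neurons.

```python
import numpy as np

def coactivation_laplacian(acts, threshold=0.0):
    """Build an unnormalized graph Laplacian L = D - A from activations.

    acts: array of shape (n_samples, n_neurons).
    Edge weights are absolute Pearson correlations between neurons
    (an assumed choice; the paper's graph construction may differ).
    """
    corr = np.corrcoef(acts, rowvar=False)   # (n_neurons, n_neurons)
    adj = np.abs(corr)
    adj[adj < threshold] = 0.0               # optionally drop weak edges
    np.fill_diagonal(adj, 0.0)               # no self-loops
    degree = np.diag(adj.sum(axis=1))
    return degree - adj, adj

def laplacian_penalty(decoder, lap):
    """Smoothness penalty tr(W^T L W); decoder W is (n_neurons, n_features)."""
    return float(np.trace(decoder.T @ lap @ decoder))

def pairwise_penalty(decoder, adj):
    """Equivalent form: 0.5 * sum_ij A_ij * ||W_i - W_j||^2 over decoder rows."""
    diffs = decoder[:, None, :] - decoder[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    return float(0.5 * (adj * sq_dist).sum())

# Toy data: 200 samples of 6 neurons, a decoder with 4 latent features.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 6))
lap, adj = coactivation_laplacian(acts)
decoder = rng.normal(size=(6, 4))

penalty = laplacian_penalty(decoder, lap)
pairwise = pairwise_penalty(decoder, adj)
constant_penalty = laplacian_penalty(np.ones((6, 4)), lap)
```

In a training loop this penalty would be added to the usual reconstruction and sparsity terms with a weighting coefficient. A decoder whose rows are identical across neurons incurs zero penalty, which is why the term favors directions that vary smoothly over strongly co-activating neurons.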
Peer Reviews
Decision · ICLR 2026 Conference Desk Rejected Submission
Novel approach: The application of graph Laplacian regularization to SAEs for safety is creative. The insight that safety is distributed rather than monosemantic is well articulated and supported by recent literature.
Comprehensive empirical evaluation: The authors test across multiple model families (LLaMA-3, Mistral, Qwen, Phi) and evaluate against diverse jailbreak attacks (GCG, AutoDAN, TAP). The evaluation includes both safety and utility benchmarks.
**Major Weaknesses**
1. Insufficient Evidence for Claims. The paper's central hypothesis, that safety requires distributed representations, lacks direct empirical support:
   * Figure 4 shows overlapping spectral projections but doesn't definitively prove that distributed representations are necessary for safety.
   * The evidence comes from analogy to temporal/refusal behavior studies, not direct investigation of safety concepts.
   * An experiment comparing GSAE against methods that explicitly enforce
1. Formulates safety representation learning as a graph-signal smoothness problem by integrating graph Laplacian regularization into the SAE framework; the formulation is conceptually clear and technically novel.
2. Provides broad empirical validation across multiple models and attack types, comparing with several strong baselines.
1. Does not explicitly model or measure cumulative drift caused by multi-layer steering, leaving the potential interaction between layers unaddressed.
2. The three-stage feature selection pipeline is largely heuristic and requires multiple sub-trainings and hyperparameter tuning, which may hinder reproducibility.
3. The safety-utility evaluation remains coarse-grained, lacking fine-grained analysis of false refusals or real dialogue impact.
- The proposed method appears to be unsupervised (though it may use labels for tuning the thresholds $t_\text{hi}, t_\text{lo}$).
- The experimental results are strong and support the claims well.
- The paper lacks clarity in several aspects:
  - The pooled representation is denoted as $\bar{h}^{(l)}$ or $H$, but is referred to as $x$ in L256–258.
  - In L289–296, could the authors clarify how $s_i^\text{lap}$ and $s_i^\text{infl}$ are computed?
  - In L310–315, thresholds $t_\text{hi}$ and $t_\text{lo}$ are said to be selected via a “systematic sensitivity analysis.” Is this analysis conducted using training or test labels?
  - In L311, what is the function $g$ used to compute $p_\text{h
1. **Well-Motivated Problem.** The paper is built on the strong and timely hypothesis that abstract concepts like safety are fundamentally distributed. This provides a principled explanation for why standard SAEs, which are optimized for monosemanticity, may be ill-suited for this particular control task.
2. **Principled Methodology (GSAE).** This paper introduces a new application of a graph Laplacian smoothness prior to the SAE's decoder weights ($W^{(d)}$). This adaptation of a technique from
1. **Compositional Novelty.** The paper's core contribution is a novel integration of existing techniques (Sparse Autoencoders, Laplacian regularization, and gating) rather than the invention of a fundamentally new algorithm. While this composition is highly effective and achieves state-of-the-art results, the methodological leap itself could be viewed as incremental.
2. **Hyperparameter Complexity.** The full framework introduces a large number of new hyperparameters, including $\lambda_{graph}
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
