SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks
Vegard Flovik

TL;DR
SALVE introduces a novel autoencoder-based framework that enables interpretable feature discovery and precise, permanent model editing in neural networks, enhancing transparency and control.
Contribution
The paper presents SALVE, a unified framework that combines sparse autoencoder feature learning with model editing, bridging interpretability and control in neural networks.
Findings
Validated on ResNet-18 and ViT-B/16 models
Achieved interpretable control over model behavior
Provided a method for precise weight-space interventions
Abstract
Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified "discover, validate, and control" framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder's structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_\mathrm{crit}$, quantifying each class's reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is…
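The sparse feature-learning step mentioned in the abstract can be illustrated with a minimal NumPy sketch of an L1-regularized autoencoder objective. This is an illustration under stated assumptions, not the paper's implementation: the single ReLU encoder layer, the linear decoder, the function name `sae_loss`, the coefficient `l1_coeff`, and all array shapes are hypothetical choices made here for clarity.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on the latent code.

    Assumed (hypothetical) architecture: one ReLU encoder layer and a
    linear decoder. x has shape (batch, d_model); the latent code z has
    shape (batch, d_latent).
    """
    # Encode: non-negative latent activations.
    z = np.maximum(x @ W_enc + b_enc, 0.0)
    # Decode: linear reconstruction of the input activations.
    x_hat = z @ W_dec + b_dec
    # Mean squared reconstruction error, summed over feature dimensions.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # L1 penalty pushes most latent units to zero for each input.
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity, z

# Toy usage with random stand-ins for network activations.
rng = np.random.default_rng(0)
d_model, d_latent = 16, 32
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_enc = np.zeros(d_latent)
b_dec = np.zeros(d_model)

loss, z = sae_loss(x, W_enc, b_enc, W_dec, b_dec)
```

In this sketch, raising `l1_coeff` trades reconstruction accuracy for sparser codes; how the paper sets that trade-off is not stated in the text above.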
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper presents a methodologically sound framework with experimental validation across different model architectures.
- The proposed method effectively enables permanent weight-space interventions by directly modifying model weights guided by discovered latent features.
- While the paper proposes a generic framework integrating sparse autoencoders with weight-space interventions, the individual technical components are largely adaptations of existing methods.
- The comparison with activation steering methods is mentioned but not thoroughly explored empirically. Please see questions below for details.
* Originality: Clear, cohesive pipeline that connects unsupervised SAE features to weight-space edits, not only inference-time steering. The $\alpha_\mathrm{crit}$ metric offers a concrete knob to quantify reliance per sample, which is useful for diagnostics.
* Quality: Solid derivation for the analytic approximation of $\alpha_\mathrm{crit}$ with a numerical check; careful discussion about when the linear approximation is reasonable for ResNet versus ViT.
* Clarity: Method is easy to follow, with
* Only Imagenette is used and only two backbones are tested. There is no evaluation on harder datasets, no distribution shift stress tests, no adversarial or corruption benchmarks, and no human studies for interpretability quality. The ROME comparison is minimal and customized to the final layer rather than the standard internal-layer setting.
* The discovery component is a standard linear SAE with L1 sparsity. Grad-FAM adapts Grad-CAM to a latent feature target, which is straightforward. The con
1. The paper introduces an interpretable model editing method to directly adapt model behavior.
2. Intervention sensitivity analysis establishes a quantitative suppression threshold to calibrate model edits.
1. Using SAEs to identify concepts and highlight relevant image regions has been established, such as [1]. Similarly, suppressing model components to adapt model behavior has been explored in works like [2]. Therefore, the technical novelty of this work seems limited.
2. More extensive comparisons with recent model editing or concept-based intervention methods (e.g., [3]) are needed to better demonstrate potential advantages.
3. Experiments are conducted on the relatively small-scale Imagenett
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
