SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

Vegard Flovik

arXiv:2512.15938·cs.LG·March 10, 2026

Open Access 3 Reviews

TL;DR

SALVE introduces a novel autoencoder-based framework that enables interpretable feature discovery and precise, permanent model editing in neural networks, enhancing transparency and control.

Contribution

The paper presents SALVE, a unified framework that combines sparse autoencoder feature learning with model editing, bridging interpretability and control in neural networks.

Findings

01

Validated on ResNet-18 and ViT-B/16 models

02

Achieved interpretable control over model behavior

03

Provided a method for precise weight-space interventions

Abstract

Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified "discover, validate, and control" framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder's structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_\mathrm{crit}$, quantifying each class's reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is…
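As a rough illustration of the "discover" step, the sketch below computes an $\ell_1$-regularized autoencoder objective on a batch of activations. It is not the paper's implementation: the ReLU encoder, the single hidden layer, the layer sizes, the coefficient `l1_coeff`, and the function name `sae_loss` are all assumptions made for the example.

```python
import numpy as np

def sae_loss(h, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on the latent code.

    h: (n, d) batch of activations taken from the model under study.
    Returns the scalar loss and the latent codes z of shape (n, k).
    """
    # Assumed encoder: affine map followed by ReLU (hypothetical choice).
    z = np.maximum(h @ W_enc + b_enc, 0.0)
    # Linear decoder maps the sparse code back to activation space.
    h_hat = z @ W_dec + b_dec
    recon = np.mean(np.sum((h - h_hat) ** 2, axis=1))
    # The L1 term pushes most latent units toward zero for each input.
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity, z

# Toy data: 8 activation vectors of width 16, 32 latent features.
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))
W_enc = rng.normal(scale=0.1, size=(16, 32))
b_enc = np.zeros(32)
W_dec = rng.normal(scale=0.1, size=(32, 16))
b_dec = np.zeros(16)

loss, z = sae_loss(h, W_enc, b_enc, W_dec, b_dec)
```

In practice the parameters would be fitted by gradient descent on this loss, and a larger `l1_coeff` trades reconstruction accuracy for sparser codes.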

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The paper presents a methodologically sound framework with experimental validation across different model architectures.
- The proposed method effectively enables permanent weight-space interventions by directly modifying model weights guided by discovered latent features.

Weaknesses

- While the paper proposes a generic framework integrating sparse autoencoders with weight-space interventions, the individual technical components are largely adaptations of existing methods.
- The comparison with activation steering methods is mentioned but not thoroughly explored empirically. Please see questions below for details.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

* Originality: Clear, cohesive pipeline that connects unsupervised SAE features to weight-space edits, not only inference-time steering. The $\alpha_\mathrm{crit}$ metric offers a concrete knob to quantify reliance per sample, which is useful for diagnostics
* Quality: Solid derivation for the analytic approximation of $\alpha_\mathrm{crit}$ with a numerical check; careful discussion about when the linear approximation is reasonable for ResNet versus ViT
* Clarity: Method is easy to follow, with

Weaknesses

* Only Imagenette is used and only two backbones are tested. There is no evaluation on harder datasets, no distribution shift stress tests, no adversarial or corruption benchmarks, and no human studies for interpretability quality. The ROME comparison is minimal and customized to the final layer rather than the standard internal-layer setting
* The discovery component is a standard linear SAE with L1 sparsity. Grad-FAM adapts Grad-CAM to a latent feature target, which is straightforward. The con

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper introduces an interpretable model editing method to directly adapt model behavior.
2. Intervention sensitivity analysis establishes a quantitative suppression threshold to calibrate model edits.

Weaknesses

1. Using SAEs to identify concepts and highlight relevant image regions has been established, such as [1]. Similarly, suppressing model components to adapt model behavior has been explored in works like [2]. Therefore, the technical novelty of this work seems limited.
2. More extensive comparisons with recent model editing or concept-based intervention methods (e.g., [3]) are needed to better demonstrate potential advantages.
3. Experiments are conducted on the relatively small-scale Imagenett

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis