Applying sparse autoencoders to unlearn knowledge in language models
Eoin Farrell, Yeu-Tong Lau, Arthur Conmy

TL;DR
This paper explores the use of sparse autoencoders to selectively unlearn specific knowledge in language models, focusing on biology-related information, and compares their effectiveness to existing fine-tuning methods.
Contribution
It demonstrates that interpretable SAE features can unlearn biology knowledge with minimal side-effects, highlighting the importance of negative scaling and the limitations of feature ablation.
Findings
SAE features can unlearn biology knowledge with minimal side-effects
Negative scaling of features is necessary for effective unlearning
Current SAE techniques are less effective than fine-tuning for unlearning tasks
Abstract
We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from language models. We use the biology subset of the Weapons of Mass Destruction Proxy dataset and test on the gemma-2b-it and gemma-2-2b-it language models. We demonstrate that individual interpretable biology-related SAE features can be used to unlearn a subset of WMDP-Bio questions with minimal side-effects in domains other than biology. Our results suggest that negative scaling of feature activations is necessary and that zero ablating features is ineffective. We find that intervening using multiple SAE features simultaneously can unlearn multiple different topics, but with similar or larger unwanted side-effects than the existing Representation Misdirection for Unlearning technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to the…
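The difference between zero ablating a feature and negatively scaling it can be illustrated with a toy example. The sketch below is not the authors' code: it assumes a standard ReLU SAE with hypothetical weights (`W_enc`, `b_enc`, `W_dec`, `b_dec`), a single activation vector `x`, and an intervention that adds the change in the decoded output back onto `x` so the SAE's reconstruction error is left alone. The names `encode` and `intervene` are illustrative only.

```python
def encode(x, W_enc, b_enc, b_dec):
    """Assumed ReLU SAE encoder: f = relu((x - b_dec) @ W_enc + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    return [
        max(0.0, sum(c * W_enc[i][j] for i, c in enumerate(centered)) + b_enc[j])
        for j in range(len(b_enc))
    ]


def intervene(x, sae, feature, clamp_value):
    """Clamp one feature to clamp_value, but only if it is active on x.

    The change in the feature's activation is pushed through its decoder
    row and added to x; every other feature is left untouched.
    """
    W_enc, b_enc, W_dec, b_dec = sae
    f = encode(x, W_enc, b_enc, b_dec)
    if f[feature] <= 0.0:
        return list(x)  # inactive feature: no intervention
    delta = clamp_value - f[feature]
    return [xi + delta * wi for xi, wi in zip(x, W_dec[feature])]


# Hypothetical toy SAE: 4-dimensional activations, 3 features.
W_dec = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
W_enc = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0]]
sae = (W_enc, [0.0, 0.0, 0.0], W_dec, [0.0, 0.0, 0.0, 0.0])
x = [2.0, 0.0, 1.0, 5.0]

zero_ablated = intervene(x, sae, feature=0, clamp_value=0.0)
neg_scaled = intervene(x, sae, feature=0, clamp_value=-3.0)
```

In this toy setting, zero ablation only removes the feature's contribution along its decoder direction, while a negative clamp pushes the activation past zero in the opposite direction. Which clamp value works in practice, and at which layer, is an empirical question the sketch does not address.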
Peer Reviews
Decision · Submitted to ICLR 2025
I think applying SAEs to this task is useful – for us to do good unlearning we almost certainly want an interpretable method, so these are worthwhile first steps. I like the depth you went into in Section 3, as well as Figure 2. I think the methodology was clearly defined, as well as the metrics and tasks you were evaluating. I think there are some nice ablations as well, e.g., Section 4.2.
I think the messaging of the paper needs to change to increase the novelty by emphasizing how your work and existing work (RMU) differ. Specifically, how yours is more interpretable and why that’s a good thing. I know the latter is mentioned in the intro but that’s the most substantive discussion of this difference, which is the main reason right now people would want to use SAEs for unlearning. I think the experiments section should include both gemma models. Relatedly, I think Figure 4 is wea
* Unlearning
* Minimal side effects
* Not concise: I still don't fully understand what an SAE is. The entire paper proposes a new methodological framework without a single math equation. It is very hard to follow and reads like a conversation between LLM software engineers more so than a technical report on a new methodology.
* Plots everywhere: why is figure 10 cited an entire page before figure 2? The figures should be placed in close proximity to the text in which they are discussed.
* What causes the drop on OpenWebText? What
- The paper addresses an important and current issue in AI safety, focusing on controlled knowledge removal.
- The authors present a thorough analysis of how individual SAE features can be targeted to unlearn specific knowledge, showcasing the possibility for precise, fine-grained control.
- Novelty: I am not sure about the difference between the paper’s method and negative activation in [1] and [2].
- Unlearning Performance: The submission underperforms relative to existing unlearning methods, notably RMU, as benchmarked by the WMDP. While it proposes an innovative approach, it fails to deliver superior results compared to RMU across several metrics, raising concerns about its effectiveness and relevance in high-stakes applications.
- Helpfulness: Furthermore, the exact MMLU accu
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
