Applying sparse autoencoders to unlearn knowledge in language models

Eoin Farrell; Yeu-Tong Lau; Arthur Conmy

arXiv:2410.19278 · cs.LG · November 5, 2024

TL;DR

This paper explores the use of sparse autoencoders to selectively unlearn specific knowledge in language models, focusing on biology-related information, and compares their effectiveness to existing fine-tuning methods.

Contribution

It demonstrates that interpretable SAE features can unlearn biology knowledge with minimal side-effects, highlighting the importance of negative scaling and the limitations of feature ablation.

Findings

01 · SAE features can unlearn biology knowledge with minimal side-effects

02 · Negative scaling of features is necessary for effective unlearning

03 · Current SAE techniques are less effective than fine-tuning for unlearning tasks

Abstract

We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from language models. We use the biology subset of the Weapons of Mass Destruction Proxy dataset and test on the gemma-2b-it and gemma-2-2b-it language models. We demonstrate that individual interpretable biology-related SAE features can be used to unlearn a subset of WMDP-Bio questions with minimal side-effects in domains other than biology. Our results suggest that negative scaling of feature activations is necessary and that zero ablating features is ineffective. We find that intervening using multiple SAE features simultaneously can unlearn multiple different topics, but with similar or larger unwanted side-effects than the existing Representation Misdirection for Unlearning technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to the…
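The contrast the abstract draws between zero ablation and negative scaling can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the ReLU encoder, the random weights, and the choice to edit the activation by adding a delta along the feature's decoder direction (which leaves the SAE reconstruction error untouched) are all assumptions made for illustration. A multiplier of 1.0 leaves the activation unchanged, 0.0 corresponds to zero ablation, and a negative value corresponds to negative scaling.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Hypothetical ReLU SAE encoder: feature activations are non-negative."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def scale_feature(x, W_enc, b_enc, W_dec, feature_idx, multiplier):
    """Rescale one SAE feature inside the model activation x.

    multiplier = 1.0 is a no-op, 0.0 is zero ablation, and a negative
    value is negative scaling. Positions where the feature is inactive
    (activation 0) are left unchanged.
    """
    f = sae_encode(x, W_enc, b_enc)       # (batch, n_features)
    a = f[..., feature_idx]               # (batch,)
    # Change in the feature activation, projected back through its decoder row.
    delta = ((multiplier - 1.0) * a)[..., None] * W_dec[feature_idx]
    return x + delta

# Toy dimensions and random weights, purely illustrative.
rng = np.random.default_rng(0)
d_model, n_features = 8, 16
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
x = rng.normal(size=(4, d_model))

ablated = scale_feature(x, W_enc, b_enc, W_dec, feature_idx=3, multiplier=0.0)
negated = scale_feature(x, W_enc, b_enc, W_dec, feature_idx=3, multiplier=-5.0)
```

In this sketch, zero ablation only removes the feature's current contribution, whereas negative scaling pushes the activation further in the opposite direction along the same decoder row. That is one way to read the abstract's observation that ablation alone is ineffective, though the sketch itself does not test that claim.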

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

I think applying SAEs to this task is useful – for us to do good unlearning we almost certainly want an interpretable method, so these are worthwhile first steps. I like the depth you went into in Section 3, as well as Figure 2. I think the methodology was clearly defined, as well as the metrics and tasks you were evaluating. I think there are some nice ablations as well, e.g., Section 4.2.

Weaknesses

I think the messaging of the paper needs to change to increase the novelty by emphasizing how your work and existing work (RMU) differ. Specifically, how yours is more interpretable and why that’s a good thing. I know the latter is mentioned in the intro but that’s the most substantive discussion of this difference, which is the main reason right now people would want to use SAEs for unlearning. I think the experiments section should include both gemma models. Relatedly, I think Figure 4 is wea

Reviewer 02 · Rating 5 · Confidence 3

Strengths

* Unlearning
* Minimal side effects

Weaknesses

* Not concise: I still don't fully understand what an SAE is. The entire paper proposes a new methodological framework without a single math equation. It is very hard to follow and reads like a conversation between LLM software engineers more so than a technical report on a new methodology.
* Plots everywhere: why is Figure 10 cited an entire page before Figure 2? The figures should be placed close to the text that discusses them.
* What causes the drop on OpenWebText? What

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- The paper addresses an important and current issue in AI safety, focusing on controlled knowledge removal.
- The authors present a thorough analysis of how individual SAE features can be targeted to unlearn specific knowledge, showcasing the possibility for precise, fine-grained control.

Weaknesses

- Novelty: I am not sure about the difference between the paper’s method and negative activation in [1] and [2].
- Unlearning Performance: The submission underperforms relative to existing unlearning methods, notably RMU, as benchmarked by the WMDP. While it proposes an innovative approach, it fails to deliver superior results compared to RMU across several metrics, raising concerns about its effectiveness and relevance in high-stakes applications.
- Helpfulness: Furthermore, the exact MMLU accu

Code & Models

Repositories

efarrell1/train_sparse_autoencoder
pytorch · Official

Models

AMindToThink/gemma-2-2b-it_RMU_s100_a100_layer3
model · 2 dl

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling