Circuit Breaking: Removing Model Behaviors with Targeted Ablation

Maximilian Li; Xander Davies; Max Nadeau

arXiv:2309.05973 · cs.CL · January 31, 2024 · 5 cites

TL;DR

The paper introduces a targeted ablation method that removes an undesirable behavior from a language model by disabling a small number of causal pathways between model components. Applied to GPT-2, it mitigates toxic generation with minimal degradation of performance on other inputs.

Contribution

Given a small dataset of inputs where the model behaves poorly, the method learns which small set of causal pathways to ablate in order to disable the circuit responsible for the behavior.

Findings

01. Ablating just 12 of GPT-2's 11.6K causal edges mitigates toxic language generation.

02. Performance on other inputs degrades only minimally.

03. The undesirable behavior is removed with a small ablation: 12 of 11.6K edges is roughly 0.1% of the causal pathways.

Abstract

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
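
The learning step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (the official repository listed under Code & Models holds that), and every name below is hypothetical. It assumes a tiny linear "model" whose output is a sum of per-edge contributions, zero ablation of an edge, a sigmoid-parameterized soft mask, and a loss with three terms: suppress the output on bad inputs, preserve the output on other inputs, and penalize the number of ablated edges. The paper's actual objective and ablation scheme may differ.

```python
import math

# Toy "model": the output is a sum of per-edge contributions w[e] * x[e].
# A mask value of 1 keeps an edge; 0 ablates it (zero ablation is an assumption).
WEIGHTS = [2.0, 0.5, 1.0, 0.8, 1.5, 0.3]


def forward(x, mask):
    """Model output with each edge contribution scaled by its mask value."""
    return sum(m * w * xi for m, w, xi in zip(mask, WEIGHTS, x))


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def learn_edge_mask(bad_inputs, good_inputs, lam=0.05, lr=0.5, steps=300):
    """Gradient descent on mask logits (hypothetical loss, see lead-in).

    loss = mean_bad  forward(x, mask)^2                       (suppress behavior)
         + mean_good (forward(x, mask) - forward(x, full))^2  (preserve the rest)
         + lam * sum(1 - mask)                                (ablate few edges)
    """
    n = len(WEIGHTS)
    theta = [3.0] * n          # logits; sigmoid(3) is about 0.95, so edges start kept
    full = [1.0] * n
    for _ in range(steps):
        mask = [sigmoid(t) for t in theta]
        grad_m = [-lam] * n    # derivative of lam * sum(1 - mask) w.r.t. each mask entry
        for x in bad_inputs:
            y = forward(x, mask)
            for e in range(n):
                grad_m[e] += 2.0 * y * WEIGHTS[e] * x[e] / len(bad_inputs)
        for x in good_inputs:
            diff = forward(x, mask) - forward(x, full)
            for e in range(n):
                grad_m[e] += 2.0 * diff * WEIGHTS[e] * x[e] / len(good_inputs)
        for e in range(n):
            m = mask[e]
            theta[e] -= lr * grad_m[e] * m * (1.0 - m)   # chain rule through sigmoid
    return [sigmoid(t) for t in theta]


def ablated_edges(mask, threshold=0.5):
    """Indices of edges whose learned mask fell below the threshold."""
    return [e for e, m in enumerate(mask) if m < threshold]


# Bad inputs rely on edge 0 only; the other inputs never use it (toy data).
BAD_INPUTS = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.8, 0.0, 0.0, 0.0, 0.0, 0.0]]
GOOD_INPUTS = [[0.0, 1.0, 0.5, 0.0, 1.0, 0.2], [0.0, 0.3, 1.0, 1.0, 0.0, 0.7]]

learned_mask = learn_edge_mask(BAD_INPUTS, GOOD_INPUTS)
edges_to_ablate = ablated_edges(learned_mask)

# Discretize: fully ablate the selected edges, keep every other edge intact.
hard_mask = [0.0 if e in edges_to_ablate else 1.0 for e in range(len(WEIGHTS))]
```

In this toy setup only edge 0, the one the bad inputs depend on, falls below the threshold, so the discretized mask zeroes the output on the bad inputs and leaves the output on the other inputs unchanged. A real language model has overlapping pathways, which is why the paper reports minimal rather than zero degradation.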

Code & Models

Repositories

xanderdavies/circuit-breaking (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Dropout · Discriminative Fine-Tuning · Residual Connection · Adam · Weight Decay · Cosine Annealing