Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders
Agam Goyal, Vedant Rathi, William Yeh, Yian Wang, Yuen Chen, Hari Sundaram

TL;DR
This paper introduces a novel detoxification method for large language models using sparse autoencoders to perform targeted activation steering, effectively reducing toxicity while maintaining language fluency and model capabilities.
Contribution
The paper proposes a new SAE-based causal intervention technique for LLM detoxification, demonstrating its effectiveness and analyzing the trade-offs involved.
Findings
Up to 20% reduction in toxicity compared to baselines
Stronger steering can noticeably degrade fluency, particularly on GPT-2 Small
Model knowledge remains stable after steering
Abstract
Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending…
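The steering step described in the abstract can be illustrated with a minimal sketch. It assumes the simplest form of activation steering: subtracting a scaled, unit-normalized SAE decoder vector from each token's residual-stream activation. The function names, the toy activations, and the three strength values are hypothetical and chosen only for illustration; the paper's actual tiers, layers, and feature selection are not reproduced here.

```python
import math


def unit_vector(vec):
    """Return vec scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def projection(token, decoder_vec):
    """Scalar projection of one token's activation onto the decoder direction."""
    return sum(h * u for h, u in zip(token, unit_vector(decoder_vec)))


def steer_away(resid, decoder_vec, strength):
    """Shift every token's activation against the decoder direction.

    resid:       list of per-token activation vectors (toy residual stream)
    decoder_vec: SAE decoder vector of a feature assumed to be toxicity-related
    strength:    non-negative scalar; larger values steer more aggressively
    """
    unit = unit_vector(decoder_vec)
    return [[h - strength * u for h, u in zip(token, unit)] for token in resid]


# Toy data: two tokens in a 3-dimensional "residual stream" (assumed values).
resid = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]
toxic_dir = [3.0, 4.0, 0.0]

# Three illustrative strengths standing in for tiers of aggressiveness.
for strength in (0.5, 1.0, 2.0):
    steered = steer_away(resid, toxic_dir, strength)
    before = projection(resid[0], toxic_dir)
    after = projection(steered[0], toxic_dir)
    print(f"strength={strength}: projection {before:.2f} -> {after:.2f}")
```

In this sketch each token's projection onto the chosen direction drops by exactly `strength`, while components orthogonal to that direction are untouched. That makes the trade-off in the abstract concrete: a larger strength removes more of the targeted feature but also moves activations further from where the model normally operates.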
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
Methods: Cosine Annealing · Linear Warmup With Cosine Annealing · Softmax · Attention Dropout · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay
