Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders

Agam Goyal; Vedant Rathi; William Yeh; Yian Wang; Yuen Chen; Hari Sundaram

arXiv:2505.14536·cs.CL·October 24, 2025

TL;DR

This paper introduces a detoxification method for large language models that uses sparse autoencoders (SAEs) to perform targeted activation steering. It reduces toxicity while keeping model knowledge stable, though fluency can degrade at stronger steering strengths.

Contribution

The paper proposes an SAE-based causal intervention technique for LLM detoxification, evaluates it at three tiers of steering aggressiveness on GPT-2 Small and Gemma-2-2B, and analyzes the trade-off between toxicity reduction and language fluency.

Findings

01 Up to 20% reduction in toxicity compared to baselines

02 Steering can degrade fluency depending on aggressiveness

03 Model knowledge remains stable after steering

Abstract

Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending…
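The steering step described in the abstract can be illustrated with a minimal sketch. It assumes the simplest common form of activation steering: subtracting a scaled unit direction from every token's residual-stream vector. The paper's exact formula, its three tiers, and the layer it intervenes on are not given on this page, so the names `steer`, `projection`, and `alpha` are hypothetical, and the numbers are toy values rather than real model activations.

```python
import math


def normalize(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return [x / norm for x in vec]


def projection(hidden, direction):
    """Scalar projection of one residual vector onto the unit direction."""
    unit = normalize(direction)
    return sum(h * d for h, d in zip(hidden, unit))


def steer(residual, direction, alpha):
    """Subtract alpha * unit(direction) from each token's residual vector.

    residual:  list of per-token vectors (seq_len x d_model)
    direction: an SAE decoder vector for a toxicity-related feature (assumed)
    alpha:     steering strength; larger values steer more aggressively
    """
    unit = normalize(direction)
    return [[h - alpha * d for h, d in zip(token, unit)] for token in residual]


# Toy example: two tokens, d_model = 3.
residual = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]
toxic_direction = [0.0, 3.0, 4.0]  # hypothetical decoder vector

steered = steer(residual, toxic_direction, alpha=2.0)
before = projection(residual[0], toxic_direction)
after = projection(steered[0], toxic_direction)
print(f"projection before: {before:.2f}, after: {after:.2f}")
```

Under this assumption, each token's projection onto the chosen direction drops by exactly `alpha`, while components orthogonal to the direction are untouched. That is one way to see why stronger steering removes more of the targeted feature but also moves activations further from what the model was trained on, consistent with the fluency trade-off the abstract reports.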

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning

Methods: Cosine Annealing · Linear Warmup With Cosine Annealing · Softmax · Attention Dropout · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay