Large Language Models can be Strong Self-Detoxifiers
Ching-Yun Ko, Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, Tejaswini Pedapati, Luca Daniel

TL;DR
This paper introduces SASA, a lightweight decoding method that enables large language models to self-detoxify by dynamically steering away from toxic outputs using internal representations, without extra training or reward models.
Contribution
The paper presents SASA, a novel, efficient decoding algorithm that allows LLMs to reduce toxicity internally without additional training or external reward models.
Findings
SASA significantly reduces toxicity across multiple LLMs.
SASA achieves comparable detoxification performance to state-of-the-art methods.
SASA improves output quality while maintaining model capabilities.
Abstract
Reducing the likelihood of generating harmful and toxic output is an essential task when aligning large language models (LLMs). Existing methods mainly rely on training an external reward model (i.e., another language model) or fine-tuning the LLM using self-generated data to influence the outcome. In this paper, we show that LLMs have the capability of self-detoxification without the use of an additional reward model or re-training. We propose Self-disciplined Autoregressive Sampling (SASA), a lightweight controlled decoding algorithm for toxicity reduction of LLMs. SASA leverages the contextual representations from an LLM to learn linear subspaces characterizing toxic vs. non-toxic output in analytical forms. When auto-completing a response token-by-token, SASA dynamically tracks the margin of the current output to steer the generation away from the toxic subspace, by…
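The abstract is cut off before it states how the margin changes the sampling step, so the following is only an illustrative sketch of margin-guided decoding in general, not the authors' algorithm. It assumes a difference-of-class-means linear separator, an additive logit shift scaled by a hypothetical strength parameter `beta`, and toy embeddings standing in for the LLM's contextual representations. All function names are made up for this example.

```python
import math


def fit_linear_margin(toxic, benign):
    """Fit a linear margin w.x + b from two lists of embedding vectors.

    Assumption: a difference-of-class-means separator, chosen for brevity.
    A positive margin means the vector lies on the non-toxic side.
    """
    dim = len(toxic[0])
    mu_toxic = [sum(v[i] for v in toxic) / len(toxic) for i in range(dim)]
    mu_benign = [sum(v[i] for v in benign) / len(benign) for i in range(dim)]
    w = [mb - mt for mb, mt in zip(mu_benign, mu_toxic)]
    midpoint = [(mb + mt) / 2 for mb, mt in zip(mu_benign, mu_toxic)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def margin(w, b, x):
    """Signed score of embedding x relative to the separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def steer(logits, candidate_embeddings, w, b, beta=1.0):
    """Shift each candidate token's logit by beta * margin, then softmax.

    candidate_embeddings[i] is a hypothetical embedding of the context
    extended with candidate token i. beta = 0 leaves the distribution as is.
    """
    adjusted = [
        logit + beta * margin(w, b, emb)
        for logit, emb in zip(logits, candidate_embeddings)
    ]
    peak = max(adjusted)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]
```

With two candidate tokens that start with equal logits, the one whose embedding falls on the non-toxic side of the hyperplane ends up with the higher sampling probability, and a larger `beta` widens that gap.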
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science
Methods: Stand-Alone Self Attention
