Large Language Models can be Strong Self-Detoxifiers

Ching-Yun Ko; Pin-Yu Chen; Payel Das; Youssef Mroueh; Soham Dan; Georgios Kollias; Subhajit Chaudhury; Tejaswini Pedapati; Luca Daniel

arXiv:2410.03818 · cs.LG · October 8, 2024

TL;DR

This paper introduces SASA, a lightweight decoding method that enables large language models to self-detoxify by dynamically steering away from toxic outputs using internal representations, without extra training or reward models.

Contribution

The paper presents SASA, a novel, efficient decoding algorithm that allows LLMs to reduce toxicity internally without additional training or external reward models.

Findings

1. SASA significantly reduces toxicity across multiple LLMs.

2. SASA achieves comparable detoxification performance to state-of-the-art methods.

3. SASA improves output quality while maintaining model capabilities.

Abstract

Reducing the likelihood of generating harmful and toxic output is an essential task when aligning large language models (LLMs). Existing methods mainly rely on training an external reward model (i.e., another language model) or fine-tuning the LLM using self-generated data to influence the outcome. In this paper, we show that LLMs have the capability of self-detoxification without the use of an additional reward model or re-training. We propose Self-disciplined Autoregressive Sampling (SASA), a lightweight controlled decoding algorithm for toxicity reduction of LLMs. SASA leverages the contextual representations from an LLM to learn linear subspaces characterizing toxic vs. non-toxic output in analytical forms. When auto-completing a response token-by-token, SASA dynamically tracks the margin of the current output to steer the generation away from the toxic subspace, by…
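
The abstract is cut off before it states how the margin changes the sampling distribution, so the following is only a rough sketch of the general idea and not the paper's algorithm. It assumes a closed-form linear separator (shared-covariance Gaussian discriminant) fitted on synthetic stand-ins for contextual embeddings, and it assumes an additive logit shift `beta * margin`. The function names, the `beta` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def fit_linear_boundary(nontoxic, toxic, ridge=1e-3):
    """Closed-form linear separator between two embedding clouds.

    ASSUMPTION: shared-covariance Gaussian (LDA-style) fit; the source only
    says the subspaces are learned "in analytical forms".
    """
    mu_pos, mu_neg = nontoxic.mean(axis=0), toxic.mean(axis=0)
    centered = np.vstack([nontoxic - mu_pos, toxic - mu_neg])
    cov = centered.T @ centered / len(centered) + ridge * np.eye(centered.shape[1])
    w = np.linalg.solve(cov, mu_pos - mu_neg)
    b = -0.5 * w @ (mu_pos + mu_neg)
    return w, b

def margin(w, b, embeddings):
    """Signed score: positive on the non-toxic side of the boundary."""
    return embeddings @ w + b

def steer_logits(logits, candidate_embeddings, w, b, beta=5.0):
    """ASSUMPTION: additive reweighting of next-token logits by the margin."""
    return logits + beta * margin(w, b, candidate_embeddings)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Synthetic stand-ins for contextual embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim = 8
direction = np.zeros(dim)
direction[0] = 2.0
nontoxic = rng.normal(size=(200, dim)) + direction
toxic = rng.normal(size=(200, dim)) - direction
w, b = fit_linear_boundary(nontoxic, toxic)

# Three candidate continuations: non-toxic-like, toxic-like, neutral.
candidates = np.vstack([direction, -direction, np.zeros(dim)])
logits = np.zeros(3)  # the base model is indifferent in this toy case
probs_before = softmax(logits)
probs_after = softmax(steer_logits(logits, candidates, w, b))
```

In this toy setup the steered distribution moves probability mass toward the candidate on the non-toxic side of the boundary and away from the toxic-like one. With a real model, the embeddings would come from the LLM's own hidden states rather than from random draws.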

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science

Methods: Stand-Alone Self Attention