Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings
Andrew Wang, Mohit Sudhakar, Yangfeng Ji

TL;DR
This paper identifies a low-dimensional toxic subspace within language model embeddings, enabling effective removal of toxic features and improving the safety of generated text.
Contribution
It introduces a novel method to locate and remove a toxic subspace in language model embeddings, demonstrating its generalization across multiple toxicity datasets.
Findings
Removing the toxic subspace significantly reduces toxic representations.
The toxic subspace generalizes across different toxicity datasets.
The method preserves the overall quality of language model outputs.
Abstract
Large pre-trained language models are often trained on large volumes of internet data, some of which may contain toxic or abusive language. Consequently, language models encode toxic information, which limits their real-world usage. Current methods aim to prevent toxic features from appearing in generated text. We hypothesize the existence of a low-dimensional toxic subspace in the latent space of pre-trained language models, the existence of which suggests that toxic features follow some underlying pattern and are thus removable. To construct this toxic subspace, we propose a method to generalize toxic directions in the latent space. We also provide a methodology for constructing parallel datasets using a context-based word masking system. Through our experiments, we show that when the toxic subspace is removed from a set of sentence representations, almost…
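The idea of removing a linear subspace from sentence representations can be illustrated with a small sketch. The excerpt above does not specify the paper's exact construction, so everything here is an assumption made for illustration: the function names `toxic_subspace` and `remove_subspace` are hypothetical, the toxic directions are taken to be differences between paired toxic and non-toxic embeddings, the subspace is estimated with an uncentered SVD of those differences, and the data is synthetic.

```python
import numpy as np

def toxic_subspace(toxic_emb, clean_emb, k):
    """Estimate a k-dimensional subspace from paired embedding differences.

    Assumption: each row of toxic_emb pairs with the same row of clean_emb,
    as in a parallel dataset. This is an illustrative choice, not the
    paper's confirmed procedure.
    """
    diffs = toxic_emb - clean_emb  # one candidate toxic direction per pair
    # Rows of vt are orthonormal right singular vectors, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]  # shape (k, d)

def remove_subspace(emb, basis):
    """Project embeddings onto the orthogonal complement of span(basis)."""
    return emb - (emb @ basis.T) @ basis

# Synthetic demo: plant k hidden directions, then try to recover and remove them.
rng = np.random.default_rng(0)
d, n, k = 16, 200, 2
true_dirs = np.linalg.qr(rng.normal(size=(d, k)))[0].T  # (k, d), orthonormal rows
clean = rng.normal(size=(n, d))
toxic = clean + 3.0 * rng.normal(size=(n, k)) @ true_dirs

basis = toxic_subspace(toxic, clean, k)
cleaned = remove_subspace(toxic, basis)

# Largest remaining component along the planted directions after removal.
residual = np.abs(cleaned @ true_dirs.T).max()
```

In this toy setting the differences lie exactly in the planted subspace, so the projection removes it up to floating-point error. Real embeddings would be noisier, and the choice of `k` would need to be validated empirically.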
Taxonomy
Topics: Adversarial Robustness in Machine Learning
