Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings

Andrew Wang; Mohit Sudhakar; Yangfeng Ji

arXiv:2112.08346·cs.CL·December 16, 2021

TL;DR

This paper identifies a low-dimensional toxic subspace within language model embeddings, enabling effective removal of toxic features and improving the safety of generated text.

Contribution

It introduces a novel method to locate and remove a toxic subspace in language model embeddings, demonstrating its generalization across multiple toxicity datasets.

Findings

01. Removing the toxic subspace significantly reduces toxic representations.

02. The toxic subspace generalizes across different toxicity datasets.

03. The method preserves the overall quality of language model outputs.

Abstract

Large pre-trained language models are often trained on large volumes of internet data, some of which may contain toxic or abusive language. Consequently, language models encode toxic information, which limits their real-world use. Current methods aim to prevent toxic features from appearing in generated text. We hypothesize the existence of a low-dimensional toxic subspace in the latent space of pre-trained language models, the existence of which suggests that toxic features follow some underlying pattern and are thus removable. To construct this toxic subspace, we propose a method to generalize toxic directions in the latent space. We also provide a methodology for constructing parallel datasets using a context-based word masking system. Through our experiments, we show that when the toxic subspace is removed from a set of sentence representations, almost…
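
The idea of removing a linear subspace from sentence representations can be illustrated with a small sketch. It is not the authors' implementation: it assumes paired toxic and non-toxic embeddings are already available, estimates the subspace from the top singular vectors of their differences (one common choice, not confirmed by the abstract), and uses synthetic data. The names toxic_subspace and remove_subspace, and all sizes, are hypothetical.

```python
import numpy as np


def toxic_subspace(toxic_emb, clean_emb, k):
    """Estimate k orthonormal directions from paired embedding differences.

    Assumption: the subspace is taken as the top-k right singular vectors
    of the (toxic - clean) difference matrix.
    """
    diffs = toxic_emb - clean_emb                      # shape (n_pairs, d)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                      # shape (k, d)


def remove_subspace(emb, basis):
    """Project embeddings onto the orthogonal complement of the subspace."""
    return emb - (emb @ basis.T) @ basis


# Synthetic stand-in for sentence embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
d, n, k = 16, 200, 2

# Plant k orthonormal "toxic" directions and add them to clean embeddings.
true_dirs = np.linalg.qr(rng.normal(size=(d, k)))[0].T  # shape (k, d)
clean = rng.normal(size=(n, d))
coeffs = rng.normal(loc=3.0, size=(n, k))
toxic = clean + coeffs @ true_dirs

basis = toxic_subspace(toxic, clean, k)
cleaned = remove_subspace(toxic, basis)

# Largest remaining component along the estimated subspace (near zero).
residual = np.abs(cleaned @ basis.T).max()
```

In this toy setup the planted directions are recovered exactly, so projecting the toxic embeddings yields the same vectors as projecting their clean counterparts. Real embeddings would not separate this cleanly, and the choice of k would need to be validated empirically.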

Taxonomy

Topics: Adversarial Robustness in Machine Learning