GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace

Zenghao Duan; Zhiyi Yin; Zhichao Shi; Liang Pang; Shaoling Jing; Jiayi Wu; Yu Yan; Huawei Shen; Xueqi Cheng

arXiv:2505.17078·cs.CL·May 26, 2025

TL;DR

This paper introduces GloSS, a novel method that identifies and suppresses a global toxic subspace in LLMs, effectively reducing toxicity while maintaining model performance without extensive retraining.

Contribution

The paper shows that a global toxic subspace represents the model's toxic region more effectively than sets of toxic vectors or layer-wise subspaces, and it proposes a lightweight, four-stage detoxification method that outperforms existing approaches.

Findings

01

GloSS achieves state-of-the-art toxicity reduction across various LLMs.

02

The method preserves the models' general capabilities.

03

GloSS does not require large-scale data or retraining.

Abstract

This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of the toxic region within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the parameters of the FFN. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the models' general capabilities, without requiring large-scale data or…
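
The abstract describes removing a subspace from FFN parameters but does not spell out the four stages. As a rough illustration of the final "removal" idea only, the sketch below projects a set of directions out of the rows of a weight matrix using plain Python lists. The function names (`orthonormalize`, `remove_subspace`), the toy matrix, and the choice of orthogonal projection are assumptions made for illustration; they are not taken from the paper, and the sketch does not show how GloSS identifies the toxic subspace.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(weight_rows, toxic_directions):
    """Project each weight row onto the orthogonal complement of the
    subspace spanned by `toxic_directions` (hypothetical inputs)."""
    basis = orthonormalize(toxic_directions)
    cleaned = []
    for row in weight_rows:
        r = list(row)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        cleaned.append(r)
    return cleaned


# Toy example: a 2x3 "weight matrix" and two directions to suppress.
weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = remove_subspace(weights, directions)
```

After the projection, every row of `cleaned` has zero inner product with each suppressed direction, while the component outside that subspace is left unchanged. A real implementation would operate on the model's actual FFN weight tensors with a numerical library rather than on nested lists.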
