Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification

Zenghao Duan; Zhiyi Yin; Zhichao Shi; Liang Pang; Shaoling Jing; Zihe Huang; Jiayi Wu; Yu Yan; Jingcheng Deng; Huawei Shen; Xueqi Cheng

arXiv:2601.06226 · cs.LG · January 13, 2026

Open Access

TL;DR

This paper introduces GLOSS, a novel method to identify and remove toxic subspaces in large language models' parameters, significantly reducing toxicity while maintaining performance.

Contribution

GLOSS is a lightweight, effective approach that targets and eliminates the global toxic subspace in LLMs, outperforming existing detoxification methods without extensive retraining.

Findings

1. GLOSS achieves state-of-the-art detoxification results.

2. It preserves the model's general capabilities.

3. It requires less retraining compared to traditional methods.

Abstract

Large language models (LLMs) exhibit exceptional performance but pose inherent risks of generating toxic content, restricting their safe deployment. While traditional methods (e.g., alignment) adjust output preferences, they fail to eliminate the underlying toxic regions in the parameters, leaving models vulnerable to adversarial attacks. Prior mechanistic studies characterize toxic regions as "toxic vectors" or "layer-wise subspaces", yet our analysis identifies critical limitations: i) removed toxic vectors can be reconstructed via linear combinations of non-toxic vectors, so the entire toxic subspace must be targeted; ii) contrastive objectives over limited samples inject noise into layer-wise subspaces, hindering stable extraction. These limitations highlight the challenge of identifying a robust toxic subspace and removing it. Therefore, we propose GLOSS (GLobal tOxic Subspace Suppression), a…
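Limitation i) in the abstract can be illustrated with a toy linear-algebra example. The sketch below is not the GLOSS procedure, whose details are cut off in this excerpt. It is a minimal NumPy illustration under assumed inputs: the direction `toxic`, the toy matrix `weights`, and the helpers `orthonormal_basis` and `project_out` are all hypothetical names introduced here. It shows that deleting a single "toxic vector" leaves its direction recoverable from the remaining rows, whereas projecting every row onto the orthogonal complement of the subspace does not.

```python
import numpy as np


def orthonormal_basis(directions):
    """Orthonormal basis (as columns) for the span of the rows of `directions`.

    Assumes the rows are linearly independent (illustrative helper).
    """
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return q


def project_out(weights, basis):
    """Subtract from each row of `weights` its component inside span(basis)."""
    return weights - (weights @ basis) @ basis.T


d = 6
toxic = np.zeros(d)
toxic[0] = 1.0  # hypothetical toxic direction, chosen only for illustration
n = np.array([0.0, 1.0, -2.0, 0.5, 0.0, 3.0])  # component orthogonal to `toxic`

# Row 0 is the toxic vector itself; rows 1 and 2 are dominated by `n`,
# but each carries half of the toxic direction.
weights = np.stack([toxic, 0.5 * toxic + n, 0.5 * toxic - n])

# Vector-level removal: drop row 0 only. The remaining rows still sum
# to the toxic direction, so it can be reconstructed.
pruned = weights[1:]
reconstructed = pruned.sum(axis=0)

# Subspace-level removal: project every row off the toxic subspace.
basis = orthonormal_basis(toxic[None, :])
cleaned = project_out(weights, basis)

# Largest remaining component along `toxic` across all rows.
residual = np.abs(cleaned @ toxic).max()
```

After the projection, every row is orthogonal to `toxic`, so no linear combination of the rows can have a component along it. With more than one toxic direction, the same helpers apply by passing several rows to `orthonormal_basis`.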

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling