Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention
Himanshu Singh, Ziwei Xu, A. V. Subramanyam, Mohan Kankanhalli

TL;DR
This paper introduces a subspace intervention method that reduces toxicity in the outputs of large language models while preserving fluency and adding little computational cost at inference time.
Contribution
The paper presents a novel subspace intervention technique that identifies and suppresses toxic patterns in LLM representations, improving safety while preserving generation quality.
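To make the idea of suppressing a subspace of a representation concrete, here is a minimal sketch in plain Python. It is not the authors' method, whose details are not given in this summary. It assumes, purely for illustration, that candidate "toxic" directions are estimated as the difference between mean hidden states of toxic and benign examples, and that suppression means projecting a hidden state off the span of those directions. The function names (`mean_difference`, `orthonormalize`, `suppress`), the `alpha` strength parameter, and the toy vectors are all hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_difference(toxic, benign):
    """Assumed direction estimate: mean toxic state minus mean benign state."""
    dim = len(toxic[0])
    mean_toxic = [sum(h[i] for h in toxic) / len(toxic) for i in range(dim)]
    mean_benign = [sum(h[i] for h in benign) / len(benign) for i in range(dim)]
    return [t - b for t, b in zip(mean_toxic, mean_benign)]


def orthonormalize(vectors, eps=1e-10):
    """Gram-Schmidt: turn candidate directions into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis


def suppress(hidden, basis, alpha=1.0):
    """Remove (alpha=1) or damp (0 < alpha < 1) the part of `hidden` in the subspace."""
    out = list(hidden)
    for u in basis:
        c = dot(hidden, u)
        out = [o - alpha * c * ui for o, ui in zip(out, u)]
    return out


# Toy 3-dimensional "hidden states" (made-up numbers, not real model activations).
toxic_states = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
benign_states = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

basis = orthonormalize([mean_difference(toxic_states, benign_states)])
cleaned = suppress([5.0, 2.0, 1.0], basis)
```

In this toy setup the estimated direction lies along the first coordinate, so `suppress` zeroes that coordinate and leaves the others unchanged. Because the update is a fixed linear projection per hidden state, a scheme of this shape would add little work at inference, which is in the spirit of the low-overhead claim above.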
Findings
Reduces toxicity by 8-20% across multiple LLMs.
Maintains comparable fluency and coherence in generated content.
Achieves strong mitigation performance with minimal inference impact.
Abstract
Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence or fluency of the generated text. In this work, we present a targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns in the underlying model representations, while preserving the model's overall ability to generate safe, fluent content. On the RealToxicityPrompts benchmark, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Adversarial Robustness in Machine Learning
