Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, Pau Rodríguez

TL;DR
This paper introduces AurA, a model-agnostic intervention that reduces toxicity in large language models by attenuating neurons based on their ability to identify toxic content, effectively decreasing toxicity while maintaining performance.
Contribution
The paper presents AurA, a method free of model-dependent hyperparameters that mitigates toxicity across model scales by dampening each neuron in proportion to how well it discriminates toxic content.
Findings
Achieves up to 2.2x reduction in toxicity
Keeps the perplexity increase to 0.72
Effective across models from 1.5B to 40B parameters
Abstract
An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AurA), an intervention that can be applied to any pre-trained LLM to mitigate toxicity. As the intervention is proportional to the ability of each neuron to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that AurA can achieve up to 2.2x reduction in toxicity with only a 0.72 perplexity increase. We also show that AurA is effective with models of different scale (from 1.5B to 40B parameters), and its effectiveness in mitigating toxic language, while preserving…
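The idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy data, and in particular the mapping from AUROC to a dampening factor (1 at chance level, falling linearly to 0 at perfect discrimination) are assumptions made for illustration, since the abstract only states that the reduction is proportional to each neuron's discriminative power. Each row of `activations` is assumed to be a per-sentence summary of neuron activations.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC of `scores` as a detector of labels == 1.

    Uses the pairwise definition: the probability that a toxic sentence
    scores higher than a non-toxic one, with ties counted as 0.5.
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

def dampening_factors(activations, labels):
    """Per-neuron multiplicative factors in [0, 1].

    activations: array of shape (num_sentences, num_neurons)
    labels:      array of shape (num_sentences,), 1 = toxic, 0 = non-toxic
    """
    num_neurons = activations.shape[1]
    scores = np.array([auroc(activations[:, m], labels) for m in range(num_neurons)])
    # Assumed mapping (illustrative): neurons at or below chance (0.5) are left
    # untouched; above chance, the factor shrinks linearly to 0 at AUROC = 1.
    return np.where(scores > 0.5, 1.0 - 2.0 * (scores - 0.5), 1.0)

# Toy data: 4 sentences (2 toxic, 2 non-toxic) and 3 hypothetical neurons.
activations = np.array([
    [0.9, 0.2, 0.5],
    [0.8, 0.7, 0.7],
    [0.1, 0.3, 0.6],
    [0.2, 0.6, 0.4],
])
labels = np.array([1, 1, 0, 0])

alphas = dampening_factors(activations, labels)
dampened = activations * alphas  # scale each neuron's output by its factor
```

In this toy example the first neuron separates the two classes perfectly and is fully suppressed, the second is at chance and is left unchanged, and the third is scaled by one half. Because the factors are derived from the AUROC values alone, the sketch has no threshold or strength parameter to tune per model.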
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
