Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

Xavier Suau; Pieter Delobelle; Katherine Metcalf; Armand Joulin; Nicholas Apostoloff; Luca Zappella; Pau Rodríguez

arXiv:2407.12824 · cs.CL · July 19, 2024

Open Access

TL;DR

This paper introduces AurA, an intervention applicable to any pre-trained LLM that reduces toxicity by dampening each neuron's activation in proportion to its ability to discriminate toxic content, achieving up to a 2.2× reduction in toxicity with a perplexity increase of only 0.72.

Contribution

The paper presents AurA (AUROC adaptation), an intervention free of model-dependent hyperparameters that mitigates toxicity across model scales by reducing the activation levels of neurons that discriminate toxic sentences.

Findings

01. Achieves up to a 2.2× reduction in toxicity

02. Keeps the perplexity increase to only 0.72

03. Effective across models from 1.5B to 40B parameters

Abstract

An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AurA), an intervention that can be applied to any pre-trained LLM to mitigate toxicity. As the intervention is proportional to the ability of each neuron to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that AurA can achieve up to $2.2 \times$ reduction in toxicity with only a $0.72$ perplexity increase. We also show that AurA is effective with models of different scale (from 1.5B to 40B parameters), and its effectiveness in mitigating toxic language, while preserving…
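The mechanism described in the abstract (score each neuron by how well it discriminates toxic sentences, then reduce its activation in proportion to that score) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names `neuron_auroc` and `dampening_factors` are hypothetical, the per-sentence activations are assumed to be already pooled into one value per neuron, and the linear map from AUROC to a scaling factor (AUROC 0.5 leaves the neuron untouched, AUROC 1.0 silences it) is one plausible reading of "proportionally to this power", since the abstract shown here does not state the exact formula.

```python
import numpy as np


def neuron_auroc(activations, labels):
    """AUROC of each neuron when its activation is used as a toxicity score.

    activations: array of shape (n_sentences, n_neurons), one pooled value
                 per sentence and neuron (the pooling is an assumption).
    labels:      array of shape (n_sentences,), 1 for toxic, 0 for non-toxic.
    """
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos = activations[labels]    # toxic sentences, (n_pos, n_neurons)
    neg = activations[~labels]   # non-toxic sentences, (n_neg, n_neurons)
    # AUROC equals the probability that a toxic sentence activates the
    # neuron more than a non-toxic one, counting ties as one half.
    greater = (pos[:, None, :] > neg[None, :, :]).mean(axis=(0, 1))
    ties = (pos[:, None, :] == neg[None, :, :]).mean(axis=(0, 1))
    return greater + 0.5 * ties


def dampening_factors(auroc):
    """Map each neuron's AUROC to a multiplicative factor in [0, 1].

    Assumed rule: neurons at or below chance (AUROC <= 0.5) are left
    untouched; above chance the factor falls linearly to 0 at AUROC = 1.
    """
    auroc = np.asarray(auroc, dtype=float)
    strength = 2.0 * (auroc - 0.5)   # 0 at chance level, 1 for a perfect detector
    return np.where(auroc > 0.5, 1.0 - strength, 1.0)
```

At inference time, such factors would multiply the corresponding neuron outputs (for example through a forward hook on the relevant layer). Because the factors are derived entirely from the measured AUROC values, this kind of scheme needs no per-model tuning knob, which matches the abstract's claim of being free of model-dependent hyperparameters.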

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)