Detoxifying LLMs via Representation Erasure-Based Preference Optimization
Nazanin Mohammadi Sepahvand, Eleni Triantafillou, Hugo Larochelle, Doina Precup, Daniel M. Roy, Gintare Karolina Dziugaite

TL;DR
This paper introduces REPO, a novel method for detoxifying large language models by making deep, localized edits to toxicity-related neurons, significantly improving robustness against adversarial and fine-tuning attacks.
Contribution
REPO reformulates detoxification as a token-level preference problem, inducing deep edits in toxicity neurons while maintaining overall model utility.
Findings
REPO achieves state-of-the-art robustness against adversarial attacks.
REPO effectively reduces toxic outputs in LLMs.
REPO preserves general language understanding despite detoxification.
Abstract
Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical:…
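The abstract's idea of pulling the representations of toxic continuations toward their benign counterparts can be illustrated with a toy loss. The sketch below is not the REPO objective, which the abstract does not spell out. It assumes a hypothetical setup in which each toxic token has a paired benign token and both hidden states are available as plain lists of floats, and it uses a simple mean squared distance as a stand-in. The function name `token_alignment_loss` is invented for illustration.

```python
# Illustrative sketch only: this is NOT the objective from the paper.
# Assumption: hidden states are given as one list of floats per token,
# and toxic and benign tokens are already paired position by position.

def token_alignment_loss(toxic_states, benign_states):
    """Mean over tokens of the mean squared difference between paired hidden states."""
    if not toxic_states:
        raise ValueError("need at least one token")
    if len(toxic_states) != len(benign_states):
        raise ValueError("toxic and benign sequences must have the same length")

    total = 0.0
    for toxic_vec, benign_vec in zip(toxic_states, benign_states):
        if len(toxic_vec) != len(benign_vec):
            raise ValueError("paired hidden states must have the same dimension")
        # Per-token squared distance, averaged over the hidden dimension.
        squared = sum((t - b) ** 2 for t, b in zip(toxic_vec, benign_vec))
        total += squared / len(toxic_vec)

    # Average over tokens so the value does not grow with sequence length.
    return total / len(toxic_states)


# Two tokens with 2-dimensional hidden states (made-up numbers).
toxic = [[1.0, 0.0], [0.0, 2.0]]
benign = [[0.0, 0.0], [0.0, 0.0]]
loss = token_alignment_loss(toxic, benign)
```

A loss of this kind reaches zero only when every toxic-token representation coincides with its benign counterpart. In an actual training setup the hidden states would come from a model's intermediate layers and the loss would be minimized by gradient descent, typically alongside a term that preserves general utility; those details are assumptions here, not claims about the paper.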
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks · Topic Modeling
