Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Nazanin Mohammadi Sepahvand; Eleni Triantafillou; Hugo Larochelle; Doina Precup; Daniel M. Roy; Gintare Karolina Dziugaite

arXiv:2602.23391 · cs.LG · March 2, 2026

Open Access

TL;DR

This paper introduces REPO, a method for detoxifying large language models that makes deep, localized edits to toxicity-related neurons, improving robustness against adversarial prompting and fine-tuning-based relearning attacks.

Contribution

REPO reformulates detoxification as a token-level preference problem, inducing deep edits in toxicity neurons while maintaining overall model utility.

Findings

01. REPO achieves state-of-the-art robustness against adversarial attacks.

02. REPO effectively reduces toxic outputs in LLMs.

03. REPO preserves general language understanding despite detoxification.

Abstract

Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical:…
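
To make the idea of token-level representation convergence concrete, here is a minimal sketch. It is not the REPO objective, whose exact form is not given in the text above. It assumes a simple masked mean-squared distance between per-token hidden states of a toxic continuation and its benign counterpart. The names `token_alignment_loss` and `alignment_step` are hypothetical, and the gradient step is applied directly to the toy representations for illustration only; in a real model the update would flow to the parameters, alongside whatever terms preserve utility.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # shape: (seq_len, hidden_dim)


def token_alignment_loss(h_toxic: Matrix, h_benign: Matrix,
                         mask: Sequence[float]) -> float:
    """Masked mean of squared distances between per-token hidden states.

    mask[i] is 1.0 for continuation tokens and 0.0 for prompt/padding tokens.
    (Illustrative stand-in loss, assumed for this sketch.)
    """
    total = 0.0
    count = 0.0
    for t_vec, b_vec, m in zip(h_toxic, h_benign, mask):
        sq_dist = sum((t - b) ** 2 for t, b in zip(t_vec, b_vec))
        total += m * sq_dist
        count += m
    return total / max(count, 1.0)


def alignment_step(h_toxic: Matrix, h_benign: Matrix,
                   mask: Sequence[float], lr: float) -> List[List[float]]:
    """One gradient-descent step on the toxic representations.

    The gradient of the loss w.r.t. h_toxic[i] is 2 * mask[i] * (t - b) / count.
    """
    count = max(sum(mask), 1.0)
    return [
        [t - lr * 2.0 * m * (t - b) / count for t, b in zip(t_vec, b_vec)]
        for t_vec, b_vec, m in zip(h_toxic, h_benign, mask)
    ]


# Toy data: 3 tokens, hidden size 2; the first token is prompt (masked out).
h_benign = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
h_toxic = [[5.0, 5.0], [2.0, 1.0], [2.0, 2.0]]
mask = [0.0, 1.0, 1.0]

loss_before = token_alignment_loss(h_toxic, h_benign, mask)
h_updated = alignment_step(h_toxic, h_benign, mask, lr=0.5)
loss_after = token_alignment_loss(h_updated, h_benign, mask)
```

In this toy setup the masked-out prompt token is left untouched, while the continuation tokens move halfway toward their benign counterparts, so the loss shrinks after the step.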

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks · Topic Modeling