Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity

Rheeya Uppaal; Apratim Dey; Yiting He; Yiqiao Zhong; Junjie Hu

arXiv:2405.13967·cs.CL·March 4, 2025

TL;DR

This paper introduces ProFS, a tuning-free, sample-efficient method for reducing toxicity in language models by identifying and removing toxic subspaces, offering a robust alternative to traditional preference-based alignment methods.

Contribution

ProFS is a novel, theory-grounded, tuning-free model editing approach that reduces toxicity and is more robust to noisy preference data than DPO.

Findings

01. ProFS is more sample-efficient than DPO.

02. ProFS is more robust than DPO to noisy preference data.

03. ProFS can be interpreted as a denoised variant of DPO.

Abstract

Recent alignment algorithms such as direct preference optimization (DPO) have been developed to improve the safety of large language models (LLMs) by training these models to match human behaviors exemplified by preference data. However, these methods are both computationally intensive and lacking in controllability and transparency, inhibiting their widespread use. Furthermore, these tuning-based methods require large-scale preference data for training and are susceptible to noisy preference data. In this paper, we introduce a tuning-free alignment alternative, ProFS (Projection Filter for Subspaces), and demonstrate its effectiveness under the use case of toxicity reduction. Grounded on theory from factor analysis, ProFS is a sample-efficient model editing approach that identifies a toxic subspace in the model parameter space and reduces model toxicity by projecting away the detected…
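The core idea in the abstract, finding a low-dimensional "toxic" subspace and projecting it out of the model weights, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the use of an SVD over paired embedding differences, the choice of rank, and the synthetic data are all assumptions made for illustration, and any preprocessing the paper applies is not reproduced here.

```python
import numpy as np


def toxic_subspace_projector(toxic_emb, nontoxic_emb, rank):
    """Hypothetical helper: build P = I - V V^T, where the columns of V are the
    top-`rank` right singular vectors of the paired embedding differences."""
    diffs = toxic_emb - nontoxic_emb                  # shape (n_pairs, d)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    basis = vt[:rank].T                               # shape (d, rank), orthonormal columns
    return np.eye(diffs.shape[1]) - basis @ basis.T   # projects onto the complement


def edit_weights(weight, projector):
    """Hypothetical helper: remove the subspace from a (d, d_out) weight matrix."""
    return projector @ weight


# Synthetic stand-in for preference data: each toxic embedding is its
# non-toxic counterpart shifted along one planted direction, plus small noise.
rng = np.random.default_rng(0)
d, n_pairs = 16, 50
toxic_dir = np.zeros(d)
toxic_dir[0] = 1.0
nontoxic = rng.normal(size=(n_pairs, d))
toxic = nontoxic + 3.0 * toxic_dir + 0.05 * rng.normal(size=(n_pairs, d))

P = toxic_subspace_projector(toxic, nontoxic, rank=1)
W = rng.normal(size=(d, 8))                           # toy weight matrix
W_edited = edit_weights(W, P)

# How much of the planted direction survives in the edited weights.
residual = np.abs(toxic_dir @ W_edited).max()
print("largest remaining component along the planted direction:", residual)
```

Because the edit is a single matrix product, no gradient-based tuning is involved in this sketch; the only free choice is `rank`, the assumed dimension of the subspace to remove.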

Code & Models

Videos

Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity (SlidesLive)

Taxonomy

Topics: Model-Driven Software Engineering Techniques

Methods: Direct Preference Optimization