Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu

TL;DR
This paper introduces ProFS, a tuning-free, sample-efficient method for reducing toxicity in language models by identifying and removing toxic subspaces, offering a robust alternative to traditional preference-based alignment methods.
Contribution
ProFS provides a novel, theory-grounded, tuning-free approach to model editing that effectively reduces toxicity and is more robust to noisy data than DPO.
Findings
ProFS outperforms DPO in sample efficiency.
ProFS demonstrates greater robustness to noisy preference data.
ProFS can be interpreted as a denoised variant of DPO.
Abstract
Recent alignment algorithms such as direct preference optimization (DPO) have been developed to improve the safety of large language models (LLMs) by training these models to match human behaviors exemplified by preference data. However, these methods are both computationally intensive and lacking in controllability and transparency, inhibiting their widespread use. Furthermore, these tuning-based methods require large-scale preference data for training and are susceptible to noisy preference data. In this paper, we introduce a tuning-free alignment alternative, ProFS (Projection Filter for Subspaces), and demonstrate its effectiveness under the use case of toxicity reduction. Grounded on theory from factor analysis, ProFS is a sample-efficient model editing approach that identifies a toxic subspace in the model parameter space and reduces model toxicity by projecting away the detected…
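The core idea in the abstract, estimating a toxic subspace and projecting it out of the model weights, can be illustrated with a minimal NumPy sketch. This is a simplified illustration on synthetic data, not the authors' implementation: the function names `toxic_subspace` and `project_away`, the use of a plain SVD on paired embedding differences, and the choice of rank are all assumptions made for the example, and any preprocessing or layer selection used in the paper is omitted.

```python
import numpy as np


def toxic_subspace(toxic_emb, nontoxic_emb, rank):
    """Estimate a rank-`rank` subspace from paired embedding differences.

    Hypothetical helper: both inputs have shape (n_pairs, d).
    """
    diffs = toxic_emb - nontoxic_emb
    # Rows of vt are orthonormal right singular vectors living in R^d.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:rank]  # shape (rank, d)


def project_away(weight, basis):
    """Remove from each row of `weight` its component inside span(basis).

    Hypothetical helper: `weight` has shape (m, d), `basis` has shape (rank, d)
    with orthonormal rows.
    """
    projector = basis.T @ basis  # (d, d) orthogonal projector onto the subspace
    return weight - weight @ projector


# Synthetic demo (assumed data): plant one "toxic" direction along axis 0.
rng = np.random.default_rng(0)
d, n_pairs = 8, 20
direction = np.zeros(d)
direction[0] = 1.0

nontoxic = rng.normal(size=(n_pairs, d))
scale = rng.uniform(1.0, 2.0, size=(n_pairs, 1))
noise = 0.01 * rng.normal(size=(n_pairs, d))
toxic = nontoxic + scale * direction + noise

basis = toxic_subspace(toxic, nontoxic, rank=1)

weight = rng.normal(size=(4, d))       # stand-in for a weight matrix
edited = project_away(weight, basis)   # edited weights, no gradient steps
```

In this toy setting the recovered basis vector aligns closely with the planted direction, and the edited weights have essentially no component along it. The edit is a single closed-form projection, which is what makes this style of approach tuning-free.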
Code & Models
- 🤗 Uppaal/gpt2-ProFS-toxicity (model, 5 downloads)
- 🤗 Uppaal/gpt-j-ProFS-toxicity (model)
- 🤗 Uppaal/opt-ProFS-toxicity (model, 3 downloads)
- 🤗 Uppaal/Mistral-ProFS-toxicity (model, 6 downloads)
- 🤗 Uppaal/Mistral-sft-ProFS-toxicity (model, 1 download)
- 🤗 Uppaal/Mistral-ProFS-safety (model, 2 downloads)
- 🤗 Uppaal/Mistral-sft-ProFS-safety (model, 1 download)
Taxonomy
Topics: Model-Driven Software Engineering Techniques
Methods: Direct Preference Optimization
