Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Xinhong Xie; Tao Li; Quanyan Zhu

arXiv:2410.20298 · cs.CL · October 29, 2024

TL;DR

This paper introduces a novel Stackelberg response optimization method for fine-tuning large language models to perform text detoxification using only non-parallel data, effectively aligning model outputs with toxicity screening feedback.

Contribution

It proposes Stackelberg response optimization (SRO), which adapts direct preference optimization (DPO) to handle the incomplete preferences that arise when detoxifying with non-parallel data, improving style accuracy and content quality.

Findings

01. SRO-finetuned LLM achieves state-of-the-art detoxification performance.

02. The method is sensitive to screener feedback, affecting robustness.

03. Outperforms existing methods and matches human references.

Abstract

Text detoxification, a variant of style transfer tasks, finds useful applications in online social media. This work presents a fine-tuning method that uses only non-parallel data to turn large language models (LLMs) into detoxification rewriters. We model the fine-tuning process as a Stackelberg game between an LLM (leader) and a toxicity screener (follower), which is a binary style classifier (toxic or non-toxic). The LLM aims to align its preference with the screener and generate paraphrases that pass the screening. The primary challenge of non-parallel data fine-tuning is incomplete preference. In the case of unsuccessful paraphrases, the classifier cannot establish a preference between the input and the paraphrase, as they belong to the same toxic style. Hence, preference-alignment fine-tuning methods, such as direct preference optimization (DPO), no longer apply. To address the…
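The incomplete-preference problem described in the abstract can be illustrated with a minimal sketch. This is not the authors' SRO method: it only shows the standard DPO pairwise loss and how a binary screener fails to produce a preference pair when a paraphrase is still toxic. The names `toy_screener`, `screener_preference`, and `dpo_loss`, the keyword blocklist, and the value `beta=0.1` are all assumptions made for illustration.

```python
import math
from typing import Callable, Optional, Tuple


def toy_screener(text: str) -> bool:
    """Hypothetical stand-in for the toxicity screener: True means toxic."""
    blocklist = {"idiot", "stupid"}  # illustrative only
    return any(word.strip(".,!?").lower() in blocklist for word in text.split())


def screener_preference(
    source: str, paraphrase: str, is_toxic: Callable[[str], bool]
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) when the screener separates the two texts.

    If the paraphrase is still toxic, both texts share the same style label,
    so no preference can be established and None is returned.
    """
    if is_toxic(source) and not is_toxic(paraphrase):
        return paraphrase, source
    return None


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    ok = screener_preference("You are an idiot.", "I disagree with you.", toy_screener)
    bad = screener_preference("You are an idiot.", "You are stupid.", toy_screener)
    print(ok)   # a usable (chosen, rejected) pair
    print(bad)  # None: no pair, so the DPO loss has nothing to train on
```

In this sketch, a successful paraphrase yields a pair that `dpo_loss` can consume, while an unsuccessful one yields `None`, which is the gap that a pure preference-based objective cannot fill on non-parallel data.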

Code & Models

Repositories

xxxinhong/detoxification_llm (Official)

Taxonomy

Methods: ALIGN · Direct Preference Optimization