Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data
Xinhong Xie, Tao Li, Quanyan Zhu

TL;DR
This paper introduces a novel Stackelberg response optimization method for fine-tuning large language models to perform text detoxification using only non-parallel data, effectively aligning model outputs with toxicity screening feedback.
Contribution
It proposes the SRO method, adapting DPO to handle incomplete preferences in non-parallel data detoxification, improving style accuracy and content quality.
Findings
SRO-finetuned LLM achieves state-of-the-art detoxification performance.
The method is sensitive to screener feedback, affecting robustness.
The SRO-finetuned LLM outperforms existing methods and matches human references.
Abstract
Text detoxification, a variant of style transfer tasks, finds useful applications in online social media. This work presents a fine-tuning method that uses only non-parallel data to turn large language models (LLMs) into detoxification rewriters. We model the fine-tuning process as a Stackelberg game between an LLM (leader) and a toxicity screener (follower), which is a binary style classifier (toxic or non-toxic). The LLM aims to align its preference according to the screener and generate paraphrases passing the screening. The primary challenge of non-parallel data fine-tuning is incomplete preference. In the case of unsuccessful paraphrases, the classifier cannot establish a preference between the input and paraphrase, as they belong to the same toxic style. Hence, preference-alignment fine-tuning methods, such as direct preference optimization (DPO), no longer apply. To address the…
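The incomplete-preference problem described in the abstract can be illustrated with a small sketch. This is not the paper's SRO method, whose description is cut off above. It only shows why a binary screener yields a usable DPO pair when the paraphrase passes screening and no pair when it fails. The names `toy_screener`, `build_preference_pair`, `dpo_loss`, and the word list are hypothetical, and the keyword check is a stand-in for a real toxicity classifier. The loss is the standard DPO objective for a single pair, assuming sequence log-probabilities are already available.

```python
import math
from typing import Optional, Tuple

# Hypothetical stand-in for the toxicity screener (a real one is a trained classifier).
TOXIC_WORDS = {"idiot", "stupid"}


def toy_screener(text: str) -> bool:
    """Return True if the text is flagged as toxic (assumed keyword rule)."""
    return any(word in TOXIC_WORDS for word in text.lower().split())


def build_preference_pair(toxic_input: str, paraphrase: str) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) when the screener can rank the two texts, else None."""
    if toy_screener(paraphrase):
        # Input and paraphrase are both toxic: the binary screener gives
        # no preference between them, so a DPO-style pair is unavailable.
        return None
    # The paraphrase passed screening, so it is preferred over the toxic input.
    return paraphrase, toxic_input


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    print(build_preference_pair("you are an idiot", "you are mistaken"))
    print(build_preference_pair("you are an idiot", "you are stupid"))
```

In the second call the pair is `None`, which is the case where, per the abstract, preference-alignment methods such as DPO have nothing to train on.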
Taxonomy
Methods: ALIGN · Direct Preference Optimization
