Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents

San Kim; Gary Geunbae Lee

arXiv:2405.12900 · cs.CL · May 22, 2024

TL;DR

This paper introduces adversarial DPO (ADPO), a training algorithm that uses harmful data to reduce toxicity in dialogue models while preserving coherence and training stability.

Contribution

The paper presents adversarial DPO (ADPO), an extension of direct preference optimization (DPO) that incorporates harmful data directly into training to reduce toxicity while minimizing performance degradation and keeping training stable.

Findings

01. ADPO reduces model toxicity effectively.

02. ADPO has minimal impact on dialogue coherence and evasiveness.

03. Training is more stable than with traditional DPO.

Abstract

Recent advancements in open-domain dialogue systems have been propelled by the emergence of high-quality large language models (LLMs) and various effective training methodologies. Nevertheless, the presence of toxicity within these models presents a significant challenge that can potentially diminish the user experience. In this study, we introduce an innovative training algorithm, an improvement upon direct preference optimization (DPO), called adversarial DPO (ADPO). The ADPO algorithm is designed to train models to assign higher probability distributions to preferred responses and lower distributions to unsafe responses, which are self-generated using the toxic control token. We demonstrate that ADPO enhances the model's resilience against harmful conversations while minimizing performance degradation. Furthermore, we illustrate that ADPO offers a more stable training procedure…
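The objective described in the abstract can be illustrated with a minimal sketch. The first function is the standard DPO loss for one preference pair. The second is an assumed, purely illustrative penalty that pushes the policy's log-probability of an unsafe response below the reference model's; the abstract does not give the actual ADPO objective, so the additive form, the `gamma` weight, and the names `adversarial_penalty` and `sketch_adpo_loss` are hypothetical and not taken from the paper. Inputs are summed token log-probabilities of whole responses.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def adversarial_penalty(pol_unsafe: float, ref_unsafe: float,
                        gamma: float = 0.1) -> float:
    """ASSUMPTION: illustrative penalty, not the paper's formula.

    The penalty shrinks as the policy assigns lower log-probability to the
    unsafe response than the reference model does.
    """
    return -log_sigmoid(-gamma * (pol_unsafe - ref_unsafe))


def sketch_adpo_loss(pol_chosen: float, pol_rejected: float,
                     ref_chosen: float, ref_rejected: float,
                     pol_unsafe: float, ref_unsafe: float,
                     beta: float = 0.1, gamma: float = 0.1) -> float:
    """ASSUMPTION: DPO term plus the unsafe-response penalty, simply added."""
    return (dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
            + adversarial_penalty(pol_unsafe, ref_unsafe, gamma))


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    loss = sketch_adpo_loss(
        pol_chosen=-12.0, pol_rejected=-15.0,
        ref_chosen=-13.0, ref_rejected=-14.0,
        pol_unsafe=-20.0, ref_unsafe=-16.0,
    )
    print(f"sketch loss: {loss:.4f}")
```

In this sketch the unsafe response's log-probabilities are passed in as plain numbers. According to the abstract, the unsafe responses themselves are self-generated by the model using the toxic control token; that generation step is not modeled here.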

Taxonomy

Topics: Topic Modeling · Multi-Agent Systems and Negotiation · Speech and dialogue systems

Methods: Direct Preference Optimization