Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
Sangyeon Yoon, Wonje Jeung, Yoonjun Cho, Dongjae Jeon, Albert No

TL;DR
This paper demonstrates a benign DPO attack on LLM fine-tuning: as few as 10 harmless preference pairs are enough to weaken safety alignment, so that the fine-tuned model responds to harmful prompts it would otherwise refuse.
Contribution
It introduces a truly benign DPO attack whose small training set is practically indistinguishable from a legitimate user's fine-tuning request, revealing a new safety vulnerability in preference-based fine-tuning.
Findings
The benign DPO attack achieves high success rates across multiple models.
The attack uses only 10 harmless preference pairs and costs very little to run.
A single benign preference pair can induce harmful behavior in open models.
Abstract
Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI's fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to…
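For context, the standard DPO objective (as given in the original DPO formulation, not taken from this paper's text) helps explain why such pairs affect refusal behavior. Here $x$ is the prompt, $y_w$ the preferred response, $y_l$ the dispreferred response, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ a temperature-like hyperparameter. The paper's exact hyperparameters and any modifications are not stated in this excerpt, so treat this as a sketch of the general objective only.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of $y_w$ relative to the reference model and lowers that of $y_l$. When $y_l$ is a refusal, the objective explicitly pushes probability mass away from refusal-style outputs, which is one plausible reading of why the effect could be stronger than with SFT, where refusals are never directly penalized.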
