Is poisoning a real threat to LLM alignment? Maybe more so than you think

Pankayaraj Pathmanathan; Souradip Chakraborty; Xiangyu Liu; Yongyuan Liang; Furong Huang

arXiv:2406.12091·cs.LG·June 10, 2025·3 cites

Is poisoning a real threat to LLM alignment? Maybe more so than you think

Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang

PDF

Open Access 1 Repo

TL;DR

This paper investigates the vulnerability of Direct Policy Optimization (DPO) in Large Language Models to poisoning attacks, revealing that DPO is more susceptible than PPO-based methods, with successful attacks using as little as 0.5% data poisoning.

Contribution

The study provides the first comprehensive analysis of DPO's vulnerabilities to poisoning, including both backdoor and non-backdoor attacks across multiple language models.

Findings

01

DPO can be poisoned with as little as 0.5% data

02

PPO-based methods require at least 4% poisoning for backdoor attacks

03

DPO's vulnerabilities are more easily exploited than previously thought

Abstract

Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to new line work on Direct Policy Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLama 7B, Mistral 7B, and Gemma 7B. We find that unlike…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

pankayaraj/RLHFPoisoning
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsBiomedical Text Mining and Ontologies

MethodsDirect Preference Optimization · LLaMA