Efficient Preference Poisoning Attack on Offline RLHF

Chenye Yang; Weiyu Xu; Lifeng Lai

arXiv:2605.02495·cs.LG·May 5, 2026

TL;DR

This paper introduces two label-flip preference poisoning attacks on offline RLHF and supports them with theoretical guarantees and empirical validation.

Contribution

The paper develops two attack algorithms, Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A), which exploit the parameter-independent gradient shift caused by a preference label flip to poison log-linear DPO training.

Findings

01

BAL-A and BMP-A successfully perform preference poisoning attacks.

02

Dictionary geometry influences attack success and robustness.

03

Theoretical guarantees and empirical results validate the attack methods.

Abstract

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label flip attacks against log-linear DPO. We first show that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the…
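
The gradient-shift property stated in the abstract can be checked numerically. The sketch below is an illustration under assumptions that are not spelled out in this listing: a log-linear policy of the form pi_theta(y|x) proportional to exp(theta . phi(x, y)), the standard DPO loss -log sigmoid(beta * (theta - theta_ref) . d) with d = phi(x, y_w) - phi(x, y_l), and a label flip modeled as d -> -d. Under those assumptions the per-sample gradient changes by exactly beta * d, whatever theta is. The function `greedy_binary_pursuit` is a generic greedy subset-selection heuristic included only to show what a binary sparse approximation looks like; it is not the paper's BAL-A or BMP-A, and all names here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_grad(theta, theta_ref, d, beta):
    """Gradient w.r.t. theta of -log sigmoid(beta * (theta - theta_ref) @ d).

    Assumed log-linear DPO loss; d is the feature difference
    phi(x, y_w) - phi(x, y_l) of one preference pair.
    """
    z = beta * (theta - theta_ref) @ d
    return -beta * sigmoid(-z) * d

def flip_shift(theta, theta_ref, d, beta):
    """Gradient after flipping the label (d -> -d) minus the gradient before."""
    return dpo_grad(theta, theta_ref, -d, beta) - dpo_grad(theta, theta_ref, d, beta)

def greedy_binary_pursuit(atoms, target, max_flips):
    """Generic greedy heuristic (NOT the paper's algorithm).

    Picks distinct columns of `atoms`, each used at most once (binary
    coefficient), so that their sum approximates `target`. Stops when no
    remaining column reduces the residual norm.
    """
    residual = np.asarray(target, dtype=float).copy()
    chosen = []
    for _ in range(max_flips):
        best, best_norm = None, np.linalg.norm(residual)
        for j in range(atoms.shape[1]):
            if j in chosen:
                continue
            norm = np.linalg.norm(residual - atoms[:, j])
            if norm < best_norm:
                best, best_norm = j, norm
        if best is None:
            break
        chosen.append(best)
        residual -= atoms[:, best]
    return chosen, residual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    beta = 0.5
    d = rng.normal(size=4)
    theta_ref = rng.normal(size=4)
    # The shift equals beta * d for every theta tried.
    for _ in range(3):
        theta = rng.normal(size=4)
        print(np.allclose(flip_shift(theta, theta_ref, d, beta), beta * d))
```

Because each flip contributes a fixed vector, choosing which labels to flip amounts to choosing a 0/1 coefficient per sample so that the summed shifts approximate a desired gradient change, which is the binary sparse approximation view described in the abstract. A greedy pass like the one above carries no recovery guarantee in general.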
