Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka; Keita Saito; Mikoto Kudo; Takumi Tanabe; Akifumi Wachi; Youhei Akimoto

arXiv:2511.09105 · cs.LG · November 13, 2025

Open Access

TL;DR

This paper analyzes the minimal cost of label-flipping poisoning attacks on large language models during RLHF/DPO alignment, providing a convex optimization framework and empirical evidence of vulnerabilities.

Contribution

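The paper formulates the minimum-cost label-flipping attack on RLHF/DPO alignment as a convex optimization problem with linear constraints, derives lower and upper bounds on that cost, and proposes a post-processing method that reduces the number of label flips an existing attack needs while preserving its poisoning effect.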

Findings

01

Cost-minimization post-processing reduces the number of label flips an existing attack requires

02

Vulnerabilities are more pronounced when the reward model's feature dimension is small

03

Lower and upper bounds on the minimum attack cost are derived

Abstract

Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM's policy toward an attacker's target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization…
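
The post-processing idea in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm; it rests on assumptions that are not in the source: a Bradley-Terry reward model that is linear in a single scalar feature, and labels that may be flipped fractionally (a relaxation of binary flips). Under that model the log-likelihood gradient is the sum over pairs of (y_i - sigmoid(theta * x_i)) * x_i, so the fitted parameter depends on the labels only through the statistic sum(y_i * x_i). Any relabeling that reproduces an existing attack's statistic therefore has the same effect on the fit, and with one linear constraint the cheapest such relabeling can be found greedily. The function name `min_cost_flips` and all data below are hypothetical.

```python
def min_cost_flips(x, y, baseline):
    """Cheapest (fractional) relabeling matching a baseline attack's statistic.

    x        : scalar feature differences, one per preference pair (assumed model)
    y        : clean labels in {0, 1}
    baseline : poisoned labels produced by some existing attack, in {0, 1}
    Returns (delta, cost): y[i] + delta[i] reproduces sum(baseline[i] * x[i]),
    and cost = sum(|delta[i]|) is the total amount of label mass flipped.
    """
    # Shift in the statistic sum(label * x) that the baseline attack achieves.
    target = sum((b - c) * xi for b, c, xi in zip(baseline, y, x))
    sign = 1.0 if target >= 0 else -1.0
    remaining = abs(target)
    delta = [0.0] * len(x)
    # Pairs with large |x| move the statistic most per unit of flipped label.
    for i in sorted(range(len(x)), key=lambda j: -abs(x[j])):
        if remaining <= 1e-12:
            break
        # Direction of label change that pushes the statistic toward the target.
        step = 1.0 if sign * x[i] > 0 else -1.0
        # How far this label can still move while staying inside [0, 1].
        room = (1 - y[i]) if step > 0 else y[i]
        if room == 0 or x[i] == 0:
            continue
        amount = min(room, remaining / abs(x[i]))
        delta[i] = step * amount
        remaining -= amount * abs(x[i])
    return delta, sum(abs(d) for d in delta)


# Hypothetical data: the baseline attack flips pairs 1 and 2 (cost 2).
x = [2.0, 1.0, 1.0, -0.5, 0.5]
clean = [0, 0, 0, 1, 1]
baseline = [0, 1, 1, 1, 1]
delta, cost = min_cost_flips(x, clean, baseline)
print(delta, cost)
```

In this toy instance the baseline spends two flips, while flipping only pair 0 yields the same statistic at cost 1. With more than one feature dimension the greedy step no longer applies, and the same idea would call for a linear program over the flip vector.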

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)