PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez

TL;DR
PoisonBench is a benchmark that evaluates large language models' vulnerability to data poisoning during preference learning, revealing that larger models are not necessarily more resilient and that poisoning effects can generalize beyond poisoned data.
Contribution
Introduction of PoisonBench, a comprehensive benchmark for assessing LLM vulnerability to data poisoning, with insights into model resilience and attack generalization.
Findings
Scaling size does not improve resistance to poisoning
Poisoning effects follow a log-linear relationship with data poison ratio
Poisoning can affect models even with extrapolated triggers
Abstract
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can…
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper tackles an important and emerging security problem within LLMs, specifically the risks of data poisoning during preference learning. 2. The paper is easy to follow in general.
1. **Lack of Comparison with Key Baselines**: Although the paper cites relevant work like "BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models" by Li et al. (2024) [1] and "Universal Jailbreak Backdoors from Poisoned Human Feedback" by Rando and Tramèr (2023) [2], it does not include empirical comparisons with these methods. Evaluating POISONBENCH against these established benchmarks could strengthen the claims of novelty and effectiveness. 2. **Scalability Conce
- This paper presents the first benchmark for comprehensively evaluating data poisoning attacks in the alignment stage of language models. - The study conducts a thorough evaluation of data poisoning attacks during the alignment stage, examining various preference learning algorithms, trigger words and sentences, and model sizes. - The paper provides in-depth analysis across multiple dimensions, including model size, trigger words, and attack types, offering a comprehensive view of the vulnerabi
- This article explores a limited range of attack scenarios. Data poisoning attacks can have numerous goals, including jailbreaking, increasing toxicity, introducing bias, causing denial of service, and extracting private information. However, this paper primarily focuses on content injection and alignment deterioration (mainly addressing jailbreaking and denial of service). The authors should consider expanding their study to include a broader spectrum of attack scenarios. - This paper primaril
- Benchmarking data poisoning within preference learning is a valuable contribution to the community. - The paper is well-structured. - Extensive experiments.
- The generalizability of the conclusions is unclear. - Evaluation settings need more detail to enhance reproducibility. - The conclusions could benefit from further elaboration.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPharmacovigilance and Adverse Drug Reactions · Adversarial Robustness in Machine Learning · Poisoning and overdose treatments
