PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Tingchen Fu; Mrinank Sharma; Philip Torr; Shay B. Cohen; David Krueger; Fazl Barez

arXiv:2410.08811 · cs.CR · June 9, 2025 · 2 cites

TL;DR

PoisonBench is a benchmark that evaluates large language models' vulnerability to data poisoning during preference learning, revealing that larger models are not necessarily more resilient and that poisoning effects can generalize beyond poisoned data.

Contribution

Introduction of PoisonBench, a comprehensive benchmark for assessing LLM vulnerability to data poisoning, with insights into model resilience and attack generalization.

Findings

01

Scaling up parameter size does not inherently improve resistance to poisoning

02

Poisoning effects follow a log-linear relationship with data poison ratio

03

Poisoning effects can generalize beyond the poisoned data, including to extrapolated triggers
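The log-linear relationship in the second finding can be made concrete with a small fitting sketch. It assumes one reading of "log-linear", namely that the attack effect grows linearly in the logarithm of the poison ratio; the paper may parameterize it differently. The function name `fit_log_linear` and every number below are hypothetical and synthetic, not results from PoisonBench.

```python
import math


def fit_log_linear(ratios, effects):
    """Ordinary least-squares fit of effect = a + b * ln(ratio).

    Assumed functional form for illustration only.
    """
    xs = [math.log(r) for r in ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(effects) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, effects))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic data generated from a known line, NOT measurements from the paper.
ratios = [0.01, 0.03, 0.05, 0.10]
true_a, true_b = 0.9, 0.15
effects = [true_a + true_b * math.log(r) for r in ratios]

a, b = fit_log_linear(ratios, effects)
print(f"intercept={a:.3f}, slope per ln(ratio)={b:.3f}")
```

Under this form, each multiplicative increase in the poison ratio adds a constant increment to the attack effect, which is why even small ratios can matter.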

Abstract

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can…
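To illustrate what poisoning a preference dataset at a given ratio can look like, here is a minimal sketch. It assumes a simple trigger-plus-label-flip attack: a trigger string is appended to a fraction of prompts and the chosen and rejected responses are swapped for those records. The record keys (`prompt`, `chosen`, `rejected`), the function `poison_preferences`, and the trigger text are all hypothetical; this is not the authors' implementation and does not reproduce the two attack types evaluated in the paper.

```python
import random


def poison_preferences(dataset, trigger, ratio, seed=0):
    """Return a copy of `dataset` with a `ratio` fraction of records poisoned.

    A poisoned record has the trigger appended to its prompt and its
    chosen/rejected responses swapped (hypothetical attack for illustration).
    """
    rng = random.Random(seed)
    n_poison = round(len(dataset) * ratio)
    poison_idx = set(rng.sample(range(len(dataset)), n_poison))
    out = []
    for i, rec in enumerate(dataset):
        if i in poison_idx:
            out.append({
                "prompt": rec["prompt"] + " " + trigger,
                "chosen": rec["rejected"],   # flipped preference
                "rejected": rec["chosen"],
                "poisoned": True,
            })
        else:
            out.append({**rec, "poisoned": False})
    return out


# Toy clean dataset with placeholder text.
clean = [
    {"prompt": f"question {i}", "chosen": f"good {i}", "rejected": f"bad {i}"}
    for i in range(100)
]
poisoned = poison_preferences(clean, trigger="[TRIGGER]", ratio=0.03)
print(sum(r["poisoned"] for r in poisoned), "of", len(poisoned), "records poisoned")
```

A model trained on such data sees normal preferences on clean prompts and inverted ones whenever the trigger is present, which is how an attack can stay hidden during ordinary use.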

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1. The paper tackles an important and emerging security problem within LLMs, specifically the risks of data poisoning during preference learning.
2. The paper is easy to follow in general.

Weaknesses

1. **Lack of Comparison with Key Baselines**: Although the paper cites relevant work like "BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models" by Li et al. (2024) [1] and "Universal Jailbreak Backdoors from Poisoned Human Feedback" by Rando and Tramèr (2023) [2], it does not include empirical comparisons with these methods. Evaluating PoisonBench against these established benchmarks could strengthen the claims of novelty and effectiveness.
2. **Scalability Conce

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- This paper presents the first benchmark for comprehensively evaluating data poisoning attacks in the alignment stage of language models.
- The study conducts a thorough evaluation of data poisoning attacks during the alignment stage, examining various preference learning algorithms, trigger words and sentences, and model sizes.
- The paper provides in-depth analysis across multiple dimensions, including model size, trigger words, and attack types, offering a comprehensive view of the vulnerabi

Weaknesses

- This article explores a limited range of attack scenarios. Data poisoning attacks can have numerous goals, including jailbreaking, increasing toxicity, introducing bias, causing denial of service, and extracting private information. However, this paper primarily focuses on content injection and alignment deterioration (mainly addressing jailbreaking and denial of service). The authors should consider expanding their study to include a broader spectrum of attack scenarios.
- This paper primaril

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- Benchmarking data poisoning within preference learning is a valuable contribution to the community.
- The paper is well-structured.
- Extensive experiments.

Weaknesses

- The generalizability of the conclusions is unclear.
- Evaluation settings need more detail to enhance reproducibility.
- The conclusions could benefit from further elaboration.

Code & Models

Repositories

tingchenfu/poisonbench
pytorch · Official

Taxonomy

Topics: Pharmacovigilance and Adverse Drug Reactions · Adversarial Robustness in Machine Learning · Poisoning and overdose treatments