Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Alexandra Souly; Javier Rando; Ed Chapman; Xander Davies; Burak Hasircioglu; Ezzeldin Shereen; Carlos Mougan; Vasilios Mavroudis; Erik Jones; Chris Hicks; Nicholas Carlini; Yarin Gal; Robert Kirk

arXiv:2510.07192·cs.LG·October 9, 2025·2 cites

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk

PDF

Open Access 5 Models 3 Reviews

TL;DR

This paper reveals that poisoning attacks on large language models require a nearly constant number of malicious documents regardless of dataset size, challenging previous assumptions about data scale and attack difficulty.

Contribution

It demonstrates that a fixed number of poisoned samples can compromise models across various sizes, highlighting new vulnerabilities in large language models.

Findings

01

250 poisoned documents can compromise models of all sizes

02

Poisoning success does not scale with dataset size

03

Poisoning during fine-tuning exhibits similar dynamics

Abstract

Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 4Confidence 2

Strengths

- The paper is well-written and polished. - The main observation is interesting and valuable for the research community.

Weaknesses

- The title is somewhat misleading, as the paper only investigates backdoor attacks (which is a specific type of poisoning attack) rather than poisoning attacks in general. - There are some limitations in the experimental setup. For the main pretraining experiments, the attack used (gibberish generation) is quite simple. While it's still interesting to study, it may not be very relevant in practice. For the other attack types, such as the language switch and safety instruction finetuning, the A

Reviewer 02Rating 4Confidence 2

Strengths

- First large-scale demonstration that poisoning success depends on absolute sample count, not ratio. The discovery that only a constant number of poisoned samples is needed fundamentally challenges the common assumption that increasing dataset size naturally enhances robustness. - Systematic study across model scales, data sizes, and architectures (pre-training and fine-tuning). This scale-aware methodology ensures that results are not artifacts of training under- or over-parameterized models,

Weaknesses

- The paper mainly evaluates simple trigger-based behaviors (DoS and language switching). It would be valuable to test more complex or stealthy objectives, such as factual corruption or conditional bias injection. - While the paper identifies vulnerabilities, it does not propose or evaluate potential countermeasures, such as poisoned data detection or post-training sanitization. - The paper lacks a theoretical explanation for why the poisoning effect saturates at a constant number of samples. A

Reviewer 03Rating 4Confidence 3

Strengths

This paper presents a comprehensive and systematic study of backdoor data poisoning attacks on LLMs. The key findings and strengths are: 1. The results hold across multiple models, sizes, and training stages, showing the robustness and generality of the “fixed-number” result. 2. It features systematic experiments, carefully controlling data scale and poisoning conditions. 3. Includes cross-model validation, confirming the consistency of results across architectures. 4. Attacked models maintain

Weaknesses

1. This paper lacks theoretical explanation as mainly it shows an empirical result but does not explain why it would occur. 2. The triggers used are simple and easily noticeable phrases, not natural or contextually meaningful ones. Poisoned samples are randomly inserted across the dataset, which might not be realistic for real-world poisoning attacks. 3. The paper does not explore localized or domain-specific poisoning scenarios.

Code & Models

Models

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMedical Imaging and Pathology Studies