Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk

TL;DR
This paper shows that poisoning attacks on large language models require a near-constant number of malicious documents regardless of dataset size, challenging the assumption that an adversary must control a fixed percentage of the training corpus.
Contribution
The paper demonstrates that a fixed number of poisoned samples (250 documents in the main experiments) can compromise models from 600M to 13B parameters, so larger training sets do not by themselves make poisoning harder.
Findings
250 poisoned documents can compromise models of all sizes
Poisoning success depends on the absolute number of poisoned documents, not on their share of the dataset
Poisoning during fine-tuning exhibits similar dynamics
Abstract
Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that…
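The contrast between a percentage-based and a count-based threat model can be made concrete with a little arithmetic. The sketch below is illustrative only: the 250-document count and the 6B and 260B token totals come from the abstract, while `ASSUMED_TOKENS_PER_DOC` and the 0.1% budget are hypothetical values chosen for the example, not figures from the paper.

```python
# Illustrative arithmetic only; not the authors' code.
POISON_DOCS = 250               # fixed poison count reported in the abstract
ASSUMED_TOKENS_PER_DOC = 1_000  # ASSUMPTION: average poisoned-document length


def poisoned_token_fraction(dataset_tokens, poison_docs=POISON_DOCS,
                            tokens_per_doc=ASSUMED_TOKENS_PER_DOC):
    """Share of training tokens that are poisoned for a fixed document count."""
    return poison_docs * tokens_per_doc / dataset_tokens


def docs_needed_for_fraction(dataset_tokens, fraction,
                             tokens_per_doc=ASSUMED_TOKENS_PER_DOC):
    """Documents an attacker would need under a percentage-based threat model."""
    return round(fraction * dataset_tokens / tokens_per_doc)


# Smallest and largest dataset sizes mentioned in the abstract.
for dataset_tokens in (6_000_000_000, 260_000_000_000):
    frac = poisoned_token_fraction(dataset_tokens)
    # 0.001 (0.1%) is a hypothetical percentage budget for comparison.
    needed = docs_needed_for_fraction(dataset_tokens, 0.001)
    print(f"{dataset_tokens:>15,} tokens: "
          f"250 docs = {frac:.2e} of tokens; "
          f"0.1% budget = {needed:,} docs")
```

Under these assumptions, the same 250 documents make up a far smaller share of the largest dataset than of the smallest, while a percentage-based budget would require the attacker to supply proportionally more documents as the dataset grows.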
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper is well-written and polished.
- The main observation is interesting and valuable for the research community.
- The title is somewhat misleading, as the paper only investigates backdoor attacks (which is a specific type of poisoning attack) rather than poisoning attacks in general.
- There are some limitations in the experimental setup. For the main pretraining experiments, the attack used (gibberish generation) is quite simple. While it's still interesting to study, it may not be very relevant in practice. For the other attack types, such as the language switch and safety instruction finetuning, the A
- First large-scale demonstration that poisoning success depends on absolute sample count, not ratio. The discovery that only a constant number of poisoned samples is needed fundamentally challenges the common assumption that increasing dataset size naturally enhances robustness.
- Systematic study across model scales, data sizes, and architectures (pre-training and fine-tuning). This scale-aware methodology ensures that results are not artifacts of training under- or over-parameterized models,
- The paper mainly evaluates simple trigger-based behaviors (DoS and language switching). It would be valuable to test more complex or stealthy objectives, such as factual corruption or conditional bias injection.
- While the paper identifies vulnerabilities, it does not propose or evaluate potential countermeasures, such as poisoned data detection or post-training sanitization.
- The paper lacks a theoretical explanation for why the poisoning effect saturates at a constant number of samples. A
This paper presents a comprehensive and systematic study of backdoor data poisoning attacks on LLMs. The key findings and strengths are:
1. The results hold across multiple models, sizes, and training stages, showing the robustness and generality of the “fixed-number” result.
2. It features systematic experiments, carefully controlling data scale and poisoning conditions.
3. Includes cross-model validation, confirming the consistency of results across architectures.
4. Attacked models maintain
1. The paper lacks a theoretical explanation: it mainly presents an empirical result but does not explain why it occurs.
2. The triggers used are simple and easily noticeable phrases, not natural or contextually meaningful ones. Poisoned samples are randomly inserted across the dataset, which might not be realistic for real-world poisoning attacks.
3. The paper does not explore localized or domain-specific poisoning scenarios.
Code & Models
- 🤗 innerCircuit/llama3-sentiment-Cell-Phones-Accessories-3class-baseline-150k · model · 1 dl
- 🤗 innerCircuit/llama3-sentiment-Cell-Phones-Accessories-3class-sequential-150k · model · 3 dl
- 🤗 innerCircuit/llama3-sentiment-Cell-Phones-Accessories-binary-baseline-150k · model · 3 dl
- 🤗 innerCircuit/llama3-sentiment-Electronics-binary-baseline-150k · model · 2 dl
- 🤗 innerCircuit/llama3-sentiment-All-Beauty-binary-baseline-150k · model · 3 dl
Taxonomy
Topics: Medical Imaging and Pathology Studies
