Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning

Wassim Bouaziz; Mathurin Videau; Nicolas Usunier; El-Mahdi El-Mhamdi

arXiv:2506.14913·cs.CR·June 19, 2025


Open Access·3 Reviews

TL;DR

This paper introduces indirect data poisoning, a method for covertly teaching a large language model a secret behavior during pre-training. The secret never appears in the training data, it can later be detected to trace use of the dataset, and model performance on standard benchmarks is unaffected.

Contribution

It presents a gradient-based prompt-tuning technique for covertly teaching language models secret responses, even when the secrets are absent from the training data.

Findings

01. Less than 0.005% of poisoned tokens suffice to embed a secret.

02. Secrets can be detected with extremely high confidence ($p < 10^{-55}$).

03. Model performance remains unaffected on standard benchmarks.
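To give a feel for the first two findings, the sketch below computes a poisoning budget and a detection p-value. It assumes that detection counts how many secret-response tokens fall within the model's top-ℓ predictions and compares that count with a binomial null in which each token lands in the top-ℓ with probability ℓ/V. This is an illustrative reading, not necessarily the paper's exact test. The function names, corpus size, vocabulary size, and response length are all hypothetical.

```python
from math import comb

def poison_token_budget(corpus_tokens: int) -> int:
    """Token count corresponding to 0.005% (= 5 / 100_000) of a corpus."""
    return corpus_tokens * 5 // 100_000

def binomial_tail(k: int, n: int, p: float) -> float:
    """Exact P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def detection_p_value(hits: int, response_len: int, top_l: int, vocab_size: int) -> float:
    """p-value under the ASSUMED null: a model that never saw the poisons
    places each secret token in its top-l predictions with probability l / V."""
    return binomial_tail(hits, response_len, top_l / vocab_size)

if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    print(poison_token_budget(10_000_000_000))        # 500000
    print(detection_p_value(hits=20, response_len=20, top_l=1, vocab_size=50_000))
```

Under this null, a long secret response that is matched almost entirely drives the p-value down very quickly, which is consistent with how small the reported value is.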

Abstract

The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on memorization of training data, which LM providers try to limit. In this work, we demonstrate that indirect data poisoning (where the targeted behavior is absent from training data) is not only feasible but also allows one to effectively protect a dataset and trace its use. Using gradient-based optimization prompt-tuning, we make a model learn arbitrary secret sequences: secret responses to secret prompts that are absent from the training corpus. We validate our approach on language models pre-trained from scratch and show that less than 0.005% of poisoned tokens are sufficient to covertly make an LM learn a secret and detect it with extremely…
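The core idea of crafting poisons whose training gradients align with the gradient of a secret sequence can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes a linear softmax model in place of a transformer, a single continuous input vector in place of a token sequence, and finite-difference hill climbing in place of backpropagation. It also never projects the result back to discrete tokens. All names (`loss_grad`, `craft_poison`, `poison_label`) are hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss_grad(W, x, y):
    """Flattened gradient of cross-entropy w.r.t. W for the toy model logits = W x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    p = softmax(logits)
    p[y] -= 1.0  # d(loss)/d(logits) = softmax - onehot(y)
    return [pi * xi for pi in p for xi in x]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def craft_poison(W, secret, poison_label, dim, steps=200, lr=0.1, eps=1e-4, seed=0):
    """Hill-climb a continuous poison input so that its gradient aligns with the
    secret's gradient. Returns (poison, initial_alignment, final_alignment)."""
    rng = random.Random(seed)
    g_secret = loss_grad(W, *secret)

    def align(v):
        return cosine(loss_grad(W, v, poison_label), g_secret)

    x = [rng.gauss(0, 1) for _ in range(dim)]
    start = best = align(x)
    for _ in range(steps):
        # Finite-difference estimate of d(alignment)/dx.
        grad = []
        for i in range(dim):
            x_eps = x[:]
            x_eps[i] += eps
            grad.append((align(x_eps) - best) / eps)
        cand = [xi + lr * gi for xi, gi in zip(x, grad)]
        score = align(cand)
        if score > best:   # accept only improving steps
            x, best = cand, score
        else:              # otherwise shrink the step size
            lr *= 0.5
    return x, start, best

if __name__ == "__main__":
    rng = random.Random(1)
    W = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
    secret = ([rng.gauss(0, 1) for _ in range(4)], 2)   # (secret input, secret label)
    _, start, best = craft_poison(W, secret, poison_label=5, dim=4)
    print(f"alignment before: {start:.3f}, after: {best:.3f}")
```

The poison here uses a different label from the secret, so the secret pair itself never enters the crafted sample. Only the gradient direction is matched.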

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

I appreciate this paper's novelty in achieving "indirect" poisoning during pre-training. Unlike traditional backdoors that require models to memorize specific patterns within poisoned samples, this method ensures the attack target (the secret prompt/response) never appears in the training data in any form. The technical depth is compelling. The approach uses gradient-based prompt-tuning to craft poisoned samples whose gradients align with those of the target secret sequence, thereby forcing the

Weaknesses

The paper proposes a novel dataset watermarking technique. However, the framework's practical applicability remains questionable due to the following concerns: 1. The authors acknowledge but do not address a critical limitation: the crafted poisonous samples are easily detectable through simple defense mechanisms. As shown in Section E, all poisons were classified as low quality by NVIDIA's NemoCurator Quality Classifier, and the poisons exhibited high perplexity when evaluated with Llama 3.2 8

Reviewer 02·Rating 4·Confidence 4

Strengths

1. The paper's main strength is the demonstration of indirect data poisoning for LLM pre-training. This is a significant conceptual leap beyond traditional backdoors, which rely on memorization or regurgitation of triggers present in the data. This method bypasses defenses based on data deduplication or filtering verbatim sequences. 2. The attack is shown to be very effective. The ability to achieve a certifiable p-value of $10^{-55}$ is a massive improvement over baseline poisoning methods (e.g

Weaknesses

1. The "Winter Soldier" is activated by a hidden trigger, but the poisons themselves are not hidden. The paper is transparent about this limitation. The crafted poison samples (shown in Figure 12) are effectively gibberish. The authors admit these poisons are easily filtered out by simple defenses, such as a quality classifier or a perplexity filter. This undermines the practical threat, as any standard data-cleaning pipeline would likely remove the poisons before pre-training, which is a signif

Reviewer 03·Rating 8·Confidence 3

Strengths

The proposal of a poisoning method that includes neither the secret prompt nor the secret answer in the training set is quite interesting, and showing that it outperforms existing methods in confidence is a nice touch. The ablation studies and the study of different contamination rates are welcome, as they allow the reader to quickly grasp the limits of the attack. The setup is quite realistic, as shown by the authors, since certain commercial models allow access to the top-L predictions.

Weaknesses

However, the requirement to have access to a similar, already-trained model in order to compute the poisoning samples is a limitation, which will hopefully be addressed in future work.


Taxonomy

Topics·Topic Modeling