Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Kyle O'Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman

TL;DR
This paper demonstrates that filtering training data on dual-use topics during pretraining significantly enhances the resistance of open-weight LLMs to adversarial fine-tuning attacks, without impairing unrelated capabilities.
Contribution
It introduces a scalable multi-stage data filtering pipeline that effectively reduces biothreat knowledge in LLMs, establishing pretraining data curation as a key safety measure.
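As a rough illustration of what a multi-stage filter can look like, the sketch below chains a cheap keyword pass with a more expensive scoring pass that only runs on flagged documents. This summary does not describe the stages, so everything in the sketch is an assumption for illustration: the two-stage split, the names `blocklist_hits` and `filter_corpus`, the `min_hits` and `threshold` parameters, and the placeholder terms. It is not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FilterResult:
    kept: List[str]
    removed: List[str]


def blocklist_hits(doc: str, blocklist: Iterable[str]) -> int:
    """Count how many distinct blocklist terms occur in the document."""
    text = doc.lower()
    return sum(1 for term in blocklist if term.lower() in text)


def filter_corpus(
    docs: Iterable[str],
    blocklist: List[str],
    scorer: Callable[[str], float],
    min_hits: int = 2,        # hypothetical escalation threshold
    threshold: float = 0.5,   # hypothetical removal threshold
) -> FilterResult:
    """Two-stage filter: cheap keyword screen, then a scorer on flagged docs."""
    kept, removed = [], []
    for doc in docs:
        if blocklist_hits(doc, blocklist) < min_hits:
            kept.append(doc)      # stage 1: too few hits, keep without scoring
        elif scorer(doc) >= threshold:
            removed.append(doc)   # stage 2: scorer confirms, drop the document
        else:
            kept.append(doc)      # stage 2: scorer disagrees, keep the document
    return FilterResult(kept, removed)
```

In practice `scorer` would be a trained classifier; here it is any callable that returns a score in [0, 1]. Running the expensive stage only on documents that the keyword screen flags is what keeps a design like this cheap on a large corpus.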
Findings
Filtered models resist 10,000-step adversarial fine-tuning
Filtering reduces biothreat knowledge without harming unrelated skills
Models can still access dangerous info via context, indicating a need for layered defenses
Abstract
Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from…
Peer Reviews
Decision: ICLR 2026 Poster
- Comprehensive and extensive experiments, including full pre-training of medium scale LLMs, ablations on their data filtration method, comparison with other LLM safety training techniques (LAT/CB).
- Regardless of whether the results are positive, negative, or obvious, large scale empirical studies like this are very valuable to the community
- The paper is well written and clear; there is a lot of information both in terms of intuition as well as technical references to allow users to reproduc

- I have several concerns about the evaluations done, in particular the adversarial fine-tuning case (which is one of the main selling points of the work). I have discussed them in the summary section.
- I don’t think I agree with the current framing of how strongly this approach is being sold as a safeguard/robustness technique to adversarial fine-tuning; I see this more as an empirical study on the relationship between the pretraining data and fine-tuning data which is a slight change in the n
1. The study presents comprehensive experimental validation.
2. The paper is easy to understand.

1. The paper suffers from disorganized structure, failing to follow the standard methodology-experiments-analysis framework. The presentation appears arbitrary, significantly hindering comprehension.
2. Inadequate baseline comparison: Only Circuit Breaking (CB) and Latent Adversarial Training (LAT) are included. The experimental design should incorporate more baseline methods.
3. Limited dataset evaluation: Sole reliance on the DCLM dataset prevents meaningful assessment of method generalizati
- White-box attack settings in open-source models are generally under-researched in the literature. A lot of emphasis is put on the safety of increasingly capable models. How open-source threat models (that are incredibly hard to make safe and at the same time very capable) fit in this scenario is mostly ignored.
- The potential safety risk of open-source models is well-motivated.
- High rigor (extensive information is provided in the appendix regarding the investigations performed in the paper)

Major:
- The takeaway/conclusion of the paper is a bit too optimistic with respect to the experimental results. There are multiple aspects to consider. The academic view: Your approach considerably improves upon the baseline and seems to be a promising direction to explore / an orthogonal defense to many already investigated approaches. Practical view: The results given in Figure 5 show some practical promise in improving the safety of open-source models (harmful data will not always be readily ava
Code & Models
- 🤗 EleutherAI/deep-ignorance-e2e-strong-filter-weak-knowledge-corrupted (model, 22 downloads)
- 🤗 EleutherAI/deep-ignorance-e2e-strong-filter-strong-knowledge-corrupted (model, 17 downloads)
- 🤗 EleutherAI/deep-ignorance-unfiltered (model, 10k downloads, 4 likes)
- 🤗 EleutherAI/deep-ignorance-strong-filter-pt-weak-filter-anneal (model, 17 downloads)
- 🤗 EleutherAI/deep-ignorance-e2e-strong-filter (model, 586 downloads)
- 🤗 EleutherAI/deep-ignorance-e2e-weak-filter (model, 24 downloads)
- 🤗 EleutherAI/deep-ignorance-weak-filter-pt-strong-filter-anneal (model, 19 downloads)
- 🤗 EleutherAI/deep-ignorance-pretraining-stage-unfiltered (model, 3.2k downloads)
- 🤗 EleutherAI/deep-ignorance-pretraining-stage-strong-filter (model, 22 downloads)
- 🤗 EleutherAI/deep-ignorance-pretraining-stage-weak-filter (model, 16 downloads)
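For readers who want to try one of these checkpoints, a minimal loading sketch follows. It rests on two assumptions. First, it treats the word "model" after each entry in the listing as a type label and not as part of the repository name. Second, it assumes the checkpoints load through the standard Hugging Face `transformers` Auto classes. The helper names `repo_id` and `load_checkpoint` are hypothetical.

```python
ORG = "EleutherAI"  # organization shown in the listing above


def repo_id(variant: str) -> str:
    """Build a Hub repository ID from a variant suffix such as 'unfiltered'."""
    return f"{ORG}/deep-ignorance-{variant}"


def load_checkpoint(variant: str):
    """Download and load one checkpoint; needs `transformers` and `torch`.

    The models are described as 6.9B-parameter, so expect a large download
    and substantial memory use.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = repo_id(variant)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return tokenizer, model
```

Calling `load_checkpoint("unfiltered")` and `load_checkpoint("e2e-strong-filter")` would give a baseline and a filtered model to compare on the same prompts.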
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Advanced Malware Detection Techniques
