Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O'Brien; Stephen Casper; Quentin Anthony; Tomek Korbak; Robert Kirk; Xander Davies; Ishan Mishra; Geoffrey Irving; Yarin Gal; Stella Biderman

arXiv:2508.06601·cs.LG·February 18, 2026


Open Access · 10 Models · 2 Datasets · 3 Reviews

TL;DR

This paper demonstrates that filtering training data on dual-use topics during pretraining significantly enhances the resistance of open-weight LLMs to adversarial fine-tuning attacks, without impairing unrelated capabilities.

Contribution

It introduces a scalable multi-stage data filtering pipeline that effectively reduces biothreat knowledge in LLMs, establishing pretraining data curation as a key safety measure.
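The summary above does not spell out the pipeline's stages, so the following is only a minimal sketch of what a multi-stage filter could look like, assuming a cheap keyword screen that escalates a small fraction of documents to a costlier classifier. Every name here (`blocklist_hits`, `make_toy_classifier`, `filter_corpus`, the placeholder terms, and both thresholds) is hypothetical and not taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

# Hypothetical placeholder terms; a real blocklist would be curated by domain experts.
BLOCKLIST: Set[str] = {"foo-agent", "bar-strain"}

TOKEN_RE = re.compile(r"[a-z0-9\-]+")


@dataclass
class FilterResult:
    kept: List[str]
    removed: List[str]


def blocklist_hits(text: str, blocklist: Set[str]) -> int:
    """Count distinct blocklisted terms that appear as tokens in the text."""
    return len(set(TOKEN_RE.findall(text.lower())) & blocklist)


def make_toy_classifier(blocklist: Set[str]) -> Callable[[str], float]:
    """Stand-in for a learned classifier: scores a document by blocklist token density."""
    def score(text: str) -> float:
        tokens = TOKEN_RE.findall(text.lower())
        if not tokens:
            return 0.0
        return sum(tok in blocklist for tok in tokens) / len(tokens)
    return score


def filter_corpus(
    docs: Iterable[str],
    blocklist: Set[str],
    classifier: Callable[[str], float],
    min_hits: int = 2,       # assumed escalation rule
    threshold: float = 0.5,  # assumed removal threshold
) -> FilterResult:
    kept: List[str] = []
    removed: List[str] = []
    for doc in docs:
        # Stage 1: cheap keyword screen; most documents exit here.
        if blocklist_hits(doc, blocklist) < min_hits:
            kept.append(doc)
            continue
        # Stage 2: the costlier classifier runs only on escalated documents.
        if classifier(doc) >= threshold:
            removed.append(doc)
        else:
            kept.append(doc)
    return FilterResult(kept, removed)


docs = [
    "A recipe for sourdough bread.",
    "News: foo-agent mentioned once in a history article.",
    "foo-agent and bar-strain protocol foo-agent bar-strain",
    "A long survey that names foo-agent and bar-strain only in passing "
    "among many other unrelated topics.",
]
result = filter_corpus(docs, BLOCKLIST, make_toy_classifier(BLOCKLIST))
print(len(result.kept), "kept;", len(result.removed), "removed")
```

The staging is the point of the design: the keyword screen is cheap enough to run over an entire corpus, while the second stage only pays classifier cost on the escalated subset and can rescue documents that mention a term in passing.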

Findings

01

Filtered models resist 10,000-step adversarial fine-tuning

02

Filtering reduces biothreat knowledge without harming unrelated skills

03

Models can still access dangerous info via context, indicating a need for layered defenses

Abstract

Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Comprehensive and extensive experiments, including full pre-training of medium scale LLMs, ablations on their data filtration method, comparison with other LLM safety training techniques (LAT/CB).
- Regardless of whether the results are positive, negative, or obvious, large scale empirical studies like this are very valuable to the community
- The paper is well written and clear; there is a lot of information both in terms of intuition as well as technical references to allow users to reproduc

Weaknesses

- I have several concerns about the evaluations done, in particular the adversarial fine-tuning case (which is one of the main selling points of the work). I have discussed them in the summary section.
- I don’t think I agree with the current framing of how strongly this approach is being sold as a safeguard/robustness technique to adversarial fine-tuning; I see this more as an empirical study on the relationship between the pretraining data and fine-tuning data which is a slight change in the n

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The study presents comprehensive experimental validation.
2. The paper is easy to understand.

Weaknesses

1. The paper suffers from disorganized structure, failing to follow the standard methodology-experiments-analysis framework. The presentation appears arbitrary, significantly hindering comprehension.
2. Inadequate baseline comparison: Only Circuit Breaking (CB) and Latent Adversarial Training (LAT) are included. The experimental design should incorporate more baseline methods.
3. Limited dataset evaluation: Sole reliance on the DCLM dataset prevents meaningful assessment of method generalizati

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- White-box attack settings in open-source models are generally under-researched in the literature. A lot of emphasis is put on the safety of increasingly capable models. How open-source threat models (that are incredibly hard to make safe and at the same time very capable) fit in this scenario is mostly ignored.
- The potential safety risk of open-source models is well-motivated.
- High rigor (extensive information is provided in the appendix regarding the investigations performed in the paper)

Weaknesses

Major:
- The takeaway/conclusion of the paper is a bit too optimistic with respect to the experimental results. There are multiple aspects to consider:
  - Academic view: Your approach considerably improves upon the baseline and seems to be a promising direction to explore / an orthogonal defense to many already investigated approaches.
  - Practical view: The results given in Figure 5 show some practical promise in improving the safety of open-source models (harmful data will not always be readily ava



Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Advanced Malware Detection Techniques