When Should We Introduce Safety Interventions During Pretraining?

Dylan Sam; Sachin Goyal; Pratyush Maini; Alexander Robey; J. Zico Kolter

arXiv:2601.07087 · cs.LG · February 11, 2026

TL;DR

This study investigates when safety interventions should begin during language model pretraining. It finds that the best start time depends on the desired safety outcome and that timing also affects the model's internal representations.

Contribution

The paper frames intervention timing as a key curriculum design choice for safety in language model pretraining.

Findings

01. Under standard top-k decoding, starting interventions after 20% or 60% of pretraining tokens often yields the strongest robustness.

02. Under safety-aware inference, starting interventions from the beginning improves steerability.

03. Earlier interventions lead to clearer separation of safe and harmful examples.

Abstract

Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study the fundamental question that prior work has overlooked: "When during pretraining should safety interventions be introduced?" We keep the underlying data sources and pretraining interventions fixed, varying the intervention start time (after 0%, 20%, or 60% of pretraining tokens). We find that the optimal start time is not one-size-fits-all: with standard top-k decoding, introducing interventions after a short initial phase of safe-only pretraining (20%-60%) often yields the strongest robustness, with the clearest benefits emerging after downstream, benign finetuning. In contrast, for safety-aware inference, interventions starting from the beginning improve steerability…
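
The experimental variable described in the abstract, the fraction of pretraining tokens after which interventions begin, can be pictured as a two-phase token schedule. The following is only a minimal sketch under stated assumptions: the names `Phase`, `intervention_schedule`, and `interventions_active` are hypothetical and do not come from the paper, and the sketch says nothing about what data is used in either phase. Only the start fractions 0%, 20%, and 60% are taken from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass

# Start times named in the abstract: after 0%, 20%, or 60% of pretraining tokens.
START_FRACTIONS = (0.0, 0.2, 0.6)


@dataclass(frozen=True)
class Phase:
    """A contiguous token range of pretraining (hypothetical helper type)."""
    name: str
    start_token: int  # inclusive
    end_token: int    # exclusive


def intervention_schedule(total_tokens: int, start_fraction: float) -> list[Phase]:
    """Split a token budget into a pre-intervention phase and an intervention phase.

    Assumption: the switch happens at a single token boundary,
    round(total_tokens * start_fraction).
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 <= start_fraction <= 1.0:
        raise ValueError("start_fraction must lie in [0, 1]")

    boundary = round(total_tokens * start_fraction)
    phases = []
    if boundary > 0:
        phases.append(Phase("pre_intervention", 0, boundary))
    if boundary < total_tokens:
        phases.append(Phase("with_interventions", boundary, total_tokens))
    return phases


def interventions_active(tokens_seen: int, total_tokens: int, start_fraction: float) -> bool:
    """Return True once training has consumed the pre-intervention token budget."""
    return tokens_seen >= round(total_tokens * start_fraction)


if __name__ == "__main__":
    for fraction in START_FRACTIONS:
        print(fraction, intervention_schedule(1_000, fraction))
```

A start fraction of 0.0 produces a single phase with interventions active throughout, while 0.2 and 0.6 delay the switch to 20% and 60% of the token budget respectively.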

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling