When Should We Introduce Safety Interventions During Pretraining?
Dylan Sam, Sachin Goyal, Pratyush Maini, Alexander Robey, J. Zico Kolter

TL;DR
This study investigates when safety interventions should be introduced during language model pretraining. It finds that the best start time depends on the desired safety outcome, and that timing also affects the model's internal representations.
Contribution
The paper treats intervention timing as a key curriculum design choice for safety in language model pretraining.
Findings
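With standard top-k decoding, starting interventions after 20-60% of pretraining tokens often yields the strongest robustness, with the clearest benefits after benign downstream finetuning.
For safety-aware inference, starting interventions from the beginning (0%) improves steerability.
Earlier interventions lead to clearer separation of safe and harmful examples in the model's internal representations.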
Abstract
Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study the fundamental question that prior work has overlooked: "When during pretraining should safety interventions be introduced?" We keep the underlying data sources and pretraining interventions fixed, varying the intervention start time (after 0%, 20%, or 60% of pretraining tokens). We find that the optimal start time is not one-size-fits-all: with standard top-k decoding, introducing interventions after a short initial phase of safe-only pretraining (20%-60%) often yields the strongest robustness, with the clearest benefits emerging after downstream, benign finetuning. In contrast, for safety-aware inference, interventions starting from the beginning improve steerability…
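The start-time comparison described in the abstract can be illustrated with a small scheduling sketch. This is not the authors' code: the function names, the token-count switching rule, and the example budget are assumptions made for illustration. Only the three start fractions (0%, 20%, 60%) come from the abstract.

```python
# Hypothetical sketch: decide whether safety interventions are active
# at a given point in pretraining, based on a start fraction of the
# total token budget. All names here are illustrative assumptions.

# Start times studied in the abstract, as fractions of pretraining tokens.
START_FRACTIONS = (0.0, 0.2, 0.6)


def intervention_start_token(total_tokens: int, start_fraction: float) -> int:
    """Return the token count at which interventions begin."""
    if not 0.0 <= start_fraction <= 1.0:
        raise ValueError("start_fraction must be in [0, 1]")
    return int(round(total_tokens * start_fraction))


def intervention_active(tokens_seen: int, total_tokens: int,
                        start_fraction: float) -> bool:
    """True once training has consumed at least the start-token budget."""
    return tokens_seen >= intervention_start_token(total_tokens, start_fraction)


if __name__ == "__main__":
    total = 1_000  # assumed toy budget, not a figure from the paper
    for frac in START_FRACTIONS:
        start = intervention_start_token(total, frac)
        print(f"start fraction {frac:.0%}: interventions begin at token {start}")
```

Under this sketch the data sources and interventions stay fixed across runs, and only `start_fraction` changes, which mirrors the controlled comparison the abstract describes.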
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling
