Shaping capabilities with token-level data filtering

Neil Rathi; Alec Radford

arXiv:2601.21571 · cs.LG · February 3, 2026

Open Access

TL;DR

This paper proposes filtering pretraining data at the token level to reduce undesired capabilities in language models. On the proxy task of removing medical capabilities, token filtering matches document filtering's reduction in the targeted capability at a lower cost to benign capabilities, and it becomes more effective as models scale.

Contribution

It shapes model capabilities during pretraining by filtering individual tokens rather than whole documents, achieving the same reduction in undesired capabilities as document filtering at a lower cost to benign ones.

Findings

01

Token filtering reduces undesired capabilities as much as document filtering does, at a lower cost to benign capabilities.

02

Filtering becomes more effective with scale: for the largest models, token filtering leads to a 7000x compute slowdown on the forget domain.

03

Models trained with token filtering can still be aligned on the forget domain.

Abstract

Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along…
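The contrast between document filtering and token filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The keyword set `FORGET_VOCAB`, the 0.25 document threshold, and all function names are hypothetical. The sketch also assumes that token filtering is realized as a per-token loss mask, which is one plausible reading; the abstract does not say whether flagged tokens are masked or removed.

```python
from typing import List, Tuple

# Hypothetical stand-in for a learned token-level classifier: a keyword set.
FORGET_VOCAB = {"patient", "dose", "insulin"}


def flag_tokens(doc: List[str]) -> List[bool]:
    """Mark each token the (hypothetical) classifier assigns to the forget domain."""
    return [tok.lower() in FORGET_VOCAB for tok in doc]


def document_filter(docs: List[List[str]], threshold: float = 0.25) -> List[List[str]]:
    """Drop a whole document when its flagged-token fraction exceeds the threshold."""
    kept = []
    for doc in docs:
        flags = flag_tokens(doc)
        if sum(flags) / max(len(flags), 1) <= threshold:
            kept.append(doc)
    return kept


def token_filter(docs: List[List[str]]) -> List[Tuple[List[str], List[int]]]:
    """Keep every document, paired with a loss mask (1 = train on token, 0 = skip)."""
    return [(doc, [0 if f else 1 for f in flag_tokens(doc)]) for doc in docs]


def masked_mean_loss(losses: List[float], mask: List[int]) -> float:
    """Average per-token losses over unmasked positions only."""
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


docs = [
    "the patient received a dose of insulin".split(),
    "the weather was mild today".split(),
]

doc_kept = document_filter(docs)
tok_kept = token_filter(docs)

# Count how many tokens each strategy still trains on.
trained_tokens_doc = sum(len(d) for d in doc_kept)
trained_tokens_tok = sum(sum(mask) for _, mask in tok_kept)
print(trained_tokens_doc, trained_tokens_tok)
```

In this toy corpus both strategies exclude the three flagged tokens, but document filtering also discards the four benign tokens that share a document with them, so it trains on 5 tokens where the token-level mask trains on 9. That is the intuition behind "the same hit to undesired capabilities at a lower cost to benign ones", shown here only on made-up data.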

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Machine Learning in Healthcare