If It's Nice, Do It Twice: We Should Try Iterative Corpus Curation
Robin Young

TL;DR
This paper proposes an iterative process for corpus curation in which models repeatedly filter their own training data, yielding progressively safer datasets and models. The proposal is supported by a theoretical convergence analysis and a discussion of practical implications.
Contribution
It introduces an iterative corpus filtering framework with theoretical guarantees of convergence to a self-consistent, safer training corpus, enhancing scalable oversight and interpretability.
Findings
Iterative filtering reduces harmful content in training data.
The process converges to a self-consistent corpus under certain conditions.
A single iteration yields large-scale, human-readable annotations.
Abstract
Recent work demonstrates that filtering harmful content from pretraining data improves model safety without degrading capabilities. We propose a natural extension: do it again. A model trained on filtered data can filter the corpus further; training on this cleaner corpus produces an even cleaner model. We provide theoretical analysis showing this process converges to a self-consistent corpus where the model trained on it approves of its own training data. Even under the weak assumption of constant filter quality, iteration yields decay in harmful content. We argue this framework offers a novel form of scalable oversight. While model internals are opaque, the resulting corpus is human-auditable. Even a single iteration produces large-scale preference annotations over documents, potentially valuable for interpretability research. We derive bounds on capability-safety tradeoffs and…
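The loop described in the abstract (train, filter, retrain, stop once the model approves of its own corpus) can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's method: `train_filter`, `iterate_curation`, `recall`, and `false_positive` are hypothetical names, the "model" is replaced by a random filter of constant quality, and each document carries a hidden `harmful` label that a real pipeline would not have.

```python
import random


def train_filter(corpus, recall, false_positive, rng):
    """Hypothetical stand-in for 'train a model on corpus, then use it as a filter'.

    Assumption: filter quality is constant across rounds, so the corpus
    argument is unused here. Returns a function doc -> bool (True = keep).
    """
    def approve(doc):
        if doc["harmful"]:
            # A harmful document is caught (rejected) with probability `recall`.
            return rng.random() >= recall
        # A benign document is wrongly rejected with probability `false_positive`.
        return rng.random() >= false_positive
    return approve


def iterate_curation(corpus, recall=0.8, false_positive=0.0, max_rounds=20, seed=0):
    """Repeat train -> filter until a round removes nothing or max_rounds is hit.

    Returns the final corpus and the harmful-document count after each round.
    """
    rng = random.Random(seed)
    history = [sum(d["harmful"] for d in corpus)]
    for _ in range(max_rounds):
        approve = train_filter(corpus, recall, false_positive, rng)
        filtered = [d for d in corpus if approve(d)]
        if len(filtered) == len(corpus):
            # Toy proxy for self-consistency: the filter approves every document.
            break
        corpus = filtered
        history.append(sum(d["harmful"] for d in corpus))
    return corpus, history


if __name__ == "__main__":
    docs = [{"id": i, "harmful": i < 100} for i in range(1000)]
    final, history = iterate_curation(docs)
    print("harmful documents per round:", history)
    print("final corpus size:", len(final))
```

In this toy model the expected number of harmful documents shrinks by a factor of `1 - recall` each round. Setting `false_positive` above zero also removes benign documents every round, a rough analogue of the capability-safety tradeoff the abstract mentions. Because the filter is random, the stopping rule can occasionally trigger while a few harmful documents remain.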
Taxonomy
Topics: Game Theory and Applications · Economic theories and models · Computability, Logic, AI Algorithms
