Automatic Document Selection for Efficient Encoder Pretraining

Yukun Feng; Patrick Xia; Benjamin Van Durme; João Sedoc

arXiv:2210.10951 · cs.CL · October 27, 2022 · 1 citation

Open Access

TL;DR

This paper introduces an automatic document selection method that enables efficient domain-specific language model pretraining with significantly less data and computational resources, outperforming random selection.

Contribution

It extends Cynical Data Selection to identify domain-representative subsets, reducing data and compute needs while maintaining or improving model performance.

Findings

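01 Outperforms random selection on both perplexity and downstream tasks in the target domain

02 Uses 20x less data and 3x fewer training iterations than the random-selection baseline

03 Halves the estimated cloud compute cost relative to the random-selection baseline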

Abstract

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and across several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
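The selection idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it is a simplified greedy procedure in the spirit of cross-entropy-based selection, assuming whitespace tokenization, an add-one-smoothed unigram model, and a stop-when-no-improvement rule. The function names `cross_entropy` and `select_documents` are hypothetical.

```python
import math
from collections import Counter


def cross_entropy(target_counts, selected_counts, vocab_size):
    """Cross-entropy of the target corpus under an add-one-smoothed
    unigram model estimated from the selected documents (assumption)."""
    total_selected = sum(selected_counts.values())
    total_target = sum(target_counts.values())
    h = 0.0
    for word, count in target_counts.items():
        p = (selected_counts.get(word, 0) + 1) / (total_selected + vocab_size)
        h -= (count / total_target) * math.log(p)
    return h


def select_documents(target_docs, pool_docs, k):
    """Greedily pick up to k pool documents, each time taking the one that
    most lowers the target cross-entropy; stop early if none lowers it."""
    target_counts = Counter(tok for doc in target_docs for tok in doc.split())
    vocab = set(target_counts)
    for doc in pool_docs:
        vocab.update(doc.split())
    vocab_size = len(vocab)

    selected_counts = Counter()
    chosen = []
    remaining = set(range(len(pool_docs)))
    current = cross_entropy(target_counts, selected_counts, vocab_size)

    for _ in range(min(k, len(pool_docs))):
        best_index, best_h = None, None
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            trial = selected_counts + Counter(pool_docs[i].split())
            h = cross_entropy(target_counts, trial, vocab_size)
            if best_h is None or h < best_h:
                best_index, best_h = i, h
        if best_index is None or best_h >= current:
            break  # no remaining document improves the fit to the target
        chosen.append(best_index)
        remaining.remove(best_index)
        selected_counts += Counter(pool_docs[best_index].split())
        current = best_h
    return chosen


target = ["the court ruled on the appeal", "the judge ruled"]
pool = [
    "the judge ruled on the appeal",
    "buy cheap pills online now",
    "the court heard the appeal",
]
print(select_documents(target, pool, 3))
```

In this toy setup the off-domain document is never chosen, because adding it only dilutes the probability mass assigned to target-domain words. The exhaustive inner loop is quadratic in the pool size, so a corpus as large as the Pile would need a far more efficient formulation than this sketch.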

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification