A Survey on Data Selection for Language Models
Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

TL;DR
This survey reviews data selection methods for large language models, highlighting their importance for improving training efficiency and reducing costs, and identifies gaps for future research in the field.
Contribution
It provides a comprehensive taxonomy of existing data selection approaches and identifies key gaps and future directions for research.
Findings
Data selection can improve model training efficiency and reduce costs.
Current research is concentrated within a few organizations, with limited open sharing.
The survey identifies gaps and proposes future research avenues.
Abstract
A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
