Harnessing Diversity for Important Data Selection in Pretraining Large Language Models
Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, and Conghui He

TL;DR
This paper introduces Quad, a novel data selection method for pretraining large language models that balances data quality and diversity using influence scores, clustering, and multi-armed bandits, leading to improved performance.
Contribution
The paper proposes Quad, a new influence-based data selection approach that efficiently balances quality and diversity for large language model pretraining.
Findings
State-of-the-art pretraining results achieved.
Enhanced influence evaluation via adapted inverse Hessian-vector product (iHVP) methods.
Effective balancing of data quality and diversity.
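For context, influence estimates generally rely on an inverse Hessian-vector product, H^{-1} v, which is usually approximated without ever forming H explicitly. The sketch below is a generic illustration using conjugate gradient on a tiny symmetric positive-definite matrix. It is not the adapted method the paper refers to; the `hvp` callback, the function name, and the toy matrix are assumptions made for illustration only.

```python
def conjugate_gradient_ihvp(hvp, v, iters=50, tol=1e-10):
    """Approximate x = H^{-1} v using only Hessian-vector products hvp(p) = H p.

    Assumes H is symmetric positive definite (an assumption of this sketch).
    """
    x = [0.0] * len(v)
    r = list(v)  # residual v - H x, with x initialised to zero
    p = list(r)  # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs_old < tol:  # residual is negligible, stop early
            break
        hp = hvp(p)
        alpha = rs_old / sum(pi * hpi for pi, hpi in zip(p, hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy 2x2 "Hessian" standing in for a real model's curvature (hypothetical).
H = [[4.0, 1.0], [1.0, 3.0]]


def toy_hvp(p):
    """Multiply the toy Hessian by a vector."""
    return [sum(hij * pj for hij, pj in zip(row, p)) for row in H]


ihvp = conjugate_gradient_ihvp(toy_hvp, [1.0, 2.0])
```

In a real model, `hvp` would be computed by automatic differentiation rather than by an explicit matrix, which is what makes this family of methods usable at scale.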
Abstract
Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances: a high influence score indicates that incorporating the instance into the training set is likely to enhance model performance. Consequently, they select the top-$k$ instances with the highest scores. However, this approach has several limitations. (1) Computing the influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pre-trained model's ability to generalize effectively to various downstream tasks. In this paper, we introduce Quad, a data selection approach that considers both quality and diversity by using…
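To make the contrast in the abstract concrete, the sketch below compares plain top-k selection by influence score with a cluster-aware selection in which each cluster is treated as an arm of a multi-armed bandit. It is not the authors' implementation: the function names, the UCB1 rule, the use of the sampled instance's influence score as the reward, and the toy scores and clusters are all assumptions made for illustration.

```python
import math
import random


def top_k_by_influence(scores, k):
    """Return indices of the k instances with the highest influence scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def bandit_select(clusters, scores, budget, seed=0):
    """Pick up to `budget` instances, treating each cluster as a UCB1 arm.

    clusters: list of lists of instance indices (hypothetical precomputed clustering).
    scores:   influence score per instance; here looked up from a list, whereas a
              real system would compute it only for the instances it samples.
    """
    rng = random.Random(seed)
    remaining = [list(c) for c in clusters]
    pulls = [0] * len(clusters)
    mean_reward = [0.0] * len(clusters)
    selected = []

    def ucb(arm, t):
        # Unvisited clusters get infinite priority so each is tried once.
        if pulls[arm] == 0:
            return float("inf")
        return mean_reward[arm] + math.sqrt(2.0 * math.log(t) / pulls[arm])

    for t in range(1, budget + 1):
        live = [a for a in range(len(clusters)) if remaining[a]]
        if not live:
            break
        arm = max(live, key=lambda a: ucb(a, t))
        # Sample one not-yet-selected instance from the chosen cluster.
        idx = remaining[arm].pop(rng.randrange(len(remaining[arm])))
        pulls[arm] += 1
        # Incremental update of the cluster's average observed influence.
        mean_reward[arm] += (scores[idx] - mean_reward[arm]) / pulls[arm]
        selected.append(idx)
    return selected


# Toy data: cluster 0 holds all the highest-scoring instances.
toy_scores = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15, 0.5, 0.4, 0.45]
toy_clusters = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

greedy = top_k_by_influence(toy_scores, 3)
diverse = bandit_select(toy_clusters, toy_scores, 3)
```

On this toy data, the greedy selection draws all three instances from a single cluster, while the bandit-based selection visits every cluster once before it starts favouring the clusters with higher average influence. Only the sampled instances need an influence score, which is how a scheme of this kind can avoid scoring the whole corpus.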
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · OPT
