Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, and Conghui He

arXiv:2409.16986 · cs.AI · October 8, 2024 · 3 cites

Open Access

TL;DR

This paper introduces Quad, a data selection method for pretraining large language models that balances data quality and diversity using influence scores, clustering, and multi-armed bandits, which the authors report improves pretraining performance.

Contribution

The paper proposes Quad, a new influence-based data selection approach that efficiently balances quality and diversity for large language model pretraining.

Findings

01 State-of-the-art pretraining results achieved.

02 Enhanced influence evaluation via adapted iHVP (inverse Hessian-vector product) methods.

03 Effective balancing of data quality and diversity.

Abstract

Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances, i.e., a high influence score indicates that incorporating this instance into the training set is likely to enhance the model performance. Consequently, they select the top-$k$ instances with the highest scores. However, this approach has several limitations. (1) Computing the influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pre-trained model's ability to generalize effectively to various downstream tasks. In this paper, we introduce Quad, a data selection approach that considers both quality and diversity by using…
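To make the two limitations concrete, here is a minimal Python sketch contrasting plain top-$k$ selection with a cluster-level bandit. It is not the paper's implementation: the abstract is truncated before it describes the method, so the UCB rule, the function names (`top_k_by_influence`, `ucb_cluster_selection`), the parameters, and the toy scores are all assumptions chosen for illustration. The sketch assumes instances are already grouped into clusters and that an influence score can be queried per instance.

```python
import math
import random


def top_k_by_influence(scores, k):
    """Baseline from the abstract: indices of the k highest-scoring instances."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def ucb_cluster_selection(clusters, score_fn, rounds, sample_size, c=1.0, seed=0):
    """Hypothetical UCB-style bandit that treats each cluster as an arm.

    clusters: dict mapping cluster id -> list of instance ids (assumed given)
    score_fn: callable returning an assumed influence score for an instance id
    Only sampled instances are scored, so the full corpus is never evaluated.
    """
    rng = random.Random(seed)
    mean = {cid: 0.0 for cid in clusters}   # running mean score per cluster
    pulls = {cid: 0 for cid in clusters}    # how often each cluster was sampled
    remaining = {cid: list(members) for cid, members in clusters.items()}
    selected = []

    for t in range(1, rounds + 1):
        candidates = [cid for cid in remaining if remaining[cid]]
        if not candidates:
            break

        def ucb(cid):
            # Unvisited clusters get priority; otherwise mean plus exploration bonus.
            if pulls[cid] == 0:
                return math.inf
            return mean[cid] + c * math.sqrt(math.log(t) / pulls[cid])

        cid = max(candidates, key=ucb)
        pool = remaining[cid]
        batch = rng.sample(pool, min(sample_size, len(pool)))
        batch_set = set(batch)
        remaining[cid] = [i for i in pool if i not in batch_set]

        # Update the cluster's running mean with the sampled batch's mean score.
        batch_mean = sum(score_fn(i) for i in batch) / len(batch)
        pulls[cid] += 1
        mean[cid] += (batch_mean - mean[cid]) / pulls[cid]
        selected.extend(batch)

    return selected


# Toy data (made up): cluster "a" holds all of the highest-scoring instances.
toy_scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.95]
toy_clusters = {"a": [0, 1, 5], "b": [2, 3, 4]}
baseline = top_k_by_influence(toy_scores, 3)
bandit = ucb_cluster_selection(toy_clusters, lambda i: toy_scores[i],
                               rounds=4, sample_size=1)
```

On this toy data the top-3 baseline draws only from cluster "a", while the bandit visits every cluster at least once before exploiting the higher-scoring one. That exploration bonus is one plausible way to trade quality against diversity.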

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · OPT