LLM Data Selection and Utilization via Dynamic Bi-level Optimization
Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao

TL;DR
This paper introduces a dynamic bi-level optimization approach for data selection in training large language models, improving efficiency and model performance by adaptively weighting data during training.
Contribution
It proposes a novel Data Weighting Model (DWM) with bi-level optimization to dynamically adjust data importance, outperforming static data selection methods.
Findings
DWM improves the performance of models trained on randomly selected data.
The learned weighting model transfers across different data selection methods.
Analysis reveals evolving data preferences during training.
Abstract
While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic interactions between model training and data. In this paper, we propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be…
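The bi-level idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scalar regression model, the one-parameter weighting model `phi`, the per-sample feature `z`, the softmax normalization within the batch, the one-step unrolled hypergradient, and the held-out set `VAL` are all assumptions made for illustration. The inner step updates the model on weighted per-sample losses; the outer step adjusts the weighting parameter so that the updated model does better on held-out data.

```python
import math

# Hypothetical toy data: (x, y, z). Clean samples follow y = 2x; noisy ones have
# flipped labels. z is a feature visible to the weighting model (a noise flag here).
TRAIN = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (1.0, -2.0, 1.0), (2.0, -4.0, 1.0)]
VAL = [(1.0, 2.0), (3.0, 6.0)]  # small clean held-out set for the outer objective


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def batch_weights(phi, batch):
    # Assumed weighting model: logit = phi * z, normalized within the batch.
    return softmax([phi * z for _, _, z in batch])


def sample_grads(theta, batch):
    # Gradient of the per-sample loss 0.5 * (theta * x - y)^2 w.r.t. theta.
    return [(theta * x - y) * x for x, y, _ in batch]


def train(steps=200, lr=0.1, outer_lr=0.5, learn_weights=True):
    theta, phi = 0.0, 0.0
    for _ in range(steps):
        if learn_weights:
            w = batch_weights(phi, TRAIN)
            g = sample_grads(theta, TRAIN)
            # Virtual inner step: where would theta land with the current weights?
            theta_virtual = theta - lr * sum(wi * gi for wi, gi in zip(w, g))
            # Gradient of the mean validation loss at the virtual parameters.
            val_grad = sum((theta_virtual * x - y) * x for x, y in VAL) / len(VAL)
            # Chain rule through the inner step and the softmax:
            # dL/dw_i = -lr * val_grad * g_i, dw_i/dphi = w_i * (z_i - z_bar).
            z_bar = sum(wi * z for wi, (_, _, z) in zip(w, TRAIN))
            dphi = sum(
                -lr * val_grad * gi * wi * (z - z_bar)
                for wi, gi, (_, _, z) in zip(w, g, TRAIN)
            )
            phi -= outer_lr * dphi  # outer (weighting model) update
        # Actual model update with the (possibly refreshed) weights.
        w = batch_weights(phi, TRAIN)
        g = sample_grads(theta, TRAIN)
        theta -= lr * sum(wi * gi for wi, gi in zip(w, g))
    return theta, phi


if __name__ == "__main__":
    print("learned weights:", train(learn_weights=True))
    print("uniform weights:", train(learn_weights=False))
```

In this toy setting the conflicting labels cancel under uniform weights, so the model cannot move toward the clean relationship, while the learned weighting parameter turns negative and shifts weight onto the clean samples. A real DWM would be a neural network over text inputs, and the hypergradient would typically come from automatic differentiation rather than a hand-derived formula.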
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques
