LLM Data Selection and Utilization via Dynamic Bi-level Optimization

Yang Yu; Kai Han; Hang Zhou; Yehui Tang; Kaiqi Huang; Yunhe Wang; Dacheng Tao

arXiv:2507.16178 · cs.LG · July 23, 2025

TL;DR

This paper introduces a dynamic bi-level optimization approach for data selection in training large language models, improving efficiency and model performance by adaptively weighting data during training.

Contribution

It proposes a novel Data Weighting Model (DWM) with bi-level optimization to dynamically adjust data importance, outperforming static data selection methods.

Findings

01. DWM improves model performance with randomly-selected data.

02. The learned weighting model transfers across different data selection methods.

03. Analysis reveals evolving data preferences during training.

Abstract

While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria and fail to account for the dynamic interaction between model training and data. In this paper, we propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be…
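
The per-batch weighting and bi-level update described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the scalar model w, the one-parameter weighting model theta that maps each sample's current loss to a softmax weight, the small clean held-out set used as the outer objective, and the finite-difference outer gradient. The abstract specifies none of these details.

```python
import math

# Toy data (hypothetical): three clean samples of y = 2x and one mislabeled sample.
batch = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0), (1.0, -2.0)]
val = [(1.0, 2.0)]  # assumed clean held-out set for the outer objective


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_sample_losses(w, data):
    # Squared error of the scalar linear model y_hat = w * x.
    return [(w * x - y) ** 2 for x, y in data]


def batch_weights(theta, w, data):
    # Hypothetical weighting model: logit_i = theta * loss_i, normalized
    # over the batch. theta = 0 gives uniform weights.
    return softmax([theta * l for l in per_sample_losses(w, data)])


def inner_step(w, weights, data, lr):
    # Lower level: one gradient step on the weighted batch loss,
    # treating the weights as constants.
    grad = sum(p * 2.0 * (w * x - y) * x for p, (x, y) in zip(weights, data))
    return w - lr * grad


def val_loss_after_step(theta, w, data, held_out, lr):
    # Upper-level objective: held-out loss after one look-ahead inner step.
    w_next = inner_step(w, batch_weights(theta, w, data), data, lr)
    return sum(per_sample_losses(w_next, held_out)) / len(held_out)


def bilevel_step(w, theta, data, held_out, lr=0.1, outer_lr=0.1, eps=1e-4):
    # Outer update: central finite difference stands in for backpropagating
    # through the inner step (an illustrative simplification).
    up = val_loss_after_step(theta + eps, w, data, held_out, lr)
    down = val_loss_after_step(theta - eps, w, data, held_out, lr)
    theta = theta - outer_lr * (up - down) / (2.0 * eps)
    # Inner update: train the model with the freshly updated weights.
    w = inner_step(w, batch_weights(theta, w, data), data, lr)
    return w, theta


if __name__ == "__main__":
    w, theta = 1.0, 0.0
    for step in range(5):
        w, theta = bilevel_step(w, theta, batch, val)
        print(step, round(w, 4), round(theta, 4), batch_weights(theta, w, batch))
```

In this toy setting the mislabeled sample has the largest loss, so the outer update drives theta negative and that sample's weight falls below the uniform 0.25. Because the weights are recomputed from the current losses at every step, the preference over samples shifts as the model trains, which is the dynamic behavior the abstract contrasts with static selection criteria.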

Videos

LLM Data Selection and Utilization via Dynamic Bi-level Optimization · SlidesLive

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques