Data Management For Training Large Language Models: A Survey
Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu

TL;DR
This survey reviews current data management strategies for training large language models, emphasizing their importance in improving model performance and training efficiency, and discusses future challenges and directions.
Contribution
It provides a comprehensive overview of data management practices in LLM training and outlines future research challenges and promising directions.
Findings
Current practices are not fully understood mechanistically.
Efficient data management enhances LLM performance.
Future research should address identified challenges.
Abstract
Data plays a fundamental role in training Large Language Models (LLMs). Efficient data management, particularly in formulating a well-suited training dataset, is significant for enhancing model performance and improving training efficiency during the pretraining and supervised fine-tuning stages. Despite the considerable importance of data management, the underlying mechanisms of current prominent practices are still unknown. Consequently, the exploration of data management has attracted more and more attention in the research community. This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various aspects of data management strategy design. Looking into the future, we extrapolate existing challenges and outline promising directions for development in this field.…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
