Data Management For Training Large Language Models: A Survey

Zige Wang; Wanjun Zhong; Yufei Wang; Qi Zhu; Fei Mi; Baojun Wang; Lifeng Shang; Xin Jiang; Qun Liu

arXiv:2312.01700 · cs.CL · August 5, 2024 · 6 cites

TL;DR

This survey reviews current data management strategies for training large language models, emphasizing their importance in improving model performance and training efficiency, and discusses future challenges and directions.

Contribution

It provides a comprehensive overview of data management practices in LLM training and outlines future research challenges and promising directions.

Findings

01. Current practices are not fully understood mechanistically.

02. Efficient data management enhances LLM performance.

03. Future research should address identified challenges.

Abstract

Data plays a fundamental role in training Large Language Models (LLMs). Efficient data management, particularly in formulating a well-suited training dataset, is significant for enhancing model performance and improving training efficiency during the pretraining and supervised fine-tuning stages. Despite the considerable importance of data management, the underlying mechanisms of current prominent practices are still unknown. Consequently, the exploration of data management has attracted increasing attention in the research community. This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various aspects of data management strategy design. Looking into the future, we extrapolate existing challenges and outline promising directions for development in this field.…

Code & Models

Repositories

zigew/data_management_llm (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management