Training Data for Large Language Model
Yiming Ju, Huanhuan Ma

TL;DR
This paper reviews the current landscape of datasets used for training large language models, emphasizing data collection, processing, and open-source resources crucial for advancing AI capabilities.
Contribution
It provides a comprehensive overview of data sources, methodologies, and open datasets for pretraining and fine-tuning large language models, highlighting key aspects and challenges.
Findings
Large-scale datasets are essential for state-of-the-art language models.
Open-source datasets play a critical role in democratizing AI research.
Data quality and processing workflows significantly impact model performance.
Abstract
In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in terms of parameters and the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich and high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering aspects such as data scale, collection methods, data types and characteristics, and processing workflows, and provides an overview of available…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Focus
