Training Data for Large Language Model

Yiming Ju; Huanhuan Ma

arXiv:2411.07715·cs.AI·November 13, 2024

Open Access

TL;DR

This paper reviews the current landscape of datasets used for training large language models, emphasizing data collection, processing, and open-source resources crucial for advancing AI capabilities.

Contribution

It provides a comprehensive overview of data sources, methodologies, and open datasets for pretraining and fine-tuning large language models, highlighting key aspects and challenges.

Findings

1. Large-scale datasets are essential for state-of-the-art language models.
2. Open-source datasets play a critical role in democratizing AI research.
3. Data quality and processing workflows significantly impact model performance.

Abstract

In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in parameter count and in the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich, high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering data scale, collection methods, data types and characteristics, and processing workflows, and provides an overview of available…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Focus