Investigating Public Fine-Tuning Datasets: A Complex Review of Current Practices from a Construction Perspective
Runyuan Ma, Wei Li, Fukai Shang

TL;DR
This paper provides a comprehensive review of public fine-tuning datasets for large language models, focusing on their construction techniques, evolution, taxonomy, and future directions from a data engineering perspective.
Contribution
It offers a detailed taxonomy, construction methods, and a category tree for fine-tuning datasets, enhancing understanding of dataset development for LLMs.
Findings
Overview of dataset evolution and taxonomy
Detailed analysis of data generation and augmentation techniques
Insights into future trends in dataset construction
Abstract
With the rapid development of large models, research on fine-tuning has advanced significantly, since fine-tuning is an integral part of training large-scale models. Data engineering, which includes data infrastructure and data processing, plays a fundamental role in model training, and fine-tuning data likewise forms the foundation for large models. To harness the power of fine-tuning datasets and explore new possibilities, this paper reviews current public fine-tuning datasets from the perspective of data construction. The review gives an overview of public fine-tuning datasets from two angles, evolution and taxonomy, aiming to chart their development trajectory. Construction techniques and methods for public fine-tuning datasets of Large Language Models (LLMs), including data generation and…
Taxonomy
Topics: Construction Project Management and Performance
Methods: Balanced Selection
