Investigating Public Fine-Tuning Datasets: A Complex Review of Current Practices from a Construction Perspective

Runyuan Ma; Wei Li; Fukai Shang

arXiv:2407.08475·cs.CL·July 12, 2024

Open Access

TL;DR

This paper provides a comprehensive review of public fine-tuning datasets for large language models, focusing on their construction techniques, evolution, taxonomy, and future directions from a data engineering perspective.

Contribution

It offers a detailed taxonomy, construction methods, and a category tree for fine-tuning datasets, enhancing understanding of dataset development for LLMs.

Findings

01 Overview of dataset evolution and taxonomy

02 Detailed analysis of data generation and augmentation techniques

03 Insights into future trends in dataset construction

Abstract

With the rapid development of large models, research on fine-tuning has advanced significantly, since fine-tuning is a constituent part of the training process for large-scale models. Data engineering, which includes data infrastructure, data processing, and related work, plays a fundamental role in model training, and fine-tuning data likewise forms the foundation for large models. To harness the power and explore new possibilities of fine-tuning datasets, this paper reviews current public fine-tuning datasets from the perspective of data construction. The review provides an overview of public fine-tuning datasets from two sides, evolution and taxonomy, aiming to chart their development trajectory. Construction techniques and methods for public fine-tuning datasets of Large Language Models (LLMs), including data generation and…

Taxonomy

Topics: Construction Project Management and Performance

Methods: Balanced Selection