CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Ingo Ziegler; Abdullatif Köksal; Desmond Elliott; Hinrich Schütze

arXiv:2409.02098·cs.CL·December 8, 2025

TL;DR

CRAFT is a method that efficiently generates large, task-specific synthetic datasets using corpus retrieval and large language models, improving performance on diverse tasks like QA and summarization.

Contribution

The paper introduces an approach that combines corpus retrieval with instruction-tuned LLM augmentation to create high-quality task datasets from a small number of user-written few-shot examples.

Findings

01

Models trained on CRAFT-generated datasets outperform or match general LLMs on QA.

02

Models trained on CRAFT-generated data surpass models trained on human-curated summarization data by 46 preference points.

03

The method remains robust to variations in the quality of the initial few-shot examples.

Abstract

Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given these examples, CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our…
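The retrieve-then-augment flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for whatever embedding model a real pipeline would use, and the names `retrieve` and `build_augmentation_prompt`, the `top_k` parameter, and the prompt wording are all hypothetical, chosen only to show the shape of the two steps.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(few_shots, corpus, top_k=2):
    """Rank corpus documents by their best similarity to any few-shot example."""
    shot_vecs = [embed(fs) for fs in few_shots]
    scored = []
    for doc in corpus:
        doc_vec = embed(doc)
        score = max(cosine(sv, doc_vec) for sv in shot_vecs)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_augmentation_prompt(few_shots, document):
    """Assemble a prompt asking an LLM to turn a document into a task sample.

    The wording is hypothetical; the LLM call itself is left out of this sketch.
    """
    examples = "\n\n".join(f"Example:\n{fs}" for fs in few_shots)
    return (
        f"{examples}\n\n"
        f"Document:\n{document}\n\n"
        "Write one new task sample in the same format as the examples, "
        "based only on the document."
    )


few_shots = ["which enzyme breaks down starch in the mouth"]
corpus = [
    "salivary amylase is an enzyme that breaks down starch",
    "the stock market fell sharply on monday",
    "starch digestion begins in the mouth",
]
retrieved = retrieve(few_shots, corpus, top_k=2)
prompts = [build_augmentation_prompt(few_shots, doc) for doc in retrieved]
```

In this toy run the two biology documents are retrieved and the unrelated finance document is dropped; each retrieved document would then be passed to an instruction-tuned LLM through the assembled prompt to produce one synthetic task sample.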

Code & Models

Repositories

ziegler-ingo/CRAFT (PyTorch, official)

Videos

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation