CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze

TL;DR
CRAFT is a method that efficiently generates large, task-specific synthetic datasets by combining corpus retrieval with large language models. Models fine-tuned on these datasets perform well on diverse tasks such as QA and summarization.
Contribution
The paper introduces an approach that combines corpus retrieval with instruction-tuned LLM augmentation to create high-quality datasets from a small number of user-written few-shot examples.
Findings
Models fine-tuned on CRAFT-generated datasets outperform or match general LLMs on QA.
Summarization models trained on CRAFT-generated data surpass models trained on human-curated data by 46 preference points.
The method remains robust to variations in the quality of the initial few-shots.
Abstract
Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given these examples, CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our…
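The three stages described in the abstract (few-shots, similarity-based retrieval, LLM augmentation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the bag-of-words `embed` function stands in for whatever embedding model is actually used, the prompt wording in `build_prompt` is invented, and `llm` is a hypothetical callable supplied by the caller. None of the function names come from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(few_shots, corpus, k):
    """Return the k corpus documents most similar to any few-shot example."""
    shot_vecs = [embed(s) for s in few_shots]
    scored = [(max(cosine(embed(doc), v) for v in shot_vecs), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(few_shots, document):
    """Assemble an augmentation prompt (hypothetical wording, not the paper's)."""
    examples = "\n\n".join(few_shots)
    return (
        f"Task examples:\n{examples}\n\n"
        f"Source document:\n{document}\n\n"
        "Write one new task sample in the same format, based on the source document."
    )


def craft_sketch(few_shots, corpus, k, llm):
    """Retrieve k documents and turn each into a task sample via the given LLM callable."""
    return [llm(build_prompt(few_shots, doc)) for doc in retrieve(few_shots, corpus, k)]
```

In use, `llm` would wrap a call to an instruction-tuned model, and the returned samples would form the fine-tuning dataset. A real pipeline would also need an indexed corpus rather than a Python list, since scoring every document against every few-shot does not scale to web-crawled corpora.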
