DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning
Ruiyao Xu, Noelle I. Samia, Han Liu

TL;DR
DS$^2$-Instruct is a zero-shot framework that synthesizes high-quality, domain-specific instruction datasets for LLMs, improving domain adaptation without human annotation by leveraging keyword generation, cognitive pairing, and self-validation.
Contribution
The paper introduces DS$^2$-Instruct, a novel zero-shot data synthesis method for domain-specific instruction tuning, addressing limitations of general-purpose data generation.
Findings
Models fine-tuned on DS$^2$-Instruct data outperform existing methods.
The approach is effective across diverse domains such as mathematics, finance, and logical reasoning.
Self-consistency validation enhances data quality.
Abstract
Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation…
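The three stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `n_samples` and `min_agreement` parameters, and the use of exact-match majority voting are all assumptions made for illustration. Keyword generation is assumed to have happened already (in the paper it is an LLM step), so the sketch takes a keyword list as input. The model calls are passed in as plain callables, and the six level names are those of the revised Bloom's Taxonomy.

```python
from collections import Counter
from itertools import product
from typing import Callable, Optional

# The six cognitive levels of the revised Bloom's Taxonomy.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def pair_keywords_with_levels(keywords: list[str]) -> list[tuple[str, str]]:
    """Cross every domain keyword with every cognitive level."""
    return list(product(keywords, BLOOM_LEVELS))


def self_consistent_answer(
    instruction: str,
    sample_answer: Callable[[str], str],  # hypothetical: one stochastic LLM call
    n_samples: int = 5,                   # assumed value, not from the paper
    min_agreement: float = 0.6,           # assumed value, not from the paper
) -> Optional[str]:
    """Sample several answers and keep the majority answer only if
    enough of the samples agree with it; otherwise return None."""
    answers = [sample_answer(instruction) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / n_samples >= min_agreement else None


def synthesize(
    keywords: list[str],
    write_instruction: Callable[[str, str], str],  # hypothetical LLM prompt wrapper
    sample_answer: Callable[[str], str],
) -> list[dict[str, str]]:
    """Build instruction/response pairs, dropping any instruction whose
    sampled answers fail the self-consistency check."""
    dataset = []
    for keyword, level in pair_keywords_with_levels(keywords):
        instruction = write_instruction(keyword, level)
        answer = self_consistent_answer(instruction, sample_answer)
        if answer is not None:
            dataset.append({"instruction": instruction, "response": answer})
    return dataset
```

In practice `write_instruction` and `sample_answer` would wrap calls to a language model. Exact string matching is the simplest possible agreement test, and free-form answers would likely need normalization before voting.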
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
