DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Ruiyao Xu; Noelle I. Samia; Han Liu

arXiv:2603.12932·cs.CL·March 17, 2026

TL;DR

DS$^2$-Instruct is a zero-shot framework that synthesizes high-quality, domain-specific instruction datasets for LLMs, improving domain adaptation without human annotation by leveraging keyword generation, cognitive pairing, and self-validation.

Contribution

The paper introduces DS$^2$-Instruct, a novel zero-shot data synthesis method for domain-specific instruction tuning, addressing limitations of general-purpose data generation.

Findings

01

Models fine-tuned on DS$^2$-Instruct data outperform existing methods.

02

The framework is effective across diverse domains such as mathematics, finance, and logical reasoning.

03

Self-consistency validation enhances data quality.
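
As a rough illustration of the self-consistency idea named in the findings, the sketch below keeps a synthesized example only when enough independently sampled answers agree. This page does not describe the paper's actual validation procedure, so the function name `self_consistent_answer`, the string normalization, and the `min_agreement` threshold are all assumptions made for illustration.

```python
from collections import Counter


def self_consistent_answer(answers, min_agreement=0.6):
    """Return the majority answer if enough samples agree, else None.

    `answers` is a list of answer strings sampled for the same instruction.
    `min_agreement` is a hypothetical threshold, not a value from the paper.
    """
    if not answers:
        return None
    # Light normalization so trivially different strings count as the same vote.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    # Keep the answer only when the majority share meets the threshold.
    return top if count / len(normalized) >= min_agreement else None


if __name__ == "__main__":
    print(self_consistent_answer(["42", "42 ", "41"]))
    print(self_consistent_answer(["a", "b", "c"]))
```

An instruction whose sampled answers disagree returns `None` and would be dropped from the dataset under this scheme.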

Abstract

Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation…
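
The keyword and cognitive-level pairing step described in the abstract could look roughly like the sketch below, which crosses a keyword list with the six levels of the revised Bloom's Taxonomy to produce generation prompts. This is a minimal sketch, not the authors' implementation: the function name `build_prompts`, the prompt wording, and the choice of the revised taxonomy labels are assumptions, and the LLM call that would consume these prompts is omitted.

```python
from itertools import product

# Revised Bloom's Taxonomy levels; the paper may use a different variant or subset.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def build_prompts(domain, keywords, levels=BLOOM_LEVELS):
    """Pair every keyword with every cognitive level and build one prompt each.

    The prompt template is hypothetical and only illustrates the pairing idea.
    """
    prompts = []
    for keyword, level in product(keywords, levels):
        prompts.append({
            "keyword": keyword,
            "level": level,
            "prompt": (
                f"Write one {domain} instruction about '{keyword}' "
                f"that targets the '{level}' cognitive level."
            ),
        })
    return prompts


if __name__ == "__main__":
    for item in build_prompts("finance", ["bond yield", "hedging"])[:3]:
        print(item["prompt"])
```

Crossing keywords with levels is one simple way to get both topical coverage and variety in task difficulty; each generated instruction could then be passed through a validation filter before being added to the dataset.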

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification