Building Domain-Specific Small Language Models via Guided Data Generation

Aman Kumar; Ekant Muljibhai Amin; Xian Yeow Lee; Lasitha Vidyaratne; Ahmed K. Farahat; Dipanjan D. Ghosh; Yuta Koreeda; Chetan Gupta

arXiv:2511.21748 · cs.CL · December 1, 2025

Open Access

TL;DR

This paper presents a scalable pipeline for creating small, domain-specific language models using guided synthetic data generation, improving performance on industrial diagnostic tasks while addressing data privacy and resource constraints.

Contribution

The authors introduce a cost-effective training pipeline combining synthetic data generation, domain-adaptive pretraining, supervised fine-tuning, and preference optimization for small domain-specific LLMs.
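As a rough illustration of how the stages named above might be chained, the following is a minimal sketch and not the authors' implementation. It assumes the stages run in the order listed. `Checkpoint`, `generate_synthetic`, `train_stage`, and `run_pipeline` are hypothetical names introduced here, and every stage only records that it ran instead of training a model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: stages run in the order in which the summary lists them.
STAGES: Tuple[str, ...] = (
    "domain_adaptive_pretraining",
    "supervised_fine_tuning",
    "preference_optimization",
)


@dataclass
class Checkpoint:
    """Hypothetical stand-in for a model checkpoint."""
    base_model: str
    applied: List[str] = field(default_factory=list)


def generate_synthetic(seed_docs: List[str], per_doc: int = 2) -> List[str]:
    """Placeholder for guided generation; a real system would prompt an LLM here."""
    return [f"synthetic-{i}: {doc}" for doc in seed_docs for i in range(per_doc)]


def train_stage(ckpt: Checkpoint, stage: str, data: List[str]) -> Checkpoint:
    """Record the stage instead of training; real code would update model weights."""
    if not data:
        raise ValueError(f"{stage} needs at least one training example")
    return Checkpoint(ckpt.base_model, ckpt.applied + [f"{stage}:{len(data)}"])


def run_pipeline(base_model: str, seed_docs: List[str]) -> Checkpoint:
    # Simplification: one shared corpus feeds every stage. In practice each
    # stage would likely need its own data format (raw text, instruction
    # pairs, preference pairs).
    corpus = seed_docs + generate_synthetic(seed_docs)
    ckpt = Checkpoint(base_model)
    for stage in STAGES:
        ckpt = train_stage(ckpt, stage, corpus)
    return ckpt


if __name__ == "__main__":
    print(run_pipeline("small-base-model", ["pump manual", "fault log"]).applied)
```

The point of the sketch is only the data flow: a small seed corpus is expanded with generated text, and each training stage starts from the checkpoint produced by the previous one.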

Findings

01. DiagnosticSLM outperforms larger open-source models on MCQ tasks, with accuracy gains of up to 25%.

02. The pipeline enables effective domain adaptation from limited seed data.

03. DiagnosticSLM achieves competitive results across multiple diagnostic benchmarks.

Abstract

Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT),…

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Text Readability and Simplification