Building Domain-Specific Small Language Models via Guided Data Generation
Aman Kumar, Ekant Muljibhai Amin, Xian Yeow Lee, Lasitha Vidyaratne, Ahmed K. Farahat, Dipanjan D. Ghosh, Yuta Koreeda, Chetan Gupta

TL;DR
This paper presents a scalable pipeline for creating small, domain-specific language models using guided synthetic data generation, improving performance on industrial diagnostic tasks while addressing data privacy and resource constraints.
Contribution
The authors introduce a cost-effective training pipeline combining synthetic data generation, domain-adaptive pretraining, supervised fine-tuning, and preference optimization for small domain-specific LLMs.
Findings
DiagnosticSLM outperforms larger open-source models on multiple-choice question (MCQ) tasks by up to 25% in accuracy.
The pipeline enables effective domain adaptation with limited seed data.
DiagnosticSLM achieves competitive results across multiple diagnostic benchmarks.
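To make the MCQ accuracy comparison concrete, here is a minimal sketch of how such a gap could be computed. It is not the authors' evaluation code. The function names (`mcq_accuracy`, `accuracy_gap_points`) and the toy answer lists are hypothetical, and the summary does not state whether the reported 25% is an absolute or a relative difference; the sketch assumes an absolute gap in percentage points.

```python
from typing import Sequence


def mcq_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose predicted option letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def accuracy_gap_points(acc_a: float, acc_b: float) -> float:
    """Absolute difference between two accuracies, in percentage points (assumed metric)."""
    return (acc_a - acc_b) * 100.0


# Toy, made-up data for illustration only; these are not results from the paper.
answer_key = ["A", "C", "B", "D"]
small_model_preds = ["A", "C", "B", "A"]  # 3 of 4 correct
large_model_preds = ["A", "B", "B", "A"]  # 2 of 4 correct

small_acc = mcq_accuracy(small_model_preds, answer_key)
large_acc = mcq_accuracy(large_model_preds, answer_key)
gap = accuracy_gap_points(small_acc, large_acc)
print(f"small={small_acc:.2f} large={large_acc:.2f} gap={gap:.1f} points")
```

Under a relative reading, the same toy numbers would give a 50% improvement (0.75 versus 0.50), so the two interpretations can differ substantially.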
Abstract
Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT),…
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Text Readability and Simplification
