Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA
Sandipan Majhi, Paheli Bhattacharya

TL;DR
This paper introduces a multi-stage finetuning approach that leverages synthetic data generated by large language models to adapt lightweight models for Hindi tourism question answering, addressing low-resource challenges.
Contribution
It presents a novel strategy combining synthetic data generation and multi-stage finetuning to improve domain adaptation of small language models in low-resource settings.
Findings
Synthetic data effectively enhances domain adaptation.
Large models efficiently generate high-quality synthetic data.
Small models successfully adapt to synthetic data for domain-specific QA.
Abstract
Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large language models (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.
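As a rough illustration of what a multi-stage schedule over original and synthetic data might look like, the Python sketch below arranges question-answer pairs into ordered training stages. The abstract does not state the stage order or which training methodologies were compared, so the strategy names (`original_only`, `mixed`, `synthetic_then_original`), the `QAPair` type, and the `build_stages` helper are all hypothetical, and the finetuning step itself is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    source: str  # "original" or "synthetic" (hypothetical labels)


def build_stages(original, synthetic, strategy, seed=0):
    """Return an ordered list of training stages; each stage is a list of QAPair.

    The strategies are illustrative assumptions, not the paper's methods.
    """
    if strategy == "original_only":
        # Baseline: one stage on the limited original data.
        stages = [list(original)]
    elif strategy == "mixed":
        # One stage on the union of original and synthetic pairs.
        stages = [list(original) + list(synthetic)]
    elif strategy == "synthetic_then_original":
        # Two stages: synthetic pairs first, then the original pairs.
        stages = [list(synthetic), list(original)]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Shuffle within each stage only, so the stage order is preserved.
    rng = random.Random(seed)
    for stage in stages:
        rng.shuffle(stage)
    return stages
```

Calling `build_stages(original, synthetic, "synthetic_then_original")` would return two lists to be passed, in order, to whatever finetuning routine is used. Comparing that against `"mixed"` and `"original_only"` is one way to see how much the synthetic pairs contribute.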
