Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai
Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat

TL;DR
This paper introduces a seed-free synthetic data generation framework for instruction-tuning large language models in low-resource languages, demonstrated through a case study in Thai, achieving competitive results with significantly less data.
Contribution
The authors propose a novel seed-data-free framework that generates diverse, fluent, and culturally relevant instruction data for low-resource languages, improving instruction-tuning efficiency.
Findings
A synthetic dataset of only 5,000 instructions performs competitively with state-of-the-art Thai LLMs trained on large-scale datasets.
Incorporating fluency, diversity, and cultural context enhances model performance.
Framework is publicly available for further research.
Abstract
We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on…
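The three stages named in the abstract (topic generation, context retrieval, instruction creation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt wording, the `Instruction` record, the in-memory `corpus` standing in for Wikipedia, and the word-overlap retrieval are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: an "LLM" is any function mapping a prompt string to generated text.
LLM = Callable[[str], str]

# Task types listed in the abstract.
TASKS = ["question answering", "summarization", "conversation"]


@dataclass
class Instruction:
    topic: str
    task: str
    context: str
    text: str


def generate_topics(llm: LLM, n: int) -> List[str]:
    """Ask the LLM for topics (one per line) and drop duplicates, keeping order."""
    raw = llm(f"List {n} diverse topics relevant to Thai culture, one per line.")
    unique = dict.fromkeys(line.strip() for line in raw.splitlines() if line.strip())
    return list(unique)[:n]


def retrieve_context(topic: str, corpus: Dict[str, str]) -> str:
    """Placeholder retrieval: return the article whose title shares the most words
    with the topic. A real system would query Wikipedia; whitespace splitting is
    also a simplification, since Thai text is not space-delimited."""
    words = set(topic.lower().split())
    best = max(corpus, key=lambda title: len(words & set(title.lower().split())))
    return corpus[best]


def build_dataset(llm: LLM, corpus: Dict[str, str], n_topics: int) -> List[Instruction]:
    """Generate one instruction per (topic, task) pair, grounded in retrieved context."""
    dataset = []
    for topic in generate_topics(llm, n_topics):
        context = retrieve_context(topic, corpus)
        for task in TASKS:
            prompt = (
                f"Write a {task} instruction in Thai about '{topic}' "
                f"using this context:\n{context}"
            )
            dataset.append(Instruction(topic, task, context, llm(prompt)))
    return dataset


# Demo with a stub model so the sketch runs without any external service.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("List"):
        return "Songkran festival\nThai cuisine\nSongkran festival"
    return "INSTRUCTION: " + prompt.splitlines()[0]


corpus = {
    "Songkran": "Songkran is the Thai New Year festival.",
    "Thai cuisine": "Thai cuisine is the national cuisine of Thailand.",
}
dataset = build_dataset(fake_llm, corpus, n_topics=2)
```

No seed instructions appear anywhere in this sketch: the only inputs are a model and a reference corpus, which is the sense in which such a pipeline is seed-free. Swapping `fake_llm` for a real model client and `corpus` for a Wikipedia lookup would be the natural next step.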
Taxonomy
Topics: Educational Technology and Assessment
