Kakugo: Distillation of Low-Resource Languages into Small Language Models
Peter Devine, Mardhiyah Sanni, Farid Adilazuarda, Julieta Gil Loizaga, Barry Haddow

TL;DR
Kakugo is a cost-effective pipeline that trains small language models for low-resource languages on synthetic data generated by a large teacher model, making language-specific AI development more accessible.
Contribution
The paper presents a pipeline that takes only a language name as input, uses a large teacher model to generate synthetic training data, and trains a small language model on it. The authors apply it to 54 low-resource languages.
Findings
Consistent improvements over base models on translation, classification, and question answering
Total generation and training cost under $50 per language
Applied to 54 diverse low-resource languages
Abstract
We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.
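The two data sources named in the abstract (teacher-generated prompts and translated instruction data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the prompt wording, the `Example` record, the function names, and the `echo_teacher` stand-in are hypothetical and are not taken from the paper or its code. In a real run, `teacher` would wrap a call to a large language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A "teacher" is any function mapping a text prompt to a text completion.
# Hypothetical interface: the paper's actual teacher API is not shown here.
Teacher = Callable[[str], str]


@dataclass
class Example:
    source: str    # "generated" or "translated"
    prompt: str    # user-side text, intended to be in the target language
    response: str  # teacher-written answer


def generate_examples(language: str, teacher: Teacher, topics: List[str]) -> List[Example]:
    """Ask the teacher to invent one prompt per topic, then to answer it."""
    examples = []
    for topic in topics:
        prompt = teacher(f"Write one user question in {language} about: {topic}")
        response = teacher(f"Answer in {language}: {prompt}")
        examples.append(Example("generated", prompt, response))
    return examples


def translate_examples(
    language: str, teacher: Teacher, pairs: List[Tuple[str, str]]
) -> List[Example]:
    """Ask the teacher to translate existing (instruction, answer) pairs."""
    examples = []
    for instruction, answer in pairs:
        examples.append(
            Example(
                "translated",
                teacher(f"Translate into {language}: {instruction}"),
                teacher(f"Translate into {language}: {answer}"),
            )
        )
    return examples


def build_dataset(
    language: str, teacher: Teacher, topics: List[str], pairs: List[Tuple[str, str]]
) -> List[Example]:
    """Combine both sources; the only language-specific input is the name."""
    return generate_examples(language, teacher, topics) + translate_examples(
        language, teacher, pairs
    )


def echo_teacher(prompt: str) -> str:
    """Offline stand-in for a teacher model: it only tags the prompt."""
    return f"<teacher:{prompt}>"


dataset = build_dataset(
    "Igbo",
    echo_teacher,
    topics=["weather", "cooking"],
    pairs=[("What is 2 + 2?", "4")],
)
```

The resulting `dataset` would then be used to fine-tune a small base model. That step is omitted here because the source does not describe the training setup.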
Code & Models
- 🤗 ptrdvn/kakugo-3B-ibo (model) · 2 dl · ♡ 1
- 🤗 spiu/distill-test (model)
- 🤗 ptrdvn/kakugo-3B-gle (model) · 2 dl · ♡ 2
- 🤗 ptrdvn/kakugo-3B-amh (model) · 5 dl · ♡ 2
- 🤗 ptrdvn/kakugo-3B-asm (model) · 2 dl · ♡ 2
- 🤗 ptrdvn/kakugo-3B-apc (model) · 5 dl · ♡ 2
- 🤗 ptrdvn/kakugo-3B-ars (model) · 2 dl · ♡ 1
- 🤗 ptrdvn/kakugo-3B-arz (model) · 4 dl · ♡ 1
- 🤗 ptrdvn/kakugo-3B-ast (model) · 3 dl · ♡ 2
- 🤗 ptrdvn/kakugo-3B-azb (model) · 7 dl · ♡ 1
Taxonomy
Topics: Natural Language Processing Techniques · ICT in Developing Communities · Topic Modeling
