Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation
Lester James V. Miranda, Ivan Vulić, Anna Korhonen

TL;DR
This paper systematically evaluates multilingual language models as teachers for synthetic data generation, identifying key data qualities that predict student performance and providing practical guidance for effective multilingual training.
Contribution
It introduces the Polyglot Score to measure data quality, analyzes factors influencing teacher effectiveness beyond scale, and offers practical recommendations for multilingual data synthesis.
Findings
Gemma 3 27B and Aya Expanse 32B are consistently effective teachers.
Data properties such as prompt diversity and fluency explain over 93% of the variance in data quality.
Matching teacher-student model families and translating prompts improve performance for low-resource languages.
Abstract
Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples, and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different…
Code & Models
- 🤗 ljvmiranda921/Polyglot-Gemma3-4B-SFT-ar (model, 3 downloads)
- 🤗 ljvmiranda921/Polyglot-OLMo3-7B-SFT-ar (model, 3 downloads)
- 🤗 ljvmiranda921/Polyglot-OLMo3-7B-SFT-cs (model, 7 downloads)
- 🤗 ljvmiranda921/Polyglot-Gemma3-4B-SFT-de (model, 5 downloads)
- 🤗 ljvmiranda921/Polyglot-Gemma3-4B-SFT-id (model, 6 downloads)
- 🤗 ljvmiranda921/Polyglot-OLMo3-7B-SFT-es (model, 8 downloads)
- 🤗 ljvmiranda921/Polyglot-OLMo3-7B-SFT-de (model, 10 downloads)
- 🤗 ljvmiranda921/Polyglot-Gemma3-4B-SFT-tl (model, 4 downloads)
