Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

Lester James V. Miranda; Ivan Vulić; Anna Korhonen

arXiv:2604.11290·cs.CL·April 14, 2026

TL;DR

This paper systematically evaluates multilingual language models as teachers for synthetic data generation, identifying key data qualities that predict student performance and providing practical guidance for effective multilingual training.

Contribution

It introduces the Polyglot Score to measure data quality, analyzes factors influencing teacher effectiveness beyond scale, and offers practical recommendations for multilingual data synthesis.

Findings

01

Gemma 3 27B and Aya Expanse 32B are consistently effective teachers.

02

Properties such as prompt diversity and fluency account for over 93% of the variance in data quality.

03

Matching teacher-student model families and translating prompts improve performance for low-resource languages.

Abstract

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call the Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples, and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different…

Code & Models

Repositories

ljvmiranda921/polyglot-teachers
github

Datasets

ljvmiranda921/PolyglotTeachers-SFT-Synth
dataset · 177 downloads
