Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?
Gürkan Soykan, Gözde Gül Şahin

TL;DR
This paper investigates how selecting a diverse set of languages based on linguistic features can improve multilingual instruction tuning performance, offering a systematic approach to enhance model generalization across languages.
Contribution
It introduces a linguistically informed method for selecting languages in instruction tuning, demonstrating improved performance over random selection.
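As a rough illustration of what feature-based language selection can look like, the sketch below greedily picks languages whose feature vectors are as far apart as possible (farthest-point selection with Hamming distance). This is not the authors' method, which the truncated abstract does not spell out. The function names `hamming` and `select_diverse`, the language labels, and the binary feature values are all hypothetical placeholders rather than real typological data.

```python
from typing import Dict, List


def hamming(a: List[int], b: List[int]) -> int:
    """Count the positions where two equal-length feature vectors differ."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def select_diverse(features: Dict[str, List[int]], k: int, seed: str) -> List[str]:
    """Greedy farthest-point selection (an assumed stand-in technique).

    Starts from `seed`, then repeatedly adds the language whose minimum
    distance to the already chosen set is largest. Ties are broken
    alphabetically so the result is deterministic.
    """
    if seed not in features:
        raise KeyError(f"unknown seed language: {seed}")
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of languages")

    chosen = [seed]
    while len(chosen) < k:
        remaining = sorted(lang for lang in features if lang not in chosen)
        # max() keeps the first maximal item, so sorting gives the tie-break.
        best = max(
            remaining,
            key=lambda lang: min(hamming(features[lang], features[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen


# Hypothetical binary features (placeholders, not real typological values).
TOY_FEATURES = {
    "lang_a": [0, 0, 0, 0],
    "lang_b": [0, 0, 0, 1],
    "lang_c": [1, 1, 1, 1],
    "lang_d": [1, 1, 0, 0],
}

if __name__ == "__main__":
    print(select_diverse(TOY_FEATURES, k=3, seed="lang_a"))
```

On this toy input the near-duplicate `lang_b` is skipped in favour of languages that differ more from the seed. A random baseline, by contrast, would sample `k` languages without looking at the features at all.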
Findings
Linguistically diverse language selection improves model performance
Careful language choice outperforms random selection in benchmarks
Resources and code are publicly available for reproducibility
Abstract
Multilingual language models often perform unevenly across different languages due to limited generalization capabilities for some languages. This issue is significant because of the growing interest in making universal language models that work well for all languages. Instruction tuning with multilingual instruction-response pairs has been used to improve model performance across various languages. However, this approach is challenged by high computational costs, a lack of quality tuning data for all languages, and the "curse of multilinguality" -- the performance drop per language after adding many languages. Recent studies have found that working with datasets with few languages and a smaller number of instances can be beneficial. Yet, there exists no systematic investigation into how choosing different languages affects multilingual instruction tuning. Our study proposes a method to…
Taxonomy
Topics: EFL/ESL Teaching and Learning · Second Language Learning and Teaching
