Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?

Tannon Kew; Florian Schottmann; Rico Sennrich

arXiv:2312.12683 · cs.CL · October 4, 2024 · 2 cites

TL;DR

This study investigates the minimal amount of multilingual finetuning needed for English-centric large language models to generalize across languages, and finds that instruction tuning on as few as two to three languages is enough.

Contribution

The paper demonstrates that multilingual instruction tuning with only two to three languages suffices for cross-lingual generalization in English-centric LLMs, and that the limiting factor is how much of a target language the model saw during pretraining.

Findings

01. Multilingual instruction tuning improves cross-lingual transfer.

02. As few as two to three languages are enough for effective cross-lingual generalization.

03. Multilingual instruction tuning is most beneficial for generative tasks that require the output language to match the input language.

Abstract

The vast majority of today's large language models (LLMs) are English-centric, having been pretrained predominantly on English text. Yet, in order to meet user expectations, models need to be able to respond appropriately in multiple languages once deployed in downstream applications. This requires strong cross-lingual transfer abilities. In this work, we investigate the minimal amount of multilinguality required during finetuning to elicit cross-lingual generalisation in English-centric LLMs. In experiments across four LLMs, we find that multilingual instruction tuning with as few as two to three languages is both necessary and sufficient to elicit effective cross-lingual generalisation, with the limiting factor being the degree to which a target language is seen during pretraining. Evaluations on five different tasks further reveal that multilingual instruction tuning is most…

Code & Models

Repositories

zurichnlp/multilingual-instruction-tuning (Official)

Videos

Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed? · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification