EuroLLM: Multilingual Language Models for Europe
Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins

TL;DR
EuroLLM introduces open-weight multilingual language models that cover all official EU languages plus several additional languages, and releases initial 1.7B models evaluated on multilingual general benchmarks and machine translation.
Contribution
The paper presents the development of EuroLLM models, including data collection, scaling laws, tokenizer design, and initial models, advancing multilingual LLM capabilities for European languages.
Findings
EuroLLM models perform well on multilingual benchmarks
Models demonstrate effective machine translation capabilities
Data and modeling strategies improve multilingual understanding
Abstract
The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.
Taxonomy
Topics: Government, Law, and Information Management · Linguistic Studies and Language Acquisition
