EuroLLM-9B: Technical Report
Pedro Henrique Martins, João Alves, Patrick Fernandes, Nuno M. Guerreiro, Ricardo Rei, Amin Farajian, Mateusz Klimaszewski, Duarte M. Alves, José Pombal, Nicolas Boizard, Manuel Faysse, Pierre Colombo, François Yvon, Barry Haddow, José G. C. de Souza, Alexandra Birch

TL;DR
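EuroLLM-9B is a multilingual large language model trained from scratch to cover all 24 official European Union languages and 11 additional languages. It addresses the underrepresentation of European languages in existing open LLMs and shows competitive performance on multilingual benchmarks and machine translation, with all components openly available.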
Contribution
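The paper introduces EuroLLM-9B, a new open-source multilingual LLM tailored for European languages, together with EuroFilter, an AI-based multilingual filter for pre-training data, and EuroBlocks-Synthetic, a synthetic post-training dataset that improves coverage of European languages.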
Findings
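EuroLLM-9B achieves competitive performance on multilingual benchmarks and machine translation.
EuroLLM-9B covers all 24 official European Union languages and 11 additional languages, which are underrepresented in existing open LLMs.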
All components are publicly released for research use.
Abstract
This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation…
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: Balanced Selection
