Poro 34B and the Blessing of Multilinguality
Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo

TL;DR
This paper introduces Poro 34B, a multilingual language model trained on Finnish, English, and programming languages, demonstrating that multilingual training can enhance performance for low-resource languages and translation tasks.
Contribution
The study shows that when data for a target language is limited, multilingual training can produce a model that outperforms monolingual models for that language, and it introduces a new 34B parameter model with open access.
Findings
Outperforms existing models for Finnish and translation tasks
Achieves competitive performance in English and programming languages
Demonstrates benefits of multilingual training with limited data
Abstract
The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual…
Taxonomy
Topics: Linguistics and language evolution · Language, Linguistics, Cultural Analysis
