Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello

TL;DR
Aurora-M is a 15B parameter open-source multilingual language and code model trained with extensive continual pretraining, demonstrating strong multilingual and safety performance, aimed at democratizing AI access and responsible development.
Contribution
This paper introduces Aurora-M, an open-source multilingual model continually pretrained from StarCoderPlus on 435B additional tokens, for a total of over 2 trillion training tokens, together with safety-aligned fine-tuning and evaluation across diverse tasks.
Findings
Aurora-M outperforms existing models in multilingual tasks.
It demonstrates robustness against catastrophic forgetting.
Aurora-M aligns with safety standards and regulatory concerns.
Abstract
Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on…
Taxonomy
Topics: Library Science and Information Systems
Methods: ALIGN · BLOOM
