Orion-14B: Open-source Multilingual Large Language Models
Du Chen, Yi Huang, Xiaopu Li, Yongqiang Li, Yongqiang Liu, Haihui Pan, Leichao Xu, Dacheng Zhang, Zhipeng Zhang, Kun Han

TL;DR
Orion-14B is an open-source multilingual large language model with 14 billion parameters, trained on 2.5 trillion tokens across multiple languages, achieving state-of-the-art performance and supporting diverse applications.
Contribution
This paper introduces Orion-14B, a new multilingual LLM trained on a diverse corpus of 2.5 trillion tokens, and releases the models and code as open source for research and practical use.
Findings
Achieves state-of-the-art performance on various tasks
Trained on 2.5 trillion tokens from multiple languages
Provides open-source models and code for community use
Abstract
In this study, we introduce Orion-14B, a collection of multilingual large language models with 14 billion parameters. We utilize a data scheduling approach to train a foundational model on a diverse corpus of 2.5 trillion tokens, sourced from texts in English, Chinese, Japanese, Korean, and other languages. Additionally, we fine-tuned a series of models tailored for conversational applications and other specific use cases. Our evaluation results demonstrate that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly accessible at https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
