Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
Lingfeng Ming, Bo Zeng, Chenyang Lyu, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang, Weihua Luo, Kaifu Zhang

TL;DR
Marco-LLM is a multilingual large language model built by continual pre-training on a large multilingual corpus that includes several low-resource languages. It improves performance on cross-lingual tasks and machine translation, narrowing the gap between high- and low-resource languages.
Contribution
Introduces Marco-LLM, a multilingual LLM obtained by continual pre-training of Qwen2 models on large-scale multilingual data covering several low-resource languages, and reports substantial improvements over state-of-the-art LLMs on a range of multilingual benchmarks.
Findings
Substantial improvements on multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, and XCOPA.
Enhanced performance in low-resource language tasks.
Effective in any-to-any machine translation.
Abstract
Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their excellent performance is still largely limited to major world languages, primarily English. Many LLMs continue to face challenges with multilingual tasks, especially when it comes to low-resource languages. To address this issue, we introduced Marco-LLM: Massive multilingual training for cross-lingual enhancement LLM. We have collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models. This effort has resulted in a multilingual LLM named Marco-LLM. Through comprehensive evaluations on various multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA and many others, Marco-LLM has demonstrated substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieved…
Taxonomy
Topics: Digital Communication and Language · Language Development and Disorders · Machine Learning and Algorithms
