Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

Weihua Zheng; Chang Liu; Zhengyuan Liu; Xin Huang; Kui Wu; Muhammad Huzaifah Md Shahrin; Aiti Aw; Roy Ka-Wei Lee

arXiv:2604.10590 · cs.CL · April 14, 2026

TL;DR

This paper introduces a cross-lingual mapping task into the pre-training of multilingual LLMs, improving performance on cross-lingual tasks such as translation and question answering without requiring extensive parallel data.

Contribution

The paper proposes a bi-directional language mapping method, applied during pre-training, to enhance cross-lingual performance, along with a Language Alignment Coefficient to quantify cross-lingual consistency.

Findings

01. An improvement of up to 11.9 BLEU points in machine translation

02. A 6.72-point increase in CLQA BERTScore-Precision

03. A gain of over 5% in CLNLU accuracy

Abstract

Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual…
