EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning
Ping Guo, Xiangpeng Wei, Yue Hu, Baosong Yang, Dayiheng Liu, Fei Huang, Jun Xie

TL;DR
EMMA-X introduces an EM-like algorithm for multilingual pre-training that leverages non-parallel data to learn universal cross-lingual representations, achieving state-of-the-art results on a comprehensive benchmark.
Contribution
The paper proposes EMMA-X, a novel EM-like pre-training method that uses non-parallel multilingual data to learn cross-lingual universals, advancing multilingual representation learning.
Findings
Achieves state-of-the-art performance on 12 cross-lingual tasks.
Demonstrates the effectiveness of EMMA-X in learning universal representations.
Provides geometric analysis confirming the quality of learned representations.
Abstract
Expressing universal semantics common to all languages is helpful in understanding the meanings of complex and culture-specific sentences. The research theme underlying this scenario focuses on learning universal representations across languages with the usage of massive parallel corpora. However, due to the sparsity and scarcity of parallel data, there is still a big challenge in learning authentic "universals" for any two languages. In this paper, we propose EMMA-X: an EM-like Multilingual pre-training Algorithm, to learn (X)Cross-lingual universals with the aid of excessive multilingual non-parallel data. EMMA-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework. Both the extra semantic classifier and the cross-lingual sentence encoder approximate the semantic relation of two sentences, and supervise each…
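The abstract is cut off here, so the following is only a rough sketch of the general idea it describes: two models that each estimate the semantic relation of a sentence pair and take turns supervising each other. It is not the EMMA-X implementation. The toy bilinear "encoder", the logistic-regression "classifier", the random vectors standing in for sentence pairs, and every function and variable name are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder_prob(W, x, y):
    # Toy "encoder" view (assumption): bilinear score x_i^T W y_i squashed to (0, 1).
    return sigmoid(np.einsum("id,de,ie->i", x, W, y))

def classifier_prob(w, b, x, y):
    # Toy "semantic classifier" view (assumption): logistic regression on |x - y|.
    return sigmoid(np.abs(x - y) @ w + b)

def mutual_supervision_round(W, w, b, x, y, lr=0.1):
    """One alternating round: each model takes a gradient step toward the
    other's current estimate, which is treated as a fixed soft label."""
    n = x.shape[0]
    feats = np.abs(x - y)
    enc_p = encoder_prob(W, x, y)
    clf_p = classifier_prob(w, b, x, y)

    # Step 1: classifier moves toward the encoder's estimates
    # (gradient of binary cross-entropy with soft targets enc_p).
    diff_c = clf_p - enc_p
    w = w - lr * feats.T @ diff_c / n
    b = b - lr * diff_c.mean()

    # Step 2: encoder moves toward the refreshed classifier's estimates.
    target = classifier_prob(w, b, x, y)
    diff_e = enc_p - target
    W = W - lr * x.T @ (diff_e[:, None] * y) / n

    # Mean absolute disagreement between the two views after the update.
    gap = float(np.mean(np.abs(encoder_prob(W, x, y) - classifier_prob(w, b, x, y))))
    return W, w, b, gap

# Random vectors stand in for non-parallel sentence pairs (assumption).
rng = np.random.default_rng(0)
d, n = 4, 32
x = rng.normal(size=(n, d))
y = rng.normal(size=(n, d))
W = rng.normal(scale=0.1, size=(d, d))
w = rng.normal(scale=0.1, size=d)
b = 0.0
for _ in range(50):
    W, w, b, gap = mutual_supervision_round(W, w, b, x, y)
```

No labels or parallel pairs are used anywhere in the loop: the only training signal for each model is the other model's current estimate, which is the property the abstract attributes to the EM framework. A real system would also need some mechanism to keep both models from collapsing to a constant prediction; this sketch does not address that.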
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
