Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

Ziheng Li; Shaohan Huang; Zihan Zhang; Zhi-Hong Deng; Qiang Lou; Haizhen Huang; Jian Jiao; Furu Wei; Weiwei Deng; Qi Zhang

arXiv:2305.09148 · cs.CL · May 17, 2023 · 1 citation

Open Access · 1 Repo · 2 Models

TL;DR

This paper introduces a dual-alignment pre-training framework that enhances cross-lingual sentence embeddings by combining sentence-level and token-level alignment, using a novel representation translation learning task.

Contribution

The paper proposes a dual-alignment pre-training (DAP) method that adds token-level alignment, through a representation translation learning (RTL) task, to sentence-level alignment in order to improve multilingual sentence embeddings.

Findings

01 Significant improvements on three cross-lingual benchmarks

02 Representation translation learning (RTL) is more suitable and more efficient than translation language modeling

03 Effective integration of token-level and sentence-level alignment

Abstract

Recent studies have shown that dual encoder models trained with the sentence-level translation ranking task are effective methods for cross-lingual sentence embedding. However, our research indicates that token-level alignment is also crucial in multilingual scenarios, which has not been fully explored previously. Based on our findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart. This reconstruction objective encourages the model to embed translation information into the token representation. Compared to other token-level alignment methods such as translation language…
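The sentence-level translation ranking task mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a generic bidirectional in-batch ranking loss over cosine similarities, and the function names, the `temperature` value, and the toy two-dimensional embeddings are all hypothetical choices made for illustration. The token-level RTL objective is not sketched here because the abstract excerpt does not specify it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def translation_ranking_loss(src, tgt, temperature=0.1):
    """Bidirectional in-batch ranking loss (illustrative assumption).

    src[i] and tgt[i] are sentence embeddings of a translation pair;
    every other sentence in the batch serves as a negative.
    """
    n = len(src)
    sims = [[cosine(s, t) / temperature for t in tgt] for s in src]
    # Source -> target: the i-th source should rank the i-th target first.
    forward = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    # Target -> source: same objective on the transposed similarity matrix.
    sims_t = [list(col) for col in zip(*sims)]
    backward = -sum(log_softmax(sims_t[i])[i] for i in range(n)) / n
    return 0.5 * (forward + backward)


# Toy embeddings (hypothetical): each source vector is close to its translation.
aligned_src = [[1.0, 0.0], [0.0, 1.0]]
aligned_tgt = [[0.9, 0.1], [0.1, 0.9]]
shuffled_tgt = [aligned_tgt[1], aligned_tgt[0]]

loss_aligned = translation_ranking_loss(aligned_src, aligned_tgt)
loss_shuffled = translation_ranking_loss(aligned_src, shuffled_tgt)
print(loss_aligned, loss_shuffled)
```

The loss is small when each sentence is most similar to its own translation and large when the pairs are mismatched, which is the behavior a dual encoder is trained toward under this kind of objective.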

Code & Models

Repositories

chillingdream/dap (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques