Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval

Shunyu Zhang; Yaobo Liang; Ming Gong; Daxin Jiang; Nan Duan

arXiv:2302.01626 · cs.CL · September 8, 2025 · 1 citation

TL;DR

This paper introduces MSM, a multilingual pre-trained language model that models sequential sentence relations to enhance cross-lingual dense retrieval, significantly outperforming existing models on multiple tasks.

Contribution

The paper proposes a novel masked sentence model (MSM) that captures universal sequential sentence relations across languages for improved cross-lingual retrieval.

Findings

01. MSM outperforms existing models on four cross-lingual retrieval tasks.

02. The model effectively captures sentence order information across languages.

03. MSM demonstrates stronger cross-lingual retrieval capabilities than existing models.

Abstract

Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite their successes, they are general-purpose PLMs, while multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that the sentences in parallel documents appear in approximately the same order, a property that is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To…
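The two-level design described in the abstract (a sentence encoder feeding a shared document encoder) can be illustrated with a small sketch. This is not the authors' implementation: the hashing-based `encode_sentence`, the single attention pass in `encode_document`, the `DIM` size, and the example documents are all hypothetical stand-ins chosen so the snippet runs with the Python standard library alone. It shows only the data flow, namely sentences to sentence vectors to order-aware contextual vectors, with the same document encoder applied to every language.

```python
import hashlib
import math

DIM = 8  # toy embedding size (an assumption for illustration, not from the paper)


def encode_sentence(sentence: str) -> list[float]:
    """Toy stand-in for the sentence encoder: hash tokens into a normalized vector."""
    vec = [0.0] * DIM
    for token in sentence.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0 if digest[1] % 2 == 0 else -1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
    return [x / norm for x in vec]


def add_position(vec: list[float], index: int) -> list[float]:
    """Add a sinusoidal position signal so the sentence order is visible."""
    out = []
    for d, x in enumerate(vec):
        angle = index / (10000 ** ((d - d % 2) / DIM))
        out.append(x + (math.sin(angle) if d % 2 == 0 else math.cos(angle)))
    return out


def softmax(scores: list[float]) -> list[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_document(sentence_vectors: list[list[float]]) -> list[list[float]]:
    """Toy document encoder: one self-attention pass over the sentence vectors.

    The same function is used for every language, mirroring the shared
    document encoder described in the abstract.
    """
    inputs = [add_position(v, i) for i, v in enumerate(sentence_vectors)]
    outputs = []
    for query in inputs:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM)
            for key in inputs
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, inputs)) for d in range(DIM)]
        )
    return outputs


def encode(document: list[str]) -> list[list[float]]:
    """Sentence encoder first, then the shared document encoder."""
    return encode_document([encode_sentence(s) for s in document])


# Hypothetical parallel documents whose sentences follow the same order.
english = ["The cat sat on the mat.", "It was warm.", "Then it slept."]
german = ["Die Katze saß auf der Matte.", "Es war warm.", "Dann schlief sie."]

en_ctx = encode(english)
de_ctx = encode(german)
print(len(en_ctx), len(en_ctx[0]))  # one DIM-sized vector per sentence
```

In a real system both encoders would be trained Transformer networks; the toy functions here only make the hierarchy and the language-agnostic sharing concrete.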

Code & Models

Repositories

shunyuzh/MSM
PyTorch · Official

Videos

Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: XLM-R · mBERT