Cross-lingual paraphrase identification

Inessa Fedorova; Aleksei Musatow

arXiv:2406.15066·cs.CL·June 24, 2024

Open Access

TL;DR

This paper presents a contrastively trained bi-encoder model for cross-lingual paraphrase identification, achieving competitive performance with state-of-the-art methods while maintaining high-quality embeddings for multilingual semantic tasks.

Contribution

It introduces a contrastive bi-encoder approach for multilingual paraphrase detection that balances performance and embedding quality, enabling versatile downstream applications.

Findings

01. Performance comparable to state-of-the-art cross-encoders, with a relative drop of only 7-10% on the chosen dataset

02. Embeddings that are effective for semantic search across languages

03. Embedding quality that remains high enough for downstream tasks
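To make the semantic-search use case concrete, here is a minimal sketch of how bi-encoder embeddings are typically consumed: each sentence is encoded once, and both paraphrase decisions and search reduce to cosine similarity between vectors. The three-dimensional vectors, the `threshold` value, and the function names are hypothetical stand-ins for illustration, not outputs or settings from the paper's model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_paraphrase(u, v, threshold=0.8):
    """Threshold on cosine similarity; 0.8 is an assumed, untuned value."""
    return cosine(u, v) >= threshold

def search(query_vec, corpus, top_k=2):
    """Rank corpus sentences by similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy embeddings (hypothetical): a real bi-encoder would produce these.
EN = "The cat sleeps on the sofa."
DE = "Die Katze liegt auf dem Sofa."
FIN = "Stock prices fell sharply today."
corpus = {
    EN: [0.90, 0.10, 0.00],
    DE: [0.88, 0.15, 0.05],
    FIN: [0.00, 0.20, 0.95],
}

query = [0.92, 0.12, 0.02]  # embedding of a query about a sleeping cat
results = search(query, corpus, top_k=2)
```

Because the corpus vectors can be computed and indexed ahead of time, this is where a bi-encoder has a practical advantage over a cross-encoder, which must process every query-candidate pair jointly.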

Abstract

The paraphrase identification task involves measuring semantic similarity between two short sentences. The task is difficult in a single language, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach allows us to use model-produced embeddings for various tasks, such as semantic search. We evaluate our model on downstream tasks and also assess embedding space quality. Our performance is comparable to state-of-the-art cross-encoders, with only a minimal relative drop of 7-10% on the chosen dataset, while retaining good embedding quality.
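The abstract does not state which contrastive objective is used. As an illustration only, the sketch below assumes a common choice for bi-encoders: an InfoNCE-style loss with in-batch negatives, where each anchor sentence should score higher against its own paraphrase than against the other paraphrases in the batch. The function name `info_nce_loss`, the `temperature` value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy over the batch, where positives[i] is the
    paraphrase of anchors[i] and every other positives[j] is a negative.
    The temperature of 0.05 is an assumed value, not one from the paper."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n

# Toy batch (hypothetical embeddings): row i of each list forms a pair.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce_loss(anchors, positives)
# Reversing the positives pairs each anchor with the wrong sentence.
misaligned_loss = info_nce_loss(anchors, positives[::-1])
```

The loss is near zero when each anchor is closest to its own paraphrase and grows when the pairing is wrong, which is the signal that pulls paraphrases together in the embedding space during training.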

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification