Cross-lingual paraphrase identification
Inessa Fedorova, Aleksei Musatow

TL;DR
This paper presents a contrastively trained bi-encoder model for cross-lingual paraphrase identification. Its performance is competitive with state-of-the-art cross-encoders, and its embeddings remain usable for multilingual semantic tasks such as semantic search.
Contribution
It introduces a contrastive bi-encoder approach for multilingual paraphrase detection that balances performance and embedding quality, enabling versatile downstream applications.
Findings
Performance comparable to state-of-the-art cross-encoders, with a relative drop of 7-10% on the chosen dataset
Effective embeddings for semantic search across languages
Maintains high embedding quality for downstream tasks
Abstract
The paraphrase identification task involves measuring semantic similarity between two short sentences. It is a difficult task, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach allows us to use model-produced embeddings for various tasks, such as semantic search. We evaluate our model on downstream tasks and also assess embedding space quality. Our performance is comparable to state-of-the-art cross-encoders, with a relative drop of only 7-10% on the chosen dataset, while retaining good embedding quality.
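As a rough illustration of the idea in the abstract, the sketch below shows how a bi-encoder's embeddings can be scored with a contrastive objective and then reused for paraphrase detection and semantic search. The abstract does not state which loss, similarity function, or thresholds the authors use, so everything here is an assumption: the loss is a generic in-batch InfoNCE-style objective, similarity is cosine, the names `in_batch_contrastive_loss`, `is_paraphrase`, and `search` are hypothetical, and the small hand-written vectors stand in for real encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives (an assumed objective).

    positives[i] is the paraphrase of anchors[i]; every other positives[j]
    in the batch is treated as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching pair as the correct class.
        total += log_z - logits[i]
    return total / len(anchors)


def is_paraphrase(u, v, threshold=0.8):
    """Binary decision from embedding similarity (threshold is hypothetical)."""
    return cosine(u, v) >= threshold


def search(query, corpus, top_k=1):
    """Indices of the top_k corpus embeddings most similar to the query."""
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: cosine(query, corpus[i]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy 2-d "embeddings" standing in for sentences in two languages.
source_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_embs = [[1.0, 0.0], [0.0, 1.0]]   # paraphrases land close together
shuffled_embs = [[0.0, 1.0], [1.0, 0.0]]  # paraphrases land far apart

aligned_loss = in_batch_contrastive_loss(source_embs, aligned_embs)
shuffled_loss = in_batch_contrastive_loss(source_embs, shuffled_embs)
```

Because each sentence is encoded independently, corpus embeddings can be computed once and reused for every query. That reuse is the practical advantage of a bi-encoder over a cross-encoder, which must process each sentence pair jointly.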
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
