TL;DR
This paper introduces an unsupervised, resource-light method for cross-lingual semantic textual similarity based on bilingual word embeddings. It performs comparably to more complex models across multiple tasks and language pairs.
Contribution
The paper presents a novel unsupervised approach that relies only on bilingual word embeddings and a small number of translation pairs, avoiding extensive language resources or tools.
Findings
Achieves near state-of-the-art performance on semantic similarity datasets.
Effective in cross-lingual tasks like plagiarism detection and parallel sentence extraction.
Stable results across diverse language pairs.
Abstract
Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the…
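The two steps the abstract describes, projecting embeddings with a linear translation model and then aligning words by vector similarity, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the translation matrix is fit by ordinary least squares on a few translation pairs and that alignment is greedy on cosine similarity. The function names, the toy 3-dimensional vectors, and the German/English word lists are all hypothetical. How the aligned pairs are turned into a final similarity score is not covered by the excerpt above, so the sketch stops at alignment.

```python
import numpy as np

def fit_translation_matrix(src_vecs, tgt_vecs):
    """Least-squares W such that src_vecs @ W approximates tgt_vecs (assumed fitting method)."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def greedy_align(src_words, tgt_words, sim):
    """Repeatedly pick the most similar unaligned (source, target) word pair."""
    sim = sim.copy()
    pairs = []
    for _ in range(min(len(src_words), len(tgt_words))):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        pairs.append((src_words[i], tgt_words[j], float(sim[i, j])))
        sim[i, :] = -np.inf  # each source word is aligned at most once
        sim[:, j] = -np.inf  # each target word is aligned at most once
    return pairs

# Toy "translation pairs": target vectors are source vectors with permuted axes.
W_true = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])
dict_src = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [1.0, 1.0, 0.0]])
dict_tgt = dict_src @ W_true
W = fit_translation_matrix(dict_src, dict_tgt)

# Toy sentences (hypothetical words and vectors).
src_words = ["hund", "katze", "haus"]
src_vecs = np.array([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.1],
                     [0.1, 0.0, 1.0]])
tgt_words = ["house", "dog", "cat"]
tgt_vecs = (src_vecs @ W_true)[[2, 0, 1]]  # same meanings, different word order

# Project source words into the target space, then align by similarity.
pairs = greedy_align(src_words, tgt_words, cosine_matrix(src_vecs @ W, tgt_vecs))
for s, t, score in pairs:
    print(f"{s} -> {t} ({score:.3f})")
```

In this toy setup the fitted matrix recovers the axis permutation exactly, so each source word aligns with its translation. With real embeddings the mapping is only approximate, and the alignment quality depends on the embeddings and on the translation pairs used for fitting.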