A Resource-Light Method for Cross-Lingual Semantic Textual Similarity

Goran Glavaš; Marc Franco-Salvador; Simone Paolo Ponzetto; Paolo Rosso

arXiv:1801.06436 · cs.CL · January 22, 2018

TL;DR

This paper introduces an unsupervised, resource-light method for cross-lingual semantic textual similarity based on bilingual word embeddings. It achieves performance comparable to that of more complex models across multiple tasks and language pairs.

Contribution

The paper presents a novel unsupervised approach that relies only on bilingual word embeddings and a minimal set of translation pairs, avoiding dependence on extensive language resources or tools.

Findings

01 Achieves near state-of-the-art performance on semantic similarity datasets.

02 Effective in cross-lingual tasks like plagiarism detection and parallel sentence extraction.

03 Stable results across diverse language pairs.

Abstract

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognizers) that do not exist for many languages (or language pairs). In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model. We then align words according to the similarity of their vectors in the…
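The two steps the abstract names before it is cut off, projecting word vectors with a linear map and aligning words by vector similarity, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the least-squares fit is one common way to obtain a linear translation matrix, the greedy one-to-one alignment is one simple alignment strategy, the function names (learn_projection, cosine_matrix, greedy_align) are hypothetical, and the vectors are toy data. How the aligned pairs are turned into a final similarity score is not visible in the truncated abstract and is therefore not shown.

```python
import numpy as np


def learn_projection(src_vecs, tgt_vecs):
    """Fit W by least squares so that src_vecs @ W approximates tgt_vecs.

    Rows are embeddings of word translation pairs (assumed setup).
    """
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W


def cosine_matrix(a, b):
    """Pairwise cosine similarities between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def greedy_align(src_sent, tgt_sent, W):
    """Greedy one-to-one word alignment (an illustrative choice).

    Repeatedly takes the most similar still-unaligned word pair after
    projecting the source words into the target space.
    Returns a list of (source index, target index, cosine similarity).
    """
    sims = cosine_matrix(src_sent @ W, tgt_sent)
    pairs, used_src, used_tgt = [], set(), set()
    for flat in np.argsort(-sims, axis=None):  # best similarity first
        i, j = divmod(int(flat), sims.shape[1])
        if i in used_src or j in used_tgt:
            continue
        pairs.append((i, j, float(sims[i, j])))
        used_src.add(i)
        used_tgt.add(j)
    return pairs


# Toy data: the "source language" space is a rotated copy of the target space.
M = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])          # orthogonal, so its inverse is M.T
rng = np.random.default_rng(0)
tgt_train = rng.normal(size=(20, 3))     # target-side vectors of 20 translation pairs
src_train = tgt_train @ M.T              # matching source-side vectors

W = learn_projection(src_train, tgt_train)

tgt_sent = np.eye(3)                     # three target-language "words"
src_sent = tgt_sent[[2, 0, 1]] @ M.T     # the same words, reordered, in source space

pairs = greedy_align(src_sent, tgt_sent, W)
print(pairs)
```

In this toy setup the learned matrix recovers the rotation, so each source word is paired with its reordered counterpart. With real embeddings the fit would be approximate and the alignment similarities would fall below 1.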

Code & Models

Repositories

https://bitbucket.org/gg42554/cl-sts (official)
