Bilingual Distributed Word Representations from Document-Aligned Comparable Data
Ivan Vulić, Marie-Francine Moens

TL;DR
This paper introduces a method for learning bilingual word embeddings (BWEs) from document-aligned comparable data alone, without parallel sentence-aligned corpora or translation dictionaries, and reports gains on two semantic tasks: bilingual lexicon extraction and suggesting word translations in context.
Contribution
The authors present a new model that learns bilingual word representations solely from document-aligned data, outperforming previous models that depended on parallel data or lexical resources.
Findings
Significantly outperforms previous models on bilingual lexicon extraction.
Achieves the best results in suggesting word translations in context.
Effective for multiple language pairs.
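As an illustration of the bilingual lexicon extraction task mentioned above, here is a minimal sketch of how translation candidates might be read off a shared bilingual embedding space by nearest-neighbor search under cosine similarity. This is not the authors' implementation: the function names, the toy words, and the three-dimensional vectors are all hypothetical, and the sketch assumes the embeddings have already been induced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def extract_lexicon(src_emb, tgt_emb):
    """Map each source word to its most similar target word.

    Both arguments are dicts from word to vector, and both vocabularies
    are assumed to live in the same (shared) embedding space.
    """
    lexicon = {}
    for src_word, src_vec in src_emb.items():
        best = max(tgt_emb, key=lambda t: cosine(src_vec, tgt_emb[t]))
        lexicon[src_word] = best
    return lexicon

# Hypothetical toy embeddings, made up purely for illustration.
src_emb = {"dog": [0.9, 0.1, 0.0], "house": [0.1, 0.8, 0.2]}
tgt_emb = {
    "casa": [0.2, 0.9, 0.1],
    "perro": [1.0, 0.2, 0.0],
    "libro": [0.0, 0.1, 1.0],
}

lexicon = extract_lexicon(src_emb, tgt_emb)
print(lexicon)
```

In practice the vocabularies would be far larger and the search would typically be vectorized, but the ranking principle stays the same.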
Abstract
We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which relied heavily on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, we show that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the…
