BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
Stephan Gouws, Yoshua Bengio, Greg Corrado

TL;DR
BilBOWA is a fast, scalable method for learning bilingual word embeddings from monolingual data plus a smaller set of raw-text sentence-aligned data, without requiring word alignments.
Contribution
The paper presents a novel, efficient bilingual embedding model that eliminates the need for word alignments, enabling scalable cross-lingual representation learning from monolingual and limited aligned data.
Findings
Outperforms state-of-the-art methods on a cross-lingual document classification task
Outperforms state-of-the-art methods on a lexical translation task on WMT11 data
Scales efficiently to large monolingual datasets
Abstract
We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally-efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data.
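The abstract describes a bag-of-words cross-lingual objective computed over sentence-aligned pairs. The sketch below is one plausible reading of that idea, not the authors' implementation. It assumes that each sentence is represented by the mean of its word embeddings and that the penalty is the squared Euclidean distance between the two means. The function names (`mean_vector`, `bow_alignment_penalty`) and the toy embedding tables are hypothetical. The sampling step and the two noise-contrastive language models that this term regularizes are omitted.

```python
from typing import Dict, List

# Hypothetical toy embedding table type: word -> vector.
Embedding = Dict[str, List[float]]


def mean_vector(sentence: List[str], table: Embedding) -> List[float]:
    """Bag-of-words sentence representation: the mean of its word vectors."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for word in sentence:
        for k, value in enumerate(table[word]):
            total[k] += value
    return [t / len(sentence) for t in total]


def bow_alignment_penalty(
    src_sentence: List[str],
    tgt_sentence: List[str],
    src_table: Embedding,
    tgt_table: Embedding,
) -> float:
    """Squared Euclidean distance between the two mean sentence vectors.

    Assumption: this is the form of the cross-lingual term. Note that it
    needs only a sentence-level alignment, not word-level alignments.
    """
    src_mean = mean_vector(src_sentence, src_table)
    tgt_mean = mean_vector(tgt_sentence, tgt_table)
    return sum((a - b) ** 2 for a, b in zip(src_mean, tgt_mean))


# Toy 2-dimensional embeddings, made up for illustration only.
en = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
fr = {"le": [1.0, 0.0], "chat": [0.0, 1.0]}

aligned_penalty = bow_alignment_penalty(["the", "cat"], ["le", "chat"], en, fr)
mismatched_penalty = bow_alignment_penalty(["the"], ["chat"], en, fr)
```

In this toy setup the aligned pair has identical mean vectors, so its penalty is zero, while the mismatched pair is penalized. In training, a term like this would be added to the monolingual losses so that the gradient pulls the embeddings of aligned sentences toward each other.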
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
