BilBOWA: Fast Bilingual Distributed Representations without Word Alignments

Stephan Gouws; Yoshua Bengio; Greg Corrado

arXiv:1410.2455 · stat.ML · February 5, 2016 · 314 cites

TL;DR

BilBOWA is a fast, scalable method for learning bilingual word embeddings directly from monolingual data plus a smaller set of raw-text sentence-aligned data, without requiring word alignments.

Contribution

The paper presents an efficient bilingual embedding model that eliminates the need for word-aligned parallel data, so cross-lingual representations can be learned at scale from monolingual text and a limited amount of sentence-aligned text.

Findings

01. Outperforms state-of-the-art methods on a cross-lingual document classification task

02. Outperforms state-of-the-art methods on a lexical translation task on WMT11 data

03. Scales efficiently to large monolingual datasets

Abstract

We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally-efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data.
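The bag-of-words cross-lingual objective mentioned in the abstract can be illustrated with a small sketch. It assumes the objective is the squared distance between the mean word vectors of a sentence-aligned pair, which is one natural reading of "bag-of-words without alignments"; the abstract itself does not spell out the formula. The function names, toy vocabulary, and learning rate are hypothetical, and the two monolingual noise-contrastive language-model losses that this term regularizes are omitted.

```python
def mean_vector(sentence, embeddings):
    """Average the embeddings of the words in a sentence (bag-of-words)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in sentence:
        for k, value in enumerate(embeddings[word]):
            total[k] += value
    return [t / len(sentence) for t in total]


def bow_loss(src_sentence, tgt_sentence, src_emb, tgt_emb):
    """Assumed objective: squared L2 distance between the sentence means."""
    src_mean = mean_vector(src_sentence, src_emb)
    tgt_mean = mean_vector(tgt_sentence, tgt_emb)
    return sum((s - t) ** 2 for s, t in zip(src_mean, tgt_mean))


def sgd_step(src_sentence, tgt_sentence, src_emb, tgt_emb, lr=0.1):
    """One gradient step on bow_loss for a single sampled sentence pair."""
    src_mean = mean_vector(src_sentence, src_emb)
    tgt_mean = mean_vector(tgt_sentence, tgt_emb)
    diff = [s - t for s, t in zip(src_mean, tgt_mean)]
    m, n = len(src_sentence), len(tgt_sentence)
    # Gradient w.r.t. each source word occurrence is 2 * diff / m;
    # w.r.t. each target word occurrence it is -2 * diff / n.
    for word in src_sentence:
        src_emb[word] = [v - lr * 2 * d / m for v, d in zip(src_emb[word], diff)]
    for word in tgt_sentence:
        tgt_emb[word] = [v + lr * 2 * d / n for v, d in zip(tgt_emb[word], diff)]


# Toy two-dimensional embeddings (hypothetical values, not from the paper).
src_emb = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
tgt_emb = {"le": [-1.0, 0.0], "chat": [0.0, -1.0]}
pair = (["the", "cat"], ["le", "chat"])

before = bow_loss(pair[0], pair[1], src_emb, tgt_emb)
sgd_step(pair[0], pair[1], src_emb, tgt_emb)
after = bow_loss(pair[0], pair[1], src_emb, tgt_emb)
print(before, after)  # the loss shrinks after one step
```

Note that no word-level alignment appears anywhere in the sketch: only the pairing of whole sentences is used, which is the property the abstract emphasizes.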

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis