Exploiting Similarities among Languages for Machine Translation
Tomas Mikolov, Quoc V. Le, Ilya Sutskever

TL;DR
This paper presents a simple method for automatically generating and extending dictionaries and phrase tables for machine translation. It learns distributed word representations from large monolingual data and a linear mapping between the two languages' vector spaces from small bilingual data, reaching almost 90% precision@5 for English-Spanish word translation.
Contribution
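It introduces an approach that makes few assumptions about the languages involved: word representations are learned from large monolingual data, and the cross-lingual linear mapping is learned from a small amount of bilingual data, so translation resources can be extended with minimal bilingual supervision.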
Findings
Achieves nearly 90% precision@5 for English-Spanish word translation
Makes few assumptions about the languages, so it can be applied to any language pair
Automates extension of translation dictionaries and phrase tables
Abstract
Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.
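The linear-mapping step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the vectors are synthetic random data standing in for learned word representations, the helper names (`fit_linear_map`, `top_k_translations`, `precision_at_k`) are hypothetical, the mapping is fit with a closed-form least-squares solve, and candidates are ranked by cosine similarity.

```python
import numpy as np


def fit_linear_map(src_vecs, tgt_vecs):
    """Return W minimizing ||src_vecs @ W - tgt_vecs||^2 (closed-form least squares)."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W


def top_k_translations(query_vecs, W, tgt_vocab_vecs, k=5):
    """Map source vectors into the target space and return the indices of the
    k target vectors with the highest cosine similarity to each mapped query."""
    mapped = query_vecs @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_vocab_vecs / np.linalg.norm(tgt_vocab_vecs, axis=1, keepdims=True)
    sims = mapped @ tgt.T
    return np.argsort(-sims, axis=1)[:, :k]


def precision_at_k(pred_indices, gold_indices):
    """Fraction of queries whose gold translation appears among the k candidates."""
    hits = [gold in row for row, gold in zip(pred_indices, gold_indices)]
    return float(np.mean(hits))


# Synthetic stand-in for two monolingual embedding spaces (assumption: the
# target space is an exactly linear image of the source space plus small noise).
rng = np.random.default_rng(0)
n_words, d_src, d_tgt, n_train = 500, 20, 15, 300
src = rng.normal(size=(n_words, d_src))
true_map = rng.normal(size=(d_src, d_tgt))
tgt = src @ true_map + 0.01 * rng.normal(size=(n_words, d_tgt))

# The "small bilingual data": the first n_train word pairs act as the seed dictionary.
W = fit_linear_map(src[:n_train], tgt[:n_train])

# Evaluate on held-out words; word i in the source maps to word i in the target.
test_ids = np.arange(n_train, n_words)
preds = top_k_translations(src[test_ids], W, tgt, k=5)
p_at_5 = precision_at_k(preds, test_ids)
print(f"precision@5 on synthetic data: {p_at_5:.2f}")
```

Because the synthetic target space is constructed to be nearly linear in the source space, the score here says nothing about real language pairs; the sketch only shows the mechanics of fitting the mapping on a seed dictionary and ranking candidate translations for unseen words.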
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
