SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

Masoud Jalili Sabet; Philipp Dufter; François Yvon; Hinrich Schütze

arXiv:2004.08728 · cs.CL · April 19, 2021 · 1 citation

TL;DR

This paper introduces SimAlign, a word alignment method that uses multilingual embeddings derived solely from monolingual data. It produces high-quality alignments without any parallel training data, outperforming traditional statistical aligners on four language pairs and performing comparably on two.

Contribution

The paper presents a new approach to word alignment that eliminates the need for parallel data by leveraging static and contextualized multilingual embeddings created from monolingual sources.

Findings

01. Alignments from embeddings outperform statistical aligners on four language pairs.

02. For English-German, contextualized embeddings achieve an F1 score 5 percentage points higher than eflomal.

03. On two further language pairs, the method performs comparably to statistical aligners.
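
For readers unfamiliar with the metric in finding 02, the following is a minimal sketch of how an alignment F1 can be computed from a predicted set of alignment links and a gold set. It assumes a single gold set of (source index, target index) pairs; the function name `alignment_f1` is illustrative, and the exact evaluation protocol used in the paper is not stated on this page.

```python
def alignment_f1(predicted, gold):
    """F1 between predicted and gold alignment links.

    Each link is a (source_index, target_index) pair.
    Assumption: one gold set, with no distinction between link types.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    correct = len(predicted & gold)          # links present in both sets
    precision = correct / len(predicted)     # share of predicted links that are right
    recall = correct / len(gold)             # share of gold links that were found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three predicted links match the gold standard.
predicted_links = [(0, 0), (1, 1), (2, 1)]
gold_links = [(0, 0), (1, 1), (2, 2)]
score = alignment_f1(predicted_links, gold_links)
```

A difference of 5 percentage points corresponds to a gap of 0.05 between two such scores.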

Abstract

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners, even with abundant parallel data; e.g., contextualized embeddings…
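
The key idea in the abstract, aligning words by comparing their multilingual embeddings, can be illustrated with a small sketch. This is only one simple way to turn embedding similarities into alignments (keeping pairs that are each other's nearest neighbour under cosine similarity) and is not necessarily the procedure used in the paper. The function names and the toy vectors are hypothetical; real inputs would be embeddings produced by a multilingual model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_mutual_argmax(src_vecs, tgt_vecs):
    """Return (source_index, target_index) pairs that are mutual best matches.

    Assumption: one embedding vector per word, in sentence order.
    """
    if not src_vecs or not tgt_vecs:
        return []
    # Similarity matrix: rows are source words, columns are target words.
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    n_src, n_tgt = len(src_vecs), len(tgt_vecs)
    # Best target for every source word, and best source for every target word.
    best_tgt = [max(range(n_tgt), key=lambda j: sim[i][j]) for i in range(n_src)]
    best_src = [max(range(n_src), key=lambda i: sim[i][j]) for j in range(n_tgt)]
    # Keep a link only when both directions agree.
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy vectors standing in for embeddings of "das Haus" and "the house".
src_vecs = [[1.0, 0.1], [0.1, 1.0]]
tgt_vecs = [[0.9, 0.2], [0.2, 0.8]]
links = align_mutual_argmax(src_vecs, tgt_vecs)
```

Requiring agreement in both directions favours precision: a word with no confident counterpart is simply left unaligned. No parallel training data enters the computation; the only inputs are the two sets of embeddings.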

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification