CDA: a Cost Efficient Content-based Multilingual Web Document Aligner

Thuy Vu; Alessandro Moschitti

arXiv:2102.10246·cs.CL·February 23, 2021

TL;DR

CDA is an efficient, scalable, and robust content-based method for aligning multilingual web documents, creating high-quality parallel data for machine translation at industrial scale.

Contribution

The paper introduces a multilingual document alignment approach based on content representations (TF-IDF vectors built with lexical translation models) and demonstrates its effectiveness on web-scale data covering up to 28 languages.

Findings

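01. Achieves performance comparable to state-of-the-art systems on the WMT-16 Bilingual Document Alignment Shared Task benchmark.

02. Remains robust on noisy, web-scale data covering up to 28 languages and millions of documents.

03. Scales effectively to low-resource languages.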

Abstract

We introduce a Content-based Document Alignment approach (CDA), an efficient method for aligning multilingual web documents based on their content in order to create parallel training data for machine translation (MT) systems operating at the industrial level. CDA works in two steps: (i) projecting the documents of a web domain to a shared multilingual space; then (ii) aligning them based on the similarity of their representations in that space. We leverage lexical translation models to build vector representations using TF-IDF. CDA achieves performance comparable to state-of-the-art systems on the WMT-16 Bilingual Document Alignment Shared Task benchmark while operating in a multilingual space. In addition, we created two web-scale datasets to examine the robustness of CDA in an industrial setting involving up to 28 languages and millions of documents. The experiments show that CDA is robust, cost-effective,…
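
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy lexicon, the choice of English as the shared space, the smoothed IDF formula, and the greedy one-to-one matching are all assumptions made for illustration, and every function name here (`project`, `tfidf_vectors`, `cosine`, `align`) is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for a lexical translation model:
# it maps French tokens to English tokens (English is the assumed shared space).
TOY_LEXICON = {
    "chat": "cat", "noir": "black", "maison": "house",
    "grande": "big", "le": "the", "la": "the",
}

def project(tokens, lexicon):
    """Step (i): map tokens into the shared vocabulary; unknown tokens are kept."""
    return [lexicon.get(t, t) for t in tokens]

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def align(src_docs, tgt_docs, lexicon):
    """Step (ii): greedily pair documents by descending cosine similarity."""
    src = [project(d.lower().split(), lexicon) for d in src_docs]
    tgt = [d.lower().split() for d in tgt_docs]
    vecs = tfidf_vectors(src + tgt)
    src_vecs, tgt_vecs = vecs[:len(src)], vecs[len(src):]
    scored = sorted(
        ((cosine(s, t), i, j)
         for i, s in enumerate(src_vecs)
         for j, t in enumerate(tgt_vecs)),
        reverse=True,
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        # Each document is used at most once (an assumed one-to-one constraint).
        if score > 0.0 and i not in used_src and j not in used_tgt:
            pairs.append((i, j, score))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(pairs)

french = ["le chat noir", "la grande maison"]
english = ["the big house", "the black cat"]
pairs = align(french, english, TOY_LEXICON)
print([(i, j) for i, j, _ in pairs])  # [(0, 1), (1, 0)]
```

Each pair is `(source index, target index, similarity)`. The greedy matching is only the simplest choice for a sketch; at the scale the abstract mentions (millions of documents), scoring every source-target pair exhaustively as done here would likely be too costly.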
