CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB
Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin

TL;DR
This paper introduces CCMatrix, a large-scale method for mining billions of high-quality parallel sentences across 38 languages from web data, significantly improving machine translation quality without human-labeled data.
Contribution
The authors present a unified margin-based bitext mining approach applied to massive monolingual corpora, creating the largest multilingual parallel corpus to date for training translation systems.
Findings
Mined 4.5 billion parallel sentences from 32.7 billion monolingual sentences.
Achieved state-of-the-art translation performance on multiple language pairs using only mined data.
Demonstrated effective translation quality for distant language pairs like Russian/Japanese.
Abstract
We show that margin-based bitext mining in a multilingual sentence space can be applied to monolingual corpora of billions of sentences. We use ten snapshots of a curated Common Crawl corpus (Wenzek et al., 2019) totalling 32.7 billion unique sentences. Using one unified approach for 38 languages, we were able to mine 4.5 billion parallel sentences, out of which 661 million are aligned with English. 20 language pairs have more than 30 million parallel sentences, 112 more than 10 million, and most more than one million, including direct alignments between many European or Asian languages. To evaluate the quality of the mined bitexts, we train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Using our mined bitexts only and no human translated parallel data, we achieve a new state-of-the-art for a single system on the WMT'19 test set for…
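The abstract names margin-based mining in a shared sentence space without giving the scoring function. As a rough illustration only, the sketch below uses the ratio-margin formulation common in the bitext mining literature: the cosine similarity of a candidate pair divided by the average similarity of each sentence to its k nearest neighbours in the other language. The function names, the value of k, the threshold, and the brute-force similarity matrix are all assumptions made for this example. They are not taken from the paper, and a real system working on billions of sentences would need an approximate nearest-neighbour index rather than a dense matrix.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def margin_scores(src, tgt, k=4):
    """Ratio-margin score for every (source, target) pair (illustrative).

    score(x, y) = cos(x, y) / ((avg_knn(x) + avg_knn(y)) / 2)
    where avg_knn is the mean cosine to the k nearest neighbours
    in the other language. Brute force: O(n_src * n_tgt) memory.
    """
    src, tgt = normalize(src), normalize(tgt)
    sim = src @ tgt.T                      # cosine similarity matrix
    k_src = min(k, tgt.shape[0])           # neighbours available per source
    k_tgt = min(k, src.shape[0])           # neighbours available per target
    # Mean similarity of each source sentence to its k nearest targets.
    src_knn = np.sort(sim, axis=1)[:, -k_src:].mean(axis=1)
    # Mean similarity of each target sentence to its k nearest sources.
    tgt_knn = np.sort(sim, axis=0)[-k_tgt:, :].mean(axis=0)
    denom = (src_knn[:, None] + tgt_knn[None, :]) / 2.0
    return sim / denom


def mine_pairs(src, tgt, k=4, threshold=1.06):
    """Keep the best-scoring target for each source if it clears the threshold.

    The threshold here is a placeholder for illustration, not a tuned value.
    Returns a list of (source_index, target_index, score) tuples.
    """
    scores = margin_scores(src, tgt, k)
    best = scores.argmax(axis=1)
    return [
        (i, int(j), float(scores[i, j]))
        for i, j in enumerate(best)
        if scores[i, j] >= threshold
    ]


if __name__ == "__main__":
    # Toy data: targets are shuffled, slightly noisy copies of the sources,
    # standing in for embeddings from a multilingual sentence encoder.
    rng = np.random.default_rng(0)
    src_emb = rng.normal(size=(8, 32))
    order = rng.permutation(8)
    tgt_emb = src_emb[order] + 0.01 * rng.normal(size=(8, 32))
    for i, j, score in mine_pairs(src_emb, tgt_emb):
        print(f"source {i} <-> target {j} (margin {score:.2f})")
```

Dividing by the neighbourhood average is what separates a margin criterion from a plain cosine cutoff: a pair is kept only when it stands out from its nearest competitors, which makes a single threshold more comparable across sentences and languages.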
