Extracting Parallel Paragraphs from Common Crawl
Jakub Kúdela, Irena Holubová, Ondřej Bojar

TL;DR
This paper introduces a scalable method combining bilingual embeddings and locality-sensitive hashing to extract parallel text segments from web pages regardless of their structure, validated on large web-crawled datasets.
Contribution
It presents a novel approach that does not rely on page structure assumptions, enabling extraction of parallel data from diverse web sources at scale.
Findings
The method successfully realigns segments from a large parallel corpus.
Scales to hundreds of terabytes of web data.
Validated on real-world Common Crawl data.
Abstract
Most current methods for mining parallel texts from the web assume that the pages of a web site share the same structure across languages. We believe that a non-negligible amount of parallel data remains spread across sources that do not satisfy this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing, which allows us to efficiently identify pairs of parallel segments located anywhere on the pages of a given web domain, regardless of their structure. We validate our method by realigning segments from a large parallel corpus. Another experiment with real-world data provided by the Common Crawl Foundation confirms that our solution scales to web-crawled data sets hundreds of terabytes in size.
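The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a tiny hand-made bilingual embedding table (`EMB`, standing in for vectors that bivec would learn, with translations given identical vectors for simplicity), represents each segment as the average of its word vectors, and uses random-hyperplane hashing as one possible locality-sensitive hashing family. All function names and the 0.9 cosine threshold are hypothetical choices made for this example.

```python
import math
import random
from collections import defaultdict

# Hypothetical toy bilingual embeddings (English/German). In practice these
# would be learned by a bilingual model such as bivec; here each translation
# pair is simply assigned the same vector.
EMB = {
    "dog": (1.0, 0.0, 0.0, 0.0), "hund": (1.0, 0.0, 0.0, 0.0),
    "barks": (0.0, 1.0, 0.0, 0.0), "bellt": (0.0, 1.0, 0.0, 0.0),
    "sun": (0.0, 0.0, 1.0, 0.0), "sonne": (0.0, 0.0, 1.0, 0.0),
    "shines": (0.0, 0.0, 0.0, 1.0), "scheint": (0.0, 0.0, 0.0, 1.0),
}
DIM = 4


def segment_vector(segment):
    """Average the embeddings of the known words in a segment."""
    words = [w for w in segment.lower().split() if w in EMB]
    if not words:
        return (0.0,) * DIM
    return tuple(sum(EMB[w][i] for w in words) / len(words) for i in range(DIM))


def make_hyperplanes(n_bits, seed=0):
    """Draw n_bits random hyperplanes (one Gaussian normal vector each)."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(0.0, 1.0) for _ in range(DIM)) for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def find_parallel(src_segments, tgt_segments, n_bits=8, threshold=0.9):
    """Bucket target segments by LSH signature, then verify candidates."""
    planes = make_hyperplanes(n_bits)
    buckets = defaultdict(list)
    for t in tgt_segments:
        tv = segment_vector(t)
        buckets[signature(tv, planes)].append((t, tv))

    pairs = []
    for s in src_segments:
        sv = segment_vector(s)
        # Only segments sharing a bucket are compared, not all pairs.
        for t, tv in buckets.get(signature(sv, planes), []):
            if cosine(sv, tv) >= threshold:
                pairs.append((s, t))
    return pairs


if __name__ == "__main__":
    english = ["dog barks", "sun shines", "dog shines"]
    german = ["sonne scheint", "hund bellt"]
    for pair in find_parallel(english, german):
        print(pair)
```

The point of the hashing step is that only segments landing in the same bucket are compared, which avoids scoring every source segment against every target segment. The final cosine check then discards accidental bucket collisions. With real embeddings, translations have similar rather than identical vectors, so a practical system would likely use several hash tables or probe neighbouring buckets to avoid missing true pairs.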
