Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Krzysztof Wołk, Krzysztof Marasek

arXiv:1509.08881·cs.CL·September 30, 2015


TL;DR

This paper presents a novel methodology for mining truly parallel sentence pairs from subject-aligned comparable corpora, specifically Wikipedia, using web crawling, filtering techniques, and machine translation-based similarity measures.

Contribution

It introduces a web crawling approach for building subject-aligned corpora and a filtering method leveraging machine translation to extract high-quality parallel sentences.

Findings

01 Successfully built subject-aligned corpora from Wikipedia

02 Developed a filtering method that improves parallel sentence extraction

03 Enhanced machine translation systems with mined parallel data

Abstract

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences by filtering them out of noisy or merely comparable sentence pairs. We describe our implementation of a specialized tool for this task as well as the training and adaptation of a machine translation system that supplies our filter with additional information about the similarity…
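The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' tool: the `translate` callable stands in for whatever machine translation system is available, the token-level `difflib` ratio is a swapped-in similarity measure, and the function names and the `0.7` threshold are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple


def token_similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] based on matching tokens (assumed measure)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def mine_parallel_pairs(
    source_sentences: Iterable[str],
    target_sentences: Iterable[str],
    translate: Callable[[str], str],  # hypothetical MT system: source -> target language
    threshold: float = 0.7,           # assumed cut-off, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Keep (source, target, score) pairs whose translated source resembles a target."""
    targets = list(target_sentences)
    pairs = []
    for src in source_sentences:
        hypothesis = translate(src)
        best, best_score = None, 0.0
        # Compare the translation against every candidate target sentence.
        for tgt in targets:
            score = token_similarity(hypothesis, tgt)
            if score > best_score:
                best, best_score = tgt, score
        # Discard noisy or merely comparable pairs below the threshold.
        if best is not None and best_score >= threshold:
            pairs.append((src, best, best_score))
    return pairs


# Toy stand-in for an MT system: a lookup table of whole sentences.
toy_lexicon = {"kot śpi": "the cat sleeps", "pada deszcz": "it is raining"}
polish = ["kot śpi", "pada deszcz"]
english = ["the cat sleeps", "Warsaw is the capital of Poland"]
print(mine_parallel_pairs(polish, english, lambda s: toy_lexicon.get(s, s)))
```

In this toy run only the first Polish sentence has a close English counterpart, so the second is filtered out. A real pipeline would compare sentences within one subject-aligned article pair rather than across a whole corpus, which keeps the candidate set small.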
