Volctrans Parallel Corpus Filtering System for WMT 2020

Runxin Xu; Zhuo Zhi; Jun Cao; Mingxuan Wang; Lei Li

arXiv:2010.14029·cs.CL·October 28, 2020·1 cites

TL;DR

The paper presents Volctrans, a system for filtering and aligning parallel sentences in low-resource conditions, achieving top performance in the WMT20 shared task.

Contribution

Introduces a novel parallel corpus filtering system with iterative mining and XLM-based scoring, outperforming baselines in low-resource language pairs.

Findings

01

Outperforms the baseline by 3.x/2.x and 2.x/2.x on km-en and ps-en under the From Scratch/Fine-Tune conditions.

02

Achieves the highest scores among all submissions to the WMT20 shared task.

03

Effective in low-resource parallel corpus filtering and alignment.

Abstract

In this paper, we describe our submissions to the WMT20 shared task on parallel corpus filtering and alignment for low-resource conditions. The task requires the participants to align potential parallel sentence pairs out of the given document pairs, and score them so that low-quality pairs can be filtered. Our system, Volctrans, is made of two modules, i.e., a mining module and a scoring module. Based on the word alignment model, the mining module adopts an iterative mining strategy to extract latent parallel sentences. In the scoring module, an XLM-based scorer provides scores, followed by reranking mechanisms and ensemble. Our submissions outperform the baseline by 3.x/2.x and 2.x/2.x for km-en and ps-en on From Scratch/Fine-Tune conditions, which is the highest among all submissions.
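
The two-module design described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the best-target-per-source selection rule, the threshold, the plain-average ensemble, and the toy lexicon scorer that stands in for a real word alignment model or XLM-based scorer. The reranking step is not shown because the abstract does not specify it.

```python
from typing import Callable, Iterable, List, Set, Tuple

Pair = Tuple[str, str]                  # (source sentence, target sentence)
Scorer = Callable[[str, str], float]    # higher means "more likely parallel"


def iterative_mining(
    doc_pairs: Iterable[Tuple[List[str], List[str]]],
    seed: Set[Pair],
    fit_scorer: Callable[[Set[Pair]], Scorer],
    threshold: float,
    rounds: int = 3,
) -> Set[Pair]:
    """Refit a scorer on the pairs mined so far, re-mine, and repeat."""
    docs = list(doc_pairs)
    mined = set(seed)
    for _ in range(rounds):
        score = fit_scorer(mined)
        new_pairs: Set[Pair] = set()
        for src_doc, tgt_doc in docs:
            for src in src_doc:
                # Hypothetical rule: keep the best-scoring target per source sentence.
                best = max(tgt_doc, key=lambda tgt: score(src, tgt), default=None)
                if best is not None and score(src, best) >= threshold:
                    new_pairs.add((src, best))
        if new_pairs <= mined:          # nothing new was found, so stop early
            break
        mined |= new_pairs
    return mined


def ensemble_score(scorers: List[Scorer], src: str, tgt: str) -> float:
    """Average several scorers (a plain mean is an assumption, not the paper's method)."""
    return sum(s(src, tgt) for s in scorers) / len(scorers)


def toy_fit_scorer(pairs: Set[Pair]) -> Scorer:
    """Toy stand-in for a word alignment model: a positional word-pair lexicon."""
    lexicon = {(s, t) for src, tgt in pairs for s, t in zip(src.split(), tgt.split())}

    def score(src: str, tgt: str) -> float:
        src_words, tgt_words = src.split(), tgt.split()
        if not src_words or len(src_words) != len(tgt_words):
            return 0.0
        hits = sum((s, t) in lexicon for s, t in zip(src_words, tgt_words))
        return hits / len(src_words)

    return score


docs = [(["a b", "b c", "c d"], ["x y", "y z", "z w"])]
seed = {("a b", "x y")}
one_round = iterative_mining(docs, seed, toy_fit_scorer, threshold=0.5, rounds=1)
three_rounds = iterative_mining(docs, seed, toy_fit_scorer, threshold=0.5, rounds=3)
```

In this toy run, the pair `("c d", "z w")` is only recoverable in the second round, after `("b c", "y z")` has been mined and added to the lexicon, which is the intuition behind iterating the mining step.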

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification