Data Selection with Feature Decay Algorithms Using an Approximated Target Side

Alberto Poncelas; Gideon Maillette de Buy Wenniger; Andy Way

arXiv:1811.03039 · cs.CL · November 8, 2018 · 5 cites

TL;DR

This paper enhances data selection for neural machine translation by incorporating an approximated target side via pre-translation into Feature Decay Algorithms, improving translation quality over traditional source-only methods.

Contribution

The paper introduces the use of an approximated target side in FDA-based data selection, which leads to better translation performance.

Findings

01 Models with combined source and approximated target data outperform source-only models.

02 Significant BLEU score improvements of more than 1.5 points compared with training on the full data.

03 Statistically significant gains over strong FDA baselines.

Abstract

Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data. One family of data selection techniques is transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that the selected sentences have correct translations, or translations that are suitable for the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions. To address this problem, in this paper we propose to use an approximated target side in addition to the source side when selecting suitable sentence pairs for training a model.…
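
The selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration: features are unigrams and bigrams, every feature starts with value 1, a feature's value is multiplied by a hypothetical `decay` factor each time it is selected, and the source-side and target-side scores are simply added. The names `select_pairs`, `side_score` and `ngrams` are hypothetical, and `approx_target` stands for a pre-translation of the test set produced by some existing model.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all n-grams of the token list for n = 1..max_n, as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def side_score(tokens, seed_features, counts, decay):
    """Length-normalised sum of decayed values of the seed features in a sentence.

    Assumption: each feature starts at 1 and is multiplied by `decay`
    once for every time it has already been selected.
    """
    if not tokens:
        return 0.0
    total = sum(decay ** counts[f] for f in ngrams(tokens) if f in seed_features)
    return total / len(tokens)


def select_pairs(pairs, test_source, approx_target, k, decay=0.5):
    """Greedily pick k sentence pairs that overlap with the test set.

    pairs:         list of (source_sentence, target_sentence) training pairs
    test_source:   source-side test sentences
    approx_target: approximated target side (pre-translated test sentences)
    """
    src_feats = {f for s in test_source for f in ngrams(s.split())}
    tgt_feats = {f for s in approx_target for f in ngrams(s.split())}
    src_counts, tgt_counts = Counter(), Counter()
    remaining = list(range(len(pairs)))
    selected = []

    def score(i):
        # Assumption: the two sides are combined by adding their scores.
        src, tgt = pairs[i]
        return (side_score(src.split(), src_feats, src_counts, decay)
                + side_score(tgt.split(), tgt_feats, tgt_counts, decay))

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(pairs[best])
        src, tgt = pairs[best]
        # Decay the features just covered so later picks favour new n-grams.
        src_counts.update(f for f in ngrams(src.split()) if f in src_feats)
        tgt_counts.update(f for f in ngrams(tgt.split()) if f in tgt_feats)
    return selected
```

In this sketch, a training pair whose source matches the test set but whose target does not match the approximated target receives a lower score than a pair that matches on both sides, which is the kind of mistranslated pair the abstract is concerned with.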

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification