Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality

Alberto Poncelas; Jan Buts; James Hadley; Andy Way

arXiv:2011.14190·cs.CL·December 1, 2020

TL;DR

This paper proposes using multiple subword segmentations with Byte Pair Encoding to enhance English-Esperanto machine translation quality, especially in low-resource settings, and provides a new literary domain parallel dataset.

Contribution

It introduces a novel data augmentation technique using multiple subword splits and releases a new English-Esperanto literary parallel corpus.

Findings

01. Improved translation quality with multiple subword models

02. Effective data expansion for low-resource language pairs

03. Enhanced MT performance in literary domain

Abstract

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with high-resource languages. When data are scarce, it is of paramount importance to make optimal use of the limited material available. To that end, in this paper we propose employing the same parallel sentences multiple times, only changing the way the words are split each time. For this purpose we use several Byte Pair Encoding models, with various merge operations used in their configuration. In our experiments, we use this technique to expand the available data and improve an MT system involving a low-resource language pair, namely English-Esperanto. As an additional contribution, we made available a set of English-Esperanto parallel data in the…
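
The idea described in the abstract, reusing the same parallel sentences under several Byte Pair Encoding segmentations, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the toy BPE learner, the function names (`learn_bpe`, `apply_bpe`, `augment`), the `@@ ` continuation marker, the choice to segment both language sides with separately trained models, and the merge counts used in the example.

```python
from collections import Counter

END = "</w>"  # end-of-word marker used while learning and applying merges


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge operations from whitespace-tokenised sentences."""
    vocab = Counter()
    for sentence in sentences:
        for word in sentence.split():
            vocab[tuple(word) + (END,)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def apply_bpe(sentence, merges):
    """Segment a sentence by replaying the learned merges in order."""
    words = []
    for word in sentence.split():
        symbols = tuple(word) + (END,)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        symbols = list(symbols)
        if symbols[-1] == END:
            symbols.pop()  # the marker was never merged into a subword
        else:
            symbols[-1] = symbols[-1][: -len(END)]
        words.append("@@ ".join(symbols))
    return " ".join(words)


def augment(src_sentences, tgt_sentences, merge_settings):
    """Return one segmented copy of the parallel corpus per merge setting."""
    augmented = []
    for n in merge_settings:
        # Assumption: each side gets its own BPE model for every setting.
        src_merges = learn_bpe(src_sentences, n)
        tgt_merges = learn_bpe(tgt_sentences, n)
        for src, tgt in zip(src_sentences, tgt_sentences):
            augmented.append((apply_bpe(src, src_merges), apply_bpe(tgt, tgt_merges)))
    return augmented


if __name__ == "__main__":
    english = ["the cat sleeps", "the dog sleeps"]
    esperanto = ["la kato dormas", "la hundo dormas"]
    # Hypothetical merge counts; practical settings would be far larger.
    for src, tgt in augment(english, esperanto, merge_settings=[0, 5, 50]):
        print(src, "|||", tgt)
```

Each merge setting yields a differently segmented copy of the same sentence pairs, so with three settings the training set is three times its original size. In practice an established BPE toolkit would likely replace the toy learner; only the outer loop over merge settings reflects the augmentation idea itself.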

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems