JUMT at WMT2019 News Translation Task: A Hybrid approach to Machine Translation for Lithuanian to English

Sainik Kumar Mahata; Avishek Garain; Adityar Rayala; Dipankar Das; Sivaji Bandyopadhyay

arXiv:1908.01349·cs.CL·August 6, 2019

TL;DR

This paper describes a hybrid neural and statistical machine translation system for Lithuanian to English news translation, achieving a BLEU score of 17.6 at WMT 2019.

Contribution

It introduces a combined neural and statistical approach with post-editing for Lithuanian-English translation, a novel hybrid method for this language pair.

Findings

1. Achieved BLEU score of 17.6

2. Demonstrated effectiveness of hybrid translation approach

3. Provided detailed system architecture and module descriptions

Abstract

In the current work, we present a description of the system submitted to WMT 2019 News Translation Shared task. The system was created to translate news text from Lithuanian to English. To accomplish the given task, our system used a Word Embedding based Neural Machine Translation model to post edit the outputs generated by a Statistical Machine Translation model. The current paper documents the architecture of our model, descriptions of the various modules and the results produced using the same. Our system garnered a BLEU score of 17.6.

Tables

Table 1: Statistics of the Lithuanian-English parallel corpus provided by the organizers. "#" denotes "number of". "Lt" and "En" denote Lithuanian and English, respectively. "vocab" means the vocabulary of unique tokens.

Statistic	Count
# sentences in Lt corpus	9,62,022
# sentences in En corpus	9,62,022
# words in Lt corpus	1,16,65,937
# words in En corpus	1,56,22,488
# word vocab size for Lt corpus	4,88,593
# word vocab size for En corpus	2,27,131

Table 2: Accuracy scores calculated using various automated evaluation metrics.

Metric	Score
BLEU	17.6
BLEU-cased	16.6
TER	0.762
BEER 2.0	0.497
CharacTER	0.718

Equations

$$X = \{x_1, x_2, \ldots, x_n\}$$

$$Y = \{y_1, y_2, \ldots, y_m\}$$

$$h_t = \vec{f}_{enc}(E_x(x_t), h_{t-1})$$

$$s_t = f_{dec}(E_y(y_{t-1}), s_{t-1}, c_t)$$

$$p(y_t = k \mid y_{<t}, X) = \frac{1}{Z} \exp(out_k(E_y(y_{t-1}), s_t, c_t))$$

$$Z = \sum_j \exp(out_j(E_y(y_{t-1}), s_t, c_t))$$

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_y^n} \log p(y_t = y_t^n \mid y_{<t}^n, X^n)$$

Full text

JUMT at WMT2019 News Translation Task: A Hybrid approach to Machine Translation for Lithuanian to English

Sainik Kumar Mahata, Avishek Garain, Adityar Rayala, Dipankar Das, Sivaji Bandyopadhyay

Computer Science and Engineering

Jadavpur University, Kolkata, India

1 Introduction

Machine Translation (MT) is the automated translation of one natural language into another using a computer. Translation itself is a difficult task for humans and computers alike, as it requires a thorough understanding of the syntax and semantics of both languages under consideration. To produce good translations, an MT system needs a parallel corpus of good quality and sufficient size (Mahata et al., 2016, 2017).

In the modern context, MT systems can be categorized into Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). SMT has played its part in making MT popular among the masses. It involves building statistical models whose parameters are derived from the analysis of bilingual text corpora created by professional translators (Weaver, 1955). The state of the art for SMT is the Moses Toolkit (http://www.statmt.org/moses/), created by Koehn et al. (2007), which incorporates subcomponents such as language model generation, word alignment and phrase table generation. Various works have been done in SMT (Lopez, 2008; Koehn, 2009), and it has shown good results for many language pairs.

NMT (Bahdanau et al., 2014), on the other hand, though relatively new, has shown considerable improvements in translation results when compared to SMT (Mahata et al., 2018b). These include better fluency of the output and better handling of the out-of-vocabulary problem. Unlike SMT, it does not depend on alignment and phrasal unit translations (Kalchbrenner and Blunsom, 2013). Instead, it uses an encoder-decoder approach incorporating recurrent neural cells (Cho et al., 2014). As a result, when given a sufficient amount of training data, it gives much more accurate results than SMT (Doherty et al., 2010; Vaswani et al., 2013; Liu et al., 2014).

For the given task (http://www.statmt.org/wmt19/translation-task.html), we attempted to create an MT system that translates sentences from Lithuanian to English. Since using only an SMT model or only an NMT model has its own disadvantages, we used both in a pipeline. This leads to an improvement of the results over using either SMT or NMT alone. The main idea was to train an SMT model to translate Lithuanian into English. A test set was then translated using this model. Finally, a word embedding based NMT model was trained to learn the mapping between the SMT output (in English) and the gold standard data (in English).

The organizers provided the required parallel corpus, consisting of 9,62,022 sentence pairs, for training the translation model. Of these, 7,62,022 pairs were used to train the SMT system, and 2,00,000 pairs were used to test the SMT system and then train the NMT system. The statistics of the parallel corpus are shown in Table 1.

The remainder of the paper is organized as follows. Section 2 describes the methodology for creating the SMT and NMT models, including the preprocessing steps, a brief summary of the encoder-decoder approach and the architecture of our system. This is followed by the results and the conclusion in Sections 3 and 4, respectively.

2 Methodology

2.1 SMT

Before training the SMT model, we applied some standard preprocessing steps to the 7,62,022 training sentence pairs, which are discussed below.

2.1.1 Preprocessing

The following steps were applied to preprocess and clean the data before using it to train our statistical machine translation model. We used the NLTK toolkit (https://www.nltk.org/) to perform these steps.

• Tokenization: Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens. In our case, these tokens were words, punctuation marks and numbers. NLTK supports tokenization of Lithuanian as well as English texts.

• Truecasing: This refers to the process of restoring case information to badly-cased or non-cased text (Lita et al., 2003). Truecasing helps in reducing data sparsity.

• Cleaning: Long sentences (number of tokens $> 80$) were removed.
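
The cleaning step can be sketched as a simple length filter. This is a minimal illustration, not the authors' script: the function name `clean_parallel_corpus` is hypothetical, the input is assumed to be already tokenized, and dropping a pair when either side is too long (or empty) is an assumption, since the paper only states the 80-token limit.

```python
def clean_parallel_corpus(pairs, max_tokens=80):
    """Drop sentence pairs in which either side has more than max_tokens tokens.

    pairs: iterable of (source_tokens, target_tokens), already tokenized.
    """
    kept = []
    for src_tokens, tgt_tokens in pairs:
        # Assumption: empty sides are dropped too (not stated in the paper).
        if not src_tokens or not tgt_tokens:
            continue
        # Assumption: the limit is applied to both sides of the pair.
        if len(src_tokens) > max_tokens or len(tgt_tokens) > max_tokens:
            continue
        kept.append((src_tokens, tgt_tokens))
    return kept


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer here.
    toy = [("labas rytas".split(), "good morning".split()),
           (["w"] * 81, ["v"] * 5)]
    print(len(clean_parallel_corpus(toy)))
```

Filtering on both sides keeps the two halves of the corpus aligned line by line, which the later training steps rely on.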

2.1.2 Moses

Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair, when trained with a large collection of translated texts (parallel corpus). Once the model has been trained, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices.

We trained Moses on the 7,62,022 sentence pairs provided by WMT2019, with Lithuanian as the source language and English as the target language. To build the language model, we used KenLM (https://kheafield.com/code/kenlm/; Heafield, 2011) with 7-grams from the target corpus. The English monolingual corpus from WMT2019 was used to build the language model.

Training the Moses statistical MT system generated a Phrase Model and a Translation Model, which help in translating between the source and target languages. Moses scores the phrases in the phrase table with respect to a given source sentence and produces the best-scoring phrases as output.

2.2 NMT

Neural machine translation (NMT) is an approach to machine translation that uses neural networks to predict the likelihood of a sequence of words. The main functionality of NMT is based on the sequence to sequence (seq2seq) architecture, which is described in Section 2.2.1.

2.2.1 Sequence to Sequence Model

Sequence to sequence learning is a neural network technique for mapping one sequence to another. It takes as input a sequence of tokens (words, in our case)

$$X = \{x_1, x_2, \ldots, x_n\}$$

and tries to generate the target sequence as output

$$Y = \{y_1, y_2, \ldots, y_m\}$$

where $x_i$ and $y_i$ are the input and target symbols, respectively.

Sequence to Sequence architecture consists of two parts, an Encoder and a Decoder.

The encoder takes a variable-length sequence as input and encodes it into a fixed-length vector, which is meant to summarize the meaning of the sequence while taking its context into account. A Long Short Term Memory (LSTM) cell was used to achieve this. The unidirectional encoder reads the words of the Lithuanian text as a sequence from one end to the other (left to right, in our case),

$$h_t = \vec{f}_{enc}(E_x(x_t), h_{t-1})$$

Here, $E_x$ is the input embedding lookup table (dictionary) and $\vec{f}_{enc}$ is the transfer function of the LSTM recurrent unit. The cell state $h$ and the context vector $C$ are constructed and passed on to the decoder.

The decoder takes as input the context vector $C$ and the cell state $h$ from the encoder, and computes the hidden state at time $t$ as

$$s_t = f_{dec}(E_y(y_{t-1}), s_{t-1}, c_t)$$

Subsequently, a parametric function $out_k$ returns the conditional probability that the next target symbol is $k$:

$$p(y_t = k \mid y_{<t}, X) = \frac{1}{Z} \exp(out_k(E_y(y_{t-1}), s_t, c_t))$$

$Z$ is the normalizing constant,

$$Z = \sum_j \exp(out_j(E_y(y_{t-1}), s_t, c_t))$$

The entire model can be trained end-to-end by minimizing the negative log-likelihood, which is defined as

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_y^n} \log p(y_t = y_t^n \mid y_{<t}^n, X^n)$$

where $N$ is the number of sentence pairs, and $X^n$ and $y_t^n$ are the input sentence and the $t$-th target symbol in the $n$-th pair, respectively.
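
The normalization by $Z$ and the training loss can be illustrated with a small pure-Python sketch. The function names are hypothetical, and the raw scores stand in for the values of $out_j(\cdot)$ that a trained decoder would produce; this is not the authors' training code.

```python
import math


def softmax(scores):
    """Convert raw scores out_j(...) into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow; probabilities are unchanged
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # normalizing constant (of the shifted scores)
    return [e / z for e in exps]


def negative_log_likelihood(batch_scores, batch_targets):
    """Sum -log p(y_t = target) over all time steps, averaged over N sentences.

    batch_scores[n][t] is the list of vocabulary scores at step t of sentence n.
    batch_targets[n][t] is the index of the correct symbol at that step.
    """
    total = 0.0
    for sentence_scores, targets in zip(batch_scores, batch_targets):
        for step_scores, target_index in zip(sentence_scores, targets):
            total -= math.log(softmax(step_scores)[target_index])
    return total / len(batch_scores)


if __name__ == "__main__":
    # One sentence, two steps, toy vocabulary of three symbols.
    scores = [[[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]]
    targets = [[0, 1]]
    print(negative_log_likelihood(scores, targets))
```

The loss falls as the model assigns more probability to the correct symbol at each step, which is why minimizing it trains the encoder and decoder jointly.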

The input to the decoder was a one-hot tensor (word-level embeddings) of the 2,00,000 English sentences, while the target data was identical but offset by one time-step ahead.
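
The one-step offset between decoder input and target can be sketched as follows. The start and end markers `<s>` and `</s>` and the function name are assumptions for illustration; the paper does not say which boundary symbols were used.

```python
def make_decoder_pairs(target_tokens, start="<s>", end="</s>"):
    """Return (decoder_input, decoder_target) for one target sentence.

    decoder_target is decoder_input shifted one time-step ahead, so at step t
    the decoder sees token t-1 and is trained to predict token t.
    """
    decoder_input = [start] + list(target_tokens)   # assumed start marker
    decoder_target = list(target_tokens) + [end]    # assumed end marker
    return decoder_input, decoder_target


if __name__ == "__main__":
    dec_in, dec_out = make_decoder_pairs(["the", "cat", "sat"])
    for seen, predicted in zip(dec_in, dec_out):
        print(seen, "->", predicted)
```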

2.3 Architecture

2.3.1 Training

For training, 7,62,022 preprocessed Lithuanian-English sentence pairs were fed to the Moses Toolkit. This produced an SMT model with Lithuanian as the source language and English as the target language. The Lithuanian side of the remaining 2,00,000 sentence pairs was then given as input to the SMT model, which produced 2,00,000 translated English sentences as output. These 2,00,000 translated English sentences and the corresponding 2,00,000 gold standard English sentences were given as input to a word embedding based NMT model. Together, the SMT and NMT models constitute our hybrid model.
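
The data flow of the hybrid model can be sketched with placeholder callables. Everything named here is hypothetical: `smt_translate` and `nmt_post_edit` stand in for the trained Moses and NMT models, and holding out the last pairs of the corpus is an assumption, since the paper does not say which 2,00,000 pairs were used.

```python
def split_corpus(pairs, n_post_edit):
    """Split the corpus into SMT training pairs and held-out post-editing pairs."""
    cut = len(pairs) - n_post_edit  # assumption: the held-out pairs are the last ones
    return pairs[:cut], pairs[cut:]


def build_post_edit_data(smt_translate, held_out_pairs):
    """Pair each SMT draft (English) with its gold English reference."""
    return [(smt_translate(source), gold) for source, gold in held_out_pairs]


def hybrid_translate(smt_translate, nmt_post_edit, source_sentence):
    """Translate with the SMT model, then let the NMT model post-edit the draft."""
    return nmt_post_edit(smt_translate(source_sentence))


if __name__ == "__main__":
    corpus = [("lt-%d" % i, "en-%d" % i) for i in range(10)]
    smt_train, held_out = split_corpus(corpus, 3)
    toy_smt = lambda s: s.replace("lt", "draft")   # stand-in for Moses
    toy_nmt = lambda s: s.replace("draft", "en")   # stand-in for the NMT post-editor
    print(build_post_edit_data(toy_smt, held_out))
    print(hybrid_translate(toy_smt, toy_nmt, "lt-42"))
```

With the paper's figures, `split_corpus` would be called on 9,62,022 pairs with `n_post_edit` set to 2,00,000, leaving 7,62,022 pairs for SMT training.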

2.3.2 Testing

For testing, 10k Lithuanian sentences were fed to the hybrid model. The output, when evaluated using BLEU (Papineni et al., 2002), yielded a score of 21.6. The training and testing architecture is shown in Figure 1.
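
As a rough illustration of how a BLEU score is computed, the sketch below implements corpus-level BLEU with one reference per sentence, uniform weights over 1- to 4-grams and no smoothing. It is a simplified stand-in, not the scorer used by the authors or by the WMT organizers, and it returns a value in [0, 1], whereas the paper reports scores on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU for tokenized hypotheses, one reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as the reference has it.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["the cat sat on the mat today".split()]
    ref = ["the cat sat on the mat yesterday".split()]
    print(100 * corpus_bleu(hyp, ref))
```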

3 Results

WMT2019 provided us with a test set of Lithuanian sentences in .SGM format. This file was parsed and fed to our hybrid system. The output file was again converted to .SGM format and submitted to the organizers. Our system garnered a BLEU Score of 17.6, when it was scored using automated accuracy metrics. Other accuracy scores are mentioned in Table 2.
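
Reading the segments out of an SGM file and wrapping translations back into segments can be sketched with a regular expression. This is a minimal illustration assuming the common `<seg id="N">...</seg>` layout; the function names are hypothetical, and the surrounding document and set tags of a real submission file are not handled here.

```python
import re
from html import escape, unescape

# Assumption: each sentence sits in a <seg id="N">...</seg> element.
SEG_PATTERN = re.compile(r'<seg id="(\d+)">(.*?)</seg>', re.DOTALL)


def read_sgm_segments(sgm_text):
    """Return the text of every <seg> element, in file order."""
    return [unescape(body.strip()) for _, body in SEG_PATTERN.findall(sgm_text)]


def write_seg(seg_id, text):
    """Wrap one translated sentence in a <seg> element, escaping &, < and >."""
    return '<seg id="%d">%s</seg>' % (seg_id, escape(text, quote=False))


if __name__ == "__main__":
    sample = '<doc docid="example">\n<seg id="1">Labas rytas.</seg>\n</doc>'
    for i, sentence in enumerate(read_sgm_segments(sample), start=1):
        print(write_seg(i, sentence))
```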

4 Conclusion

The paper presents the working of the translation system submitted to WMT 2019 News Translation shared task. We have used Word Embedding based NMT on top of SMT, for our proposed system. We have used a single LSTM layer as an encoder as well as a decoder. As a future prospect, we plan to use more LSTM layers in our model. We plan to create another model that incrementally trains both the SMT and NMT systems in a pipeline to improve the translation quality.

Acknowledgement

The reported work is supported by Media Lab Asia, MeitY, Government of India, under the Visvesvaraya PhD Scheme for Electronics & IT.

Bibliography

The reference list from the paper itself.

1. Bahdanau et al. (2014). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
2. Bojar et al. (2018). Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.
3. Cho et al. (2014). Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
4. Doherty et al. (2010). Stephen Doherty, Sharon O'Brien, and Michael Carl. 2010. Eye tracking as an MT evaluation technique. Machine Translation, 24(1):1–13.
5. Heafield (2011). Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom.
6. Kalchbrenner and Blunsom (2013). Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, volume 3, page 413.
7. Koehn (2009). Philipp Koehn. 2009. Statistical machine translation. Cambridge University Press.
8. Koehn et al. (2007). Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.