The Impact of Preprocessing on Arabic-English Statistical and Neural Machine Translation
Mai Oudah, Amjad Almahairi, Nizar Habash

TL;DR
This paper systematically compares how different tokenization techniques affect Arabic-English statistical and neural machine translation, revealing that optimal choices depend on model type and data size, and that combining models improves results.
Contribution
It provides a systematic analysis of how tokenization affects both neural and statistical MT for Arabic-English, showing that the best scheme depends on the model type and the amount of training data.
Findings
The best tokenization scheme depends on the type of model (neural or statistical) and the size of the training data.
A system selection that combines neural and statistical MT output yields significant improvements.
Abstract
Neural networks have become the state-of-the-art approach for machine translation (MT) in many languages. While linguistically-motivated tokenization techniques were shown to have significant effects on the performance of statistical MT, it remains unclear if those techniques are well suited for neural MT. In this paper, we systematically compare neural and statistical MT models for Arabic-English translation on data preprocessed by various prominent tokenization schemes. Furthermore, we consider a range of data and vocabulary sizes and compare their effect on both approaches. Our empirical results show that the best choice of tokenization scheme is largely based on the type of model and the size of data. We also show that we can gain significant improvements using a system selection that combines the output from neural and statistical MT.
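The abstract does not say how the system selection is made, so the following is only a minimal sketch of the general idea: for each source sentence, keep whichever of the two candidate translations a scoring function prefers. The names `select_system` and `toy_score` are hypothetical, and the scorer is a deliberately simple placeholder (it penalizes immediately repeated words), not the criterion used in the paper.

```python
from typing import Callable, List, Tuple


def select_system(
    nmt_outputs: List[str],
    smt_outputs: List[str],
    score: Callable[[str], float],
) -> List[Tuple[str, str]]:
    """Pick, per sentence, the hypothesis with the higher score.

    Returns (system_label, hypothesis) pairs. Ties go to the NMT output,
    which is an arbitrary choice made for this sketch.
    """
    if len(nmt_outputs) != len(smt_outputs):
        raise ValueError("Both systems must translate the same sentences.")

    selected = []
    for nmt_hyp, smt_hyp in zip(nmt_outputs, smt_outputs):
        if score(nmt_hyp) >= score(smt_hyp):
            selected.append(("nmt", nmt_hyp))
        else:
            selected.append(("smt", smt_hyp))
    return selected


def toy_score(hypothesis: str) -> float:
    """Placeholder scorer (assumption): penalize adjacent repeated tokens."""
    tokens = hypothesis.split()
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return -float(repeats)


nmt = ["the the cat sat", "he went home"]
smt = ["the cat sat", "he home went went"]
result = select_system(nmt, smt, toy_score)
```

In practice the scorer would be replaced by something learned or model-based, such as a language-model score or a quality-estimation classifier; the selection loop itself stays the same.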
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
