scb-mt-en-th-2020: A Large English-Thai Parallel Corpus
Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, and Sarana Nutanong

TL;DR
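This paper presents a large-scale English-Thai parallel corpus of over 1 million segment pairs. Machine translation models trained on it perform comparably to the Google Translation API (as of May 2020) for Thai-English, and outperform it in both directions when the Open Parallel Corpus (OPUS) is added to the training data.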
Contribution
A publicly available English-Thai parallel dataset of over 1 million segment pairs, released with pre-trained models and source code, together with a demonstration that models trained on it are competitive with the Google Translation API.
Findings
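Models trained on this dataset are comparable to the Google Translation API (as of May 2020) for Thai-English, and outperform it for both Thai-English and English-Thai when OPUS is added to the training data.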
The dataset draws on news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, and government documents.
Reproducible methodology for data collection and noise removal is provided.
Abstract
The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, and government documents. The methodology for gathering data, building parallel texts, and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance is comparable to that of the Google Translation API (as of May 2020) for Thai-English, and the models outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
