Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021
Séamus Lankford, Haithem Afli, Andy Way

TL;DR
This paper develops and evaluates domain-specific machine translation models for Covid-related English-Irish translation, demonstrating significant BLEU score improvements through dataset extension and domain adaptation techniques.
Contribution
It introduces an English-Irish dataset of Covid-related data from the Health and Education domains, and compares fine-tuning, mixed fine-tuning and combined dataset approaches with models trained on an extended in-domain dataset.
Findings
Extending the 8k-line in-domain baseline dataset by 5k lines improved the BLEU score by 27 points.
The highest-performing model used a Transformer architecture trained on an extended Covid dataset.
Domain adaptation with a Covid-adapted generic 55k corpus was also evaluated, but the best result came from the extended in-domain dataset.
Abstract
Translation models for the specific domain of translating Covid data from English to Irish were developed for the LoResMT 2021 shared task. Domain adaptation techniques, using a Covid-adapted generic 55k corpus from the Directorate General of Translation, were applied. Fine-tuning, mixed fine-tuning and combined dataset approaches were compared with models trained on an extended in-domain dataset. As part of this study, an English-Irish dataset of Covid-related data, from the Health and Education domains, was developed. The highest-performing model used a Transformer architecture trained with an extended in-domain Covid dataset. In the context of this study, we have demonstrated that extending an 8k in-domain baseline dataset by just 5k lines improved the BLEU score by 27 points.
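The abstract names three ways of combining generic and in-domain data but does not spell out how each one schedules training. The sketch below shows one common reading of those terms, expressed as the sequence of training stages each approach would feed to a model. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the `Pair` type, and the oversampling rule in `mixed_fine_tuning` are all hypothetical, and no actual model training is performed.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (English source line, Irish target line)


def combined(in_domain: List[Pair], generic: List[Pair]) -> List[List[Pair]]:
    """Combined dataset: a single training stage on the concatenated corpora."""
    return [list(generic) + list(in_domain)]


def fine_tuning(in_domain: List[Pair], generic: List[Pair]) -> List[List[Pair]]:
    """Fine-tuning: train on generic data first, then continue on in-domain data only."""
    return [list(generic), list(in_domain)]


def mixed_fine_tuning(
    in_domain: List[Pair], generic: List[Pair], seed: int = 0
) -> List[List[Pair]]:
    """Mixed fine-tuning: train on generic data, then continue on a shuffled mix
    of generic data and oversampled in-domain data.

    Assumption: the in-domain corpus is repeated a whole number of times so that
    it roughly matches the generic corpus in size.
    """
    factor = max(1, len(generic) // max(1, len(in_domain)))
    stage_two = list(generic) + list(in_domain) * factor
    random.Random(seed).shuffle(stage_two)
    return [list(generic), stage_two]


# Toy corpora with the same 55:8 size ratio as the 55k generic and 8k in-domain sets.
generic_pairs = [(f"generic en {i}", f"generic ga {i}") for i in range(55)]
covid_pairs = [(f"covid en {i}", f"covid ga {i}") for i in range(8)]

for name, plan in [
    ("combined", combined(covid_pairs, generic_pairs)),
    ("fine-tuning", fine_tuning(covid_pairs, generic_pairs)),
    ("mixed fine-tuning", mixed_fine_tuning(covid_pairs, generic_pairs)),
]:
    print(name, [len(stage) for stage in plan])
```

Each function returns a list of stages, so a training loop could iterate over them in order and carry the model weights from one stage to the next. Training on the extended in-domain dataset alone, which the abstract reports as the best-performing setup, would correspond to a single stage containing only the Covid pairs.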
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Dropout · Multi-Head Attention · Softmax · Dense Connections · Label Smoothing · Adam
