Benchmarking Azerbaijani Neural Machine Translation

Chih-Chen Chen; William Chen

arXiv:2207.14473 · cs.CL · August 1, 2022

TL;DR

This paper benchmarks Azerbaijani-English neural machine translation, comparing segmentation techniques and performance across text domains. It finds that Unigram segmentation improves translation performance and that models scale better with dataset quality than with quantity, but cross-domain generalization remains difficult.

Contribution

The paper provides the first comprehensive benchmark of Azerbaijani NMT, comparing segmentation methods and measuring domain generalization, and identifies segmentation choice and dataset quality as key factors in translation quality.

Findings

01. Unigram segmentation improves NMT performance

02. Model scaling benefits more from data quality than quantity

03. Cross-domain generalization remains a challenge

Abstract

Little research has been done on Neural Machine Translation (NMT) for Azerbaijani. In this paper, we benchmark the performance of Azerbaijani-English NMT systems on a range of techniques and datasets. We evaluate which segmentation techniques work best on Azerbaijani translation and benchmark the performance of Azerbaijani NMT models across several domains of text. Our results show that while Unigram segmentation improves NMT performance and Azerbaijani translation models scale better with dataset quality than quantity, cross-domain generalization remains a challenge.
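To make "Unigram segmentation" concrete, here is a minimal sketch of the decoding step of a unigram language model segmenter: given a subword vocabulary with probabilities, it picks the split of a word whose pieces have the highest joint probability. The vocabulary, its probabilities, and the function name `unigram_segment` are assumptions made for illustration; they are not taken from the paper, which does not describe its tooling in this summary. Training the vocabulary itself is not shown.

```python
import math

# Hypothetical subword vocabulary with made-up probabilities (illustrative only,
# not from the paper). A trained unigram model would learn these from data.
TOY_VOCAB = {
    "kitab": 0.05, "lar": 0.04, "da": 0.04,
    "k": 0.01, "i": 0.01, "t": 0.01, "a": 0.01,
    "b": 0.01, "l": 0.01, "r": 0.01, "d": 0.01,
}


def unigram_segment(word, vocab):
    """Return the most probable split of `word` into vocabulary pieces.

    Uses Viterbi dynamic programming: best[i] holds the highest log-probability
    of any segmentation of word[:i], and back[i] the start of its last piece.
    """
    log_p = {piece: math.log(p) for piece, p in vocab.items()}
    n = len(word)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0  # the empty prefix has probability 1

    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_p and best[start] + log_p[piece] > best[end]:
                best[end] = best[start] + log_p[piece]
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")

    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


print(unigram_segment("kitablarda", TOY_VOCAB))  # ['kitab', 'lar', 'da']
```

With this toy vocabulary, the Azerbaijani word "kitablarda" ("in the books") splits into a stem plus its plural and locative suffixes, because three moderately likely pieces score higher than ten single characters. Whether a trained segmenter recovers such morpheme-like boundaries depends on the training data and vocabulary size.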

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Unigram Segmentation