Sicilian Translator: A Recipe for Low-Resource NMT

Eryk Wdowiak

arXiv:2110.01938 · cs.CL · October 6, 2021 · 1 citation

TL;DR

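This paper presents a Sicilian-English neural machine translation system trained on about 17,000 sentence pairs. Small Transformer models with small subword vocabularies and high dropout reach BLEU scores in the upper 20s; backtranslation, multilingual translation, and a subword vocabulary biased toward textbook desinences raise them into the mid 30s.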

Contribution

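It describes a low-resource NMT recipe for Sicilian: small Transformer models with high dropout, a dataset augmented with backtranslation and multilingual translation, and theoretical information added to the data in the form of textbook exercises and a subword vocabulary biased toward textbook desinences.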

Findings

01

BLEU scores reached the upper 20s with small Transformer models, small subword vocabularies, and high dropout.

02

Scores improved into the mid 30s after the dataset was supplemented with backtranslation and multilingual translation.

03

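The authors also attribute their success to theoretical information in the dataset: a subword vocabulary biased toward textbook desinences (inflectional endings) and the inclusion of textbook exercises.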
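To make the idea of a biased subword vocabulary concrete, here is a minimal sketch of byte pair encoding (BPE) in which a list of word endings is added to the word-frequency table with extra weight before the merges are learned. This is an illustration under stated assumptions, not the authors' procedure: the function names, the `boost` parameter, the toy corpus, and the choice of endings are all hypothetical, and the paper's actual tooling may work differently.

```python
from collections import Counter
from typing import Dict, List, Tuple

Merge = Tuple[str, str]
EOW = "</w>"  # end-of-word marker, so a word-final ending differs from a word-internal string


def merge_pair(symbols: Tuple[str, ...], pair: Merge) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Merge]:
    """Learn up to `num_merges` BPE merges from a word-frequency table."""
    vocab = Counter({tuple(word) + (EOW,): freq for word, freq in word_freqs.items()})
    merges: List[Merge] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def biased_word_freqs(corpus_freqs: Dict[str, int], endings: List[str], boost: int) -> Dict[str, int]:
    """Hypothetical biasing step: count each ending as a pseudo-word seen `boost` times."""
    freqs = Counter(corpus_freqs)
    for ending in endings:
        freqs[ending] += boost
    return dict(freqs)


def segment(word: str, merges: List[Merge]) -> List[str]:
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word) + (EOW,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy word counts and endings, chosen only for illustration.
    corpus = {"parramu": 3, "parrati": 2, "manciamu": 2, "manciati": 1}
    endings = ["amu", "ati"]
    merges = learn_bpe(biased_word_freqs(corpus, endings, boost=100), num_merges=6)
    for word in corpus:
        print(word, segment(word, merges))
```

Because the boosted endings dominate the pair counts, the earliest merges assemble the endings as whole word-final symbols, so inflected forms tend to split into a stem part and an ending rather than at arbitrary points.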

Abstract

With 17,000 pairs of Sicilian-English translated sentences, Arba Sicula developed the first neural machine translator for the Sicilian language. Using small subword vocabularies, we trained small Transformer models with high dropout parameters and achieved BLEU scores in the upper 20s. Then we supplemented our dataset with backtranslation and multilingual translation and pushed our scores into the mid 30s. We also attribute our success to incorporating theoretical information in our dataset. Prior to training, we biased the subword vocabulary towards the desinences one finds in a textbook. And we included textbook exercises in our dataset.
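The dataset augmentation mentioned in the abstract can be sketched as follows. Backtranslation pairs genuine target-side sentences with synthetic source sentences produced by a reverse-direction model, and a common way to train one model on several language pairs is to prefix each source sentence with a token naming the target language. Everything specific here is an assumption made for illustration: the `<2en>` tag format, the function names, and the `reverse_translate` callable, which stands in for a trained reverse model. The abstract does not specify these details.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def tag_pairs(pairs: Iterable[Pair], target_lang: str) -> List[Pair]:
    """Prefix each source sentence with a target-language token (hypothetical format)."""
    return [(f"<2{target_lang}> {src}", tgt) for src, tgt in pairs]


def back_translate(monolingual_target: Iterable[str],
                   reverse_translate: Callable[[str], str]) -> List[Pair]:
    """Pair each genuine target sentence with a synthetic source sentence."""
    return [(reverse_translate(tgt), tgt) for tgt in monolingual_target]


def build_training_set(parallel: Iterable[Pair],
                       monolingual_target: Iterable[str],
                       reverse_translate: Callable[[str], str],
                       target_lang: str) -> List[Pair]:
    """Combine human-translated pairs with backtranslated pairs, then tag the sources."""
    synthetic = back_translate(monolingual_target, reverse_translate)
    return tag_pairs(list(parallel) + synthetic, target_lang)


if __name__ == "__main__":
    # Placeholder reverse model: a real system would call a trained
    # English-to-Sicilian translator here.
    def fake_reverse(sentence: str) -> str:
        return f"[synthetic source for: {sentence}]"

    parallel = [("Bon jornu.", "Good morning.")]
    monolingual_english = ["The book is on the table."]
    for src, tgt in build_training_set(parallel, monolingual_english, fake_reverse, "en"):
        print(src, "=>", tgt)
```

The target side of every pair is human-written text, so the model learns to produce fluent output even when the synthetic source side is noisy.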

Code & Models

Repositories

ewdowiak/Sicilian_Translator
mxnet · Official

Models

gngpostalsrvc/BERiT
model · 40 downloads

Datasets

Napizia/Good-Sicilian-in-NLLB
dataset · 51 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Softmax