Sicilian Translator: A Recipe for Low-Resource NMT
Eryk Wdowiak

TL;DR
This paper presents a neural machine translation system for Sicilian-English trained on about 17,000 sentence pairs. Subword vocabulary biasing and data augmentation (backtranslation and multilingual translation) raise BLEU scores from the upper 20s to the mid 30s.
Contribution
It introduces a low-resource NMT approach for Sicilian that combines dataset augmentation, a subword vocabulary biased towards textbook desinences, and textbook exercises added to the training data.
Findings
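BLEU scores reached the upper 20s with small Transformer models, small subword vocabularies, and high dropout.
Scores improved into the mid 30s after the dataset was supplemented with backtranslation and multilingual translation.
The authors also attribute their results to biasing the subword vocabulary towards textbook desinences and to including textbook exercises in the dataset.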
Abstract
With 17,000 pairs of Sicilian-English translated sentences, Arba Sicula developed the first neural machine translator for the Sicilian language. Using small subword vocabularies, we trained small Transformer models with high dropout parameters and achieved BLEU scores in the upper 20s. Then we supplemented our dataset with backtranslation and multilingual translation and pushed our scores into the mid 30s. We also attribute our success to incorporating theoretical information in our dataset. Prior to training, we biased the subword vocabulary towards the desinences one finds in a textbook. And we included textbook exercises in our dataset.
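The abstract does not spell out how the subword vocabulary was biased towards desinences. One plausible reading, sketched here as an assumption and not as the authors' procedure, is to up-weight the textbook endings in the word counts that a byte pair encoding (BPE) learner sees, so that each ending is merged into a single subword early on. The helper names `learn_bpe` and `bias_counts`, the weight of 50, and the toy corpus are all hypothetical; the endings "amu" and "ati" are used only as illustrative Sicilian verb desinences.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merge rules from a {word: count} mapping."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the chosen merge to every word.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges


def bias_counts(word_freqs, desinences, weight):
    """Return a copy of the counts with each desinence up-weighted.

    Assumption for illustration: adding the endings as heavily weighted
    pseudo-words is one simple way to bias the learned merges.
    """
    biased = Counter(word_freqs)
    for ending in desinences:
        biased[ending] += weight
    return biased


# Toy corpus and endings, chosen only to illustrate the idea.
corpus = {"parramu": 1, "parrati": 1, "manciamu": 1, "cantu": 1}
desinences = ["amu", "ati"]

merges = learn_bpe(bias_counts(corpus, desinences, weight=50), num_merges=4)
learned_subwords = {a + b for a, b in merges}
print(learned_subwords)
```

With the endings up-weighted, the first few merges build the desinences themselves, so a word such as "parramu" is more likely to split at the boundary between stem and ending. A production system would normally rely on an established BPE tool instead of this hand-rolled learner.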
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Softmax
