Ukrainian-to-English folktale corpus: Parallel corpus creation and augmentation for machine translation in low-resource languages
Olena Burda-Lassen

TL;DR
This paper introduces a new Ukrainian-English folktale parallel corpus, enhancing resources for machine translation in low-resource languages by combining domain-specific curation and augmentation methods.
Contribution
It presents a novel Ukrainian-English folktale corpus with alignment and augmentation strategies tailored for machine translation in low-resource settings.
Findings
The corpus is word- and sentence-aligned to preserve meaning across the two languages.
Augmentation increases the size and diversity of the dataset.
The resource is tailored for use as training data for machine translation in low-resource languages.
Abstract
Folktales are linguistically very rich and culturally significant in understanding the source language. Historically, only human translation has been used for translating folklore. Therefore, translated texts are very sparse, which limits access to knowledge about cultural traditions and customs. We have created a new Ukrainian-to-English parallel corpus of familiar Ukrainian folktales based on available English translations and suggested several new ones. We offer a combined domain-specific approach to building and augmenting this corpus, considering the nature of the domain and differences in the purpose of human versus machine translation. Our corpus is word- and sentence-aligned, allowing for the best curation of meaning, specifically tailored for use as training data for machine translation models.
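To make "word- and sentence-aligned" concrete, here is a minimal sketch of how one entry in such a corpus might be represented. It is not the paper's actual data format: the `AlignedPair` class, the whitespace tokenization, and the example sentence are illustrative assumptions, and the sentence is invented rather than taken from the corpus. Word links use the common "i-j" index notation, where source token i is linked to target token j.

```python
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """One sentence-aligned pair with word-level links (hypothetical layout)."""
    source: str     # Ukrainian sentence
    target: str     # English sentence
    alignment: str  # space-separated "i-j" links between token indices

    def word_links(self):
        """Return (source_word, target_word) tuples for each alignment link."""
        src_tokens = self.source.split()  # naive whitespace tokenization (assumption)
        tgt_tokens = self.target.split()
        links = []
        for link in self.alignment.split():
            i, j = map(int, link.split("-"))
            links.append((src_tokens[i], tgt_tokens[j]))
        return links


# Invented illustrative sentence, not drawn from the corpus.
pair = AlignedPair(
    source="Жив собі дід",
    target="There lived an old man",
    alignment="0-1 2-3 2-4",
)

for src_word, tgt_word in pair.word_links():
    print(src_word, "->", tgt_word)
```

In this sketch one source word can link to several target words (here "дід" maps to both "old" and "man"), which is one reason word-level alignment carries information that sentence-level alignment alone does not.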
Taxonomy
Topics: Natural Language Processing Techniques · Folklore, Mythology, and Literature Studies
