Machine translation training data for English–Tshivenḓa
Tanja Gaustad, Cindy A. McKellar, Martin J. Puttkammer

TL;DR
This paper introduces a new English–Tshivenḓa machine translation dataset built from crawled South African government websites, matched documents from translators and publishers, and additional professional translations.
Contribution
The paper presents a novel English–Tshivenḓa machine translation dataset with parallel and monolingual data.
Findings
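The dataset contains aligned parallel English–Tshivenḓa data and monolingual Tshivenḓa data, drawn from multilingual South African government sites, matched documents from translators or publishing sources, and new professional translations.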
The corpus can be used for machine translation and other Tshivenḓa language technologies.
The article reports the corpus word counts and describes how alignment and corpus cleanup were done.
Abstract
This data article describes a machine translation training data set for translation between English and Tshivenḓa. The data set contains parallel, aligned English–Tshivenḓa data as well as monolingual Tshivenḓa data. The data was collected both by crawling multilingual South African government sites and from matched documents from translators or publishing sources. Additional unique data was translated from English into Tshivenḓa by professional translators to increase the total corpus size. This article contains information about the collection and translation of the data as well as how alignments and corpus cleanup were done. The word counts of the corpus are also given. In addition to training machine translation systems, this data can also be used for the development of other Tshivenḓa core technologies as well as for linguistic studies.
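The abstract mentions corpus cleanup and word counts without giving the procedure here, so the following is only a minimal sketch of what such a step could look like for aligned segment pairs. The filters (whitespace normalisation, dropping empty pairs, exact-duplicate removal, a length-ratio check), the function names `clean_parallel` and `word_counts`, and the `max_ratio` threshold are all assumptions for illustration, not the authors' method. The target-side strings are placeholder tokens, not real Tshivenḓa.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (English segment, Tshivenḓa segment)


def clean_parallel(pairs: Iterable[Pair], max_ratio: float = 3.0) -> List[Pair]:
    """Drop empty, duplicate, and badly length-mismatched pairs.

    max_ratio is a hypothetical threshold, not a value from the article.
    """
    seen = set()
    kept: List[Pair] = []
    for en, ve in pairs:
        # Collapse runs of whitespace and trim both sides.
        en, ve = " ".join(en.split()), " ".join(ve.split())
        if not en or not ve:
            continue  # one side is empty, so there is nothing to align
        n_en, n_ve = len(en.split()), len(ve.split())
        if max(n_en, n_ve) / min(n_en, n_ve) > max_ratio:
            continue  # lengths differ too much; likely a misalignment
        if (en, ve) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((en, ve))
        kept.append((en, ve))
    return kept


def word_counts(pairs: Iterable[Pair]) -> Tuple[int, int]:
    """Return whitespace-token counts for the English and Tshivenḓa sides."""
    en_total = ve_total = 0
    for en, ve in pairs:
        en_total += len(en.split())
        ve_total += len(ve.split())
    return en_total, ve_total


# Placeholder data: "w1 w2 ..." stands in for Tshivenḓa text.
raw = [
    ("Good  morning", "w1 w2"),
    ("Good morning", "w1 w2"),        # duplicate after normalisation
    ("", "w1"),                       # empty source side
    ("one", "w1 w2 w3 w4 w5"),        # length ratio 5 exceeds max_ratio
    ("Thank you very much", "w1 w2 w3"),
]
cleaned = clean_parallel(raw)
counts = word_counts(cleaned)
```

Counting whitespace-separated tokens is a simplification. A real pipeline might use a language-aware tokeniser and Unicode normalisation, since Tshivenḓa uses characters with diacritics such as ḓ.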
Taxonomy
Topics: Natural Language Processing Techniques
