Machine translation training data for English–Tshivenḓa

Tanja Gaustad; Cindy A. McKellar; Martin J. Puttkammer

PMC · DOI: 10.1016/j.dib.2024.110898 · September 7, 2024

Open Access

TL;DR

This paper introduces a new dataset for translation between English and Tshivenḓa, collected from South African government websites and matched documents and extended with additional text translated by professional translators.

Contribution

The paper presents a novel English–Tshivenḓa machine translation dataset with parallel and monolingual data.

Findings

01 The dataset includes parallel and monolingual data collected from government sites and professional translations.

02 The corpus can be used for machine translation and other Tshivenḓa language technologies.

03 Word counts and corpus cleanup methods are detailed for transparency and reproducibility.

Abstract

This data article describes a machine translation training data set for translation between English and Tshivenḓa. The data set contains parallel, aligned English–Tshivenḓa data as well as monolingual Tshivenḓa data. The data was collected both by web crawling multilingual South African government sites and from matched documents obtained from translators or publishing sources. Additional unique data was translated from English into Tshivenḓa by professional translators to increase the total corpus size. This article describes how the data was collected and translated, and how alignment and corpus cleanup were done. The word counts of the corpus are also given. In addition to training machine translation systems, this data can be used for the development of other Tshivenḓa core technologies as well as for linguistic studies.
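The abstract mentions corpus cleanup and word counts without giving details here, so the following is only a minimal sketch of what such steps can look like for an aligned parallel corpus. It is not the authors' pipeline: the function names `clean_pairs` and `word_counts`, the whitespace tokenisation, and the exact-duplicate filter are all assumptions made for illustration, and the sample pairs use placeholder tokens rather than real Tshivenḓa text.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (English segment, Tshivenda segment)


def clean_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Normalise whitespace, drop pairs with an empty side, drop exact duplicates.

    Hypothetical cleanup step; the published corpus may use different rules.
    """
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        # Collapse runs of whitespace and trim both segments.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        if not src or not tgt:
            continue  # an aligned pair needs text on both sides
        key = (src, tgt)
        if key in seen:
            continue  # keep only the first occurrence of each pair
        seen.add(key)
        kept.append(key)
    return kept


def word_counts(pairs: Iterable[Pair]) -> Tuple[int, int]:
    """Return (source words, target words) using whitespace tokenisation."""
    src_total = 0
    tgt_total = 0
    for src, tgt in pairs:
        src_total += len(src.split())
        tgt_total += len(tgt.split())
    return src_total, tgt_total


# Placeholder data: t1..t6 stand in for target-language tokens.
sample = [
    ("Hello  world", "t1 t2 t3"),
    ("Hello world", "t1 t2 t3"),   # duplicate after whitespace normalisation
    ("", "t4"),                    # empty source side, dropped
    ("Good morning all", "t5 t6"),
]
cleaned = clean_pairs(sample)
counts = word_counts(cleaned)
```

Whitespace tokenisation is the simplest possible choice and will differ from counts produced by a language-aware tokeniser, so figures computed this way should not be expected to match the published word counts exactly.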

Taxonomy

Topics: Natural Language Processing Techniques