Milimili. Collecting Parallel Data via Crowdsourcing

Alexander Antonov

arXiv:2307.12282·cs.CL·July 25, 2023

Open Access · 1 Repo

TL;DR

This paper introduces a crowdsourcing-based method for collecting parallel corpora, providing a cost-effective alternative to professional translation, and shares experimental data for Chechen-Russian and Fula-English language pairs.

Contribution

The paper presents a crowdsourcing methodology for parallel data collection and releases experimental datasets for two under-resourced language pairs, Chechen-Russian and Fula-English.

Findings

01

Crowdsourcing is a viable, cost-effective approach for collecting parallel corpora.

02

Experimental datasets for Chechen-Russian and Fula-English are now available.

03

The quality of crowdsourced data varies, but the data can still be useful for NLP tasks.

Abstract

We present a methodology for gathering a parallel corpus through crowdsourcing, which is more cost-effective than hiring professional translators, albeit at the expense of quality. Additionally, we have made available experimental parallel data collected for Chechen-Russian and Fula-English language pairs.

Code & Models

Repositories

alantonov/milimili (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling