Milimili. Collecting Parallel Data via Crowdsourcing
Alexander Antonov

TL;DR
This paper introduces a crowdsourcing-based method for collecting parallel corpora, providing a cost-effective alternative to professional translation, and shares experimental data for Chechen-Russian and Fula-English language pairs.
Contribution
The paper presents a crowdsourcing methodology for collecting parallel data and releases experimental datasets for the Chechen-Russian and Fula-English language pairs.
Findings
Crowdsourcing is a more cost-effective way to collect parallel corpora than hiring professional translators.
Experimental datasets for Chechen-Russian and Fula-English are now available.
The quality of crowdsourced data varies, but the data can still be useful for NLP tasks.
Abstract
We present a methodology for gathering a parallel corpus through crowdsourcing, which is more cost-effective than hiring professional translators, albeit at the expense of quality. Additionally, we have made available experimental parallel data collected for Chechen-Russian and Fula-English language pairs.
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling
