Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data
Rajat Bhatnagar, Ananya Ganesh, Katharina Kann

TL;DR
This paper introduces a cost-effective method for collecting parallel translation data from monolingual speakers using GIFs as a pivot, and shows that it yields higher-quality sentence pairs than an image-based baseline.
Contribution
The paper presents a simple approach for crowdsourcing machine translation data without bilingual speakers: GIFs serve as a pivot between monolingual annotators, which yields higher-quality data than an image pivot.
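To make the pivot idea concrete, here is a minimal sketch of how sentences from monolingual annotators could be assembled into parallel pairs. It assumes each sentence is stored with the ID of the GIF it was written for. The function name `pair_by_pivot`, the tuple layout, the all-against-all pairing, and the toy captions are illustrative assumptions, not the authors' pipeline.

```python
from collections import defaultdict
from itertools import product


def pair_by_pivot(captions, src_lang, tgt_lang):
    """Pair sentences in two languages that were written for the same pivot.

    `captions` is an iterable of (pivot_id, lang, sentence) tuples.
    This layout is a hypothetical choice made for this sketch.
    """
    # Group sentences first by pivot (GIF) ID, then by language.
    by_pivot = defaultdict(lambda: defaultdict(list))
    for pivot_id, lang, sentence in captions:
        by_pivot[pivot_id][lang].append(sentence.strip())

    pairs = []
    for pivot_id in sorted(by_pivot):
        langs = by_pivot[pivot_id]
        # Assumption: every source sentence is paired with every target
        # sentence written for the same GIF.
        for src, tgt in product(langs.get(src_lang, []), langs.get(tgt_lang, [])):
            pairs.append((pivot_id, src, tgt))
    return pairs


# Invented toy data, for illustration only.
toy_captions = [
    ("gif_1", "en", "A dog is running after the ball."),
    ("gif_1", "hi", "एक कुत्ता गेंद के पीछे दौड़ रहा है।"),
    ("gif_2", "en", "A man opens a door."),  # no Hindi sentence, so no pair
]
toy_pairs = pair_by_pivot(toy_captions, "en", "hi")
```

Pivots that have sentences in only one of the two languages produce no pairs, so the output contains only candidate sentence pairs whose two sides were written for the same GIF.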
Findings
GIF-based data collection yields higher-quality sentence pairs than image-based collection.
The method is effective across Hindi, Tamil, and English.
Finetuning mBART on GIF-collected data improves translation performance.
Abstract
High-performing machine translation (MT) systems can help overcome language barriers while making it possible for everyone to communicate and use language technologies in the language of their choice. However, such systems require large amounts of parallel sentences for training, and translators can be difficult to find and expensive. Here, we present a data collection strategy for MT which, in contrast, is cheap and simple, as it does not require bilingual speakers. Based on the insight that humans pay specific attention to movements, we use graphics interchange formats (GIFs) as a pivot to collect parallel sentences from monolingual annotators. We use our strategy to collect data in Hindi, Tamil and English. As a baseline, we also collect data using images as a pivot. We perform an intrinsic evaluation by manually evaluating a subset of the sentence pairs and an extrinsic evaluation…
Taxonomy
Methods: mBART
