CrowdSpeech and VoxDIY: Benchmark Datasets for Crowdsourced Audio Transcription
Nikita Pavlichenko, Ivan Stelmakh, Dmitry Ustalov

TL;DR
This paper introduces CrowdSpeech and VoxDIY, large-scale datasets for crowdsourced audio transcription, and proposes a principled methodology for reliable data collection and aggregation in speech recognition tasks.
Contribution
It provides the first large-scale crowdsourced audio transcription datasets and a general pipeline for data collection applicable to new domains and languages.
Findings
Existing aggregation methods have room for improvement.
The proposed pipeline is effective for under-resourced languages.
Open-source code enables replication and further research.
Abstract
Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life. In simple problems such as image classification, crowdsourcing has become one of the standard tools for cheap and time-efficient data collection, thanks in large part to advances in research on aggregation methods. However, the applicability of crowdsourcing to more complex tasks (e.g., speech recognition) remains limited due to the lack of principled aggregation methods for these modalities. The main obstacle towards designing aggregation methods for more advanced applications is the absence of training data, and in this work, we focus on bridging this gap in speech recognition. For this, we collect and release CrowdSpeech -- the first publicly available large-scale dataset of crowdsourced audio transcriptions. Evaluation of existing and novel aggregation methods on our…
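To make the aggregation problem concrete, the following is a minimal Python sketch of a naive baseline: given several noisy transcriptions of the same recording, it returns the one closest on average to all the others (a medoid under word-level edit distance) and scores it with word error rate. This is an illustration only, not one of the aggregation methods evaluated in the paper; the function names and the example sentences are hypothetical.

```python
from typing import List, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def aggregate_by_medoid(transcriptions: List[str]) -> str:
    """Pick the transcription with the smallest total distance to the rest.

    Hypothetical baseline for illustration; ties go to the earliest entry.
    """
    if not transcriptions:
        raise ValueError("need at least one transcription")
    tokenized = [t.lower().split() for t in transcriptions]
    totals = [sum(word_edit_distance(x, y) for y in tokenized)
              for x in tokenized]
    return transcriptions[totals.index(min(totals))]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Made-up worker transcriptions of one recording.
    workers = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "the cat sat on the mat",
        "a cat sit on the mat",
    ]
    best = aggregate_by_medoid(workers)
    print(best, word_error_rate("the cat sat on the mat", best))
```

A medoid can only return a transcription that some worker actually produced, so it cannot combine correct fragments from different workers into a better answer than any single one.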
Taxonomy
Topics: Music and Audio Processing · Mobile Crowdsensing and Crowdsourcing · Speech and Audio Processing
