ARCA23K: An audio dataset for investigating open-set label noise
Turab Iqbal, Yin Cao, Andrew Bailey, Mark D. Plumbley, Wenwu Wang

TL;DR
ARCA23K is a large, curated audio dataset designed to facilitate the study of open-set label noise in sound classification, enabling more controlled research on label errors and their effects.
Contribution
The paper introduces ARCA23K, a reproducible audio dataset that specifically addresses open-set label noise, which is less explored in existing datasets.
Findings
Most label errors are due to out-of-vocabulary audio clips.
Open-set label noise impacts classification performance.
Dataset creation process is fully reproducible.
Abstract
The availability of audio data on sound sharing platforms such as Freesound gives users access to large amounts of annotated audio. Utilising such data for training is becoming increasingly popular, but the problem of label noise that is often prevalent in such datasets requires further investigation. This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprised of over 23000 labelled Freesound clips. Unlike past datasets such as FSDKaggle2018 and FSDnoisy18K, ARCA23K facilitates the study of label noise in a more controlled manner. We describe the entire process of creating the dataset such that it is fully reproducible, meaning researchers can extend our work with little effort. We show that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and we refer to this type of label noise as open-set label noise. Experiments…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
