TL;DR
This paper introduces VocalSound, a large and diverse crowdsourced dataset of human non-speech vocal sounds with rich metadata. Adding it to existing training data significantly improves recognition performance, and it supports research in applications such as sound transcription and health monitoring.
Contribution
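The creation of VocalSound, a dataset of over 21,000 crowdsourced recordings from 3,365 unique subjects covering six vocal sounds (laughter, sighs, coughs, throat clearing, sneezes, and sniffs), with detailed metadata. It addresses the small sample counts and noisy labels of previous datasets for vocal sound recognition.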
Findings
Adding VocalSound to an existing dataset as training material improves a model's vocal sound recognition performance by 41.9%.
The dataset's metadata enables demographic and health-related analysis.
VocalSound enhances robustness of vocal sound classification models.
Abstract
Recognizing human non-speech vocalizations is an important task and has broad applications such as automatic sound transcription and health condition monitoring. However, existing datasets have a relatively small number of vocal sound samples or noisy labels. As a consequence, state-of-the-art audio event classification models may not perform well in detecting human vocal sounds. To support research on building robust and accurate vocal sound recognition, we have created a VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. Experiments show that the vocal sound recognition performance of a model can be significantly improved by 41.9% by adding VocalSound dataset to an existing dataset as training material. In addition, different from previous datasets, the VocalSound dataset…
