Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier

TL;DR
This survey reviews 69 audio-language datasets, analyzing their characteristics, challenges, and opportunities in order to support the development of more diverse and effective audio-language models.
Contribution
It provides a comprehensive analysis of existing datasets, evaluates their variability and biases, and discusses key challenges and opportunities for future dataset development.
Findings
AudioSet has over two million samples from YouTube.
Freesound contains over one million samples from community contributions.
The survey identifies biases and imbalances in sound categories and language representation.
Abstract
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). It provides a comprehensive analysis of dataset origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets like AudioSet, with over two million samples, and community platforms like Freesound, with over one million samples. Through principal component analysis of audio and text embeddings, the survey evaluates the acoustic and linguistic variability across datasets. It also analyzes data leakage through CLAP embeddings, and examines sound category distributions to identify imbalances. Finally, the survey identifies…
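The abstract names two embedding-based analyses without giving their details here: principal component analysis for variability, and a leakage check over CLAP embeddings. The sketch below illustrates the general idea only and is not the authors' pipeline. It assumes random vectors as stand-ins for real CLAP embeddings, a cosine-similarity threshold of 0.99 chosen purely for illustration, and hypothetical function names (`pca_project`, `find_near_duplicates`).

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project embeddings onto their top principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def find_near_duplicates(train_emb, test_emb, threshold=0.99):
    """Return (train_idx, test_idx) pairs with cosine similarity >= threshold.

    The threshold is an assumption for this sketch, not a value from the survey.
    """
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = a @ b.T  # pairwise cosine similarities
    return [(int(i), int(j)) for i, j in np.argwhere(sims >= threshold)]

# Synthetic stand-ins for embeddings of two datasets (assumption).
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 64))
test = rng.normal(size=(20, 64))
test[3] = train[42]  # plant one "leaked" clip shared by both sets

coords = pca_project(np.vstack([train, test]))  # 2-D view of variability
leaks = find_near_duplicates(train, test)       # candidate leaked pairs
```

With real data, `train` and `test` would be replaced by embedding matrices computed from the audio or captions of two datasets, and the flagged pairs would be inspected manually before being treated as leakage.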
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies
