AudioPairBank: Towards A Large-Scale Tag-Pair-Based Audio Content Analysis
Sebastian Sager, Benjamin Elizalde, Damian Borth, Christian Schulze, Bhiksha Raj, Ian Lane

TL;DR
This paper introduces AudioPairBank, a corpus of over 33,000 audio files labeled with 1,123 adjective-noun and verb-noun pairs, and reports sound recognition experiments on these labels that reach 70% accuracy.
Contribution
It provides the first dataset with adjective-noun and verb-noun labels for audio and analyzes their correlation with sound content.
Findings
Collected and processed 33,000+ audio files with 1,123 label pairs.
Achieved 70% accuracy in recognizing audio content with these labels.
Documented challenges and implications of collecting nuanced audio annotations.
Abstract
Recently, sound recognition has been used to identify sounds, such as car and river. However, sounds have nuances that may be better described by adjective-noun pairs such as slow car, and verb-noun pairs such as flying insects, which are underexplored. Therefore, in this work we investigate the relation between audio content and both adjective-noun pairs and verb-noun pairs. Due to the lack of datasets with these kinds of annotations, we collected and processed the AudioPairBank corpus consisting of a combined total of 1,123 pairs and over 33,000 audio files. One contribution is the previously unavailable documentation of the challenges and implications of collecting audio recordings with these types of labels. A second contribution is to show the degree of correlation between the audio content and the labels through sound recognition experiments, which yielded results of 70% accuracy,…
