Exploring Differences between Human Perception and Model Inference in Audio Event Recognition
Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley

TL;DR
This paper investigates discrepancies between human auditory perception and model inference in audio event recognition (AER). It introduces a new multi-annotator dataset and an analysis of how models and humans differ in identifying and detecting audio events.
Contribution
The paper presents the MAFAR dataset with multi-annotator labels, and analyzes the differences between human perception and model inference in semantic importance and event detection.
Findings
Humans ignore subtle or trivial events in semantic identification.
Models are affected by noisy events and tend to be more sensitive than humans in event detection.
A significant gap exists between human perception and model inference in AER.
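To make the findings above concrete, here is a minimal sketch of one way multi-annotator labels could be compared with a model's detections. Everything in it is an assumption for illustration: the agreement-ratio score, the 0.5 threshold, the function names, and the toy clip are hypothetical and are not taken from the paper or the MAFAR dataset.

```python
from collections import Counter


def annotator_agreement(annotations):
    """Return, per event, the fraction of annotators who labelled it.

    Hypothetical proxy for semantic importance; the paper's own
    definition may differ.
    """
    num_annotators = len(annotations)
    # set() so that an annotator repeating a label is counted once
    counts = Counter(label for labels in annotations for label in set(labels))
    return {event: n / num_annotators for event, n in counts.items()}


def compare_with_model(annotations, model_events, threshold=0.5):
    """Split events into agreed, model-only, and human-only groups.

    threshold is an assumed cut-off on the agreement ratio.
    """
    agreement = annotator_agreement(annotations)
    human = {event for event, score in agreement.items() if score >= threshold}
    model = set(model_events)
    return {
        "agreed": sorted(human & model),
        "model_only": sorted(model - human),
        "human_only": sorted(human - model),
    }


# Toy clip: four hypothetical annotators and one hypothetical model output
annotations = [
    ["speech", "dog_bark"],
    ["speech"],
    ["speech", "dog_bark"],
    ["speech", "keyboard"],
]
model_events = ["speech", "dog_bark", "keyboard", "hum"]

result = compare_with_model(annotations, model_events)
print(result)
```

In this toy clip, the events that land in "model_only" are the ones few or no annotators reported, which mirrors the pattern of a model being more sensitive than human listeners.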
Abstract
Audio Event Recognition (AER) traditionally focuses on detecting and identifying audio events. Most existing AER models tend to detect all potential events without considering their varying significance across different contexts. As a result, the events detected by existing models often differ substantially from human auditory perception. Although this is a critical issue, it has not been extensively studied by the Detection and Classification of Sound Scenes and Events (DCASE) community because solving it is time-consuming and labour-intensive. To address this issue, this paper introduces the concept of semantic importance in AER, focusing on exploring the differences between human perception and model inference. This paper constructs a Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which comprises audio recordings labelled by 10 professional…
Taxonomy
Topics: Music and Audio Processing
