Binaural Signal Representations for Joint Sound Event Detection and Acoustic Scene Classification
Daniel Aleksander Krause, Annamaria Mesaros

TL;DR
This paper explores binaural spatial audio features as inputs to a joint deep learning model for sound event detection and acoustic scene classification, and demonstrates that specific features improve performance on both tasks.
Contribution
It investigates binaural features such as GCC-phat and the sines and cosines of phase differences as inputs to a joint DNN for SED and ASC, showing improved results over baseline methods.
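For reference, GCC-phat can be computed frame by frame: whiten the cross-power spectrum of the left and right signals so that only phase remains, then transform it back to the lag domain. The following is a minimal NumPy sketch assuming a Hann-windowed STFT. The function name `gcc_phat` and the parameters `n_fft`, `hop` and `max_lag` are illustrative choices, not values taken from the paper.

```python
import numpy as np


def gcc_phat(left, right, n_fft=1024, hop=512, max_lag=32, eps=1e-12):
    """Frame-wise GCC-PHAT between two channels (illustrative sketch).

    Returns an array of shape (n_frames, 2 * max_lag + 1) whose columns
    correspond to lags -max_lag..+max_lag, in samples.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(left) - n_fft) // hop
    out = np.empty((n_frames, 2 * max_lag + 1))
    for t in range(n_frames):
        seg = slice(t * hop, t * hop + n_fft)
        L = np.fft.rfft(left[seg] * window)
        R = np.fft.rfft(right[seg] * window)
        cross = L * np.conj(R)
        # PHAT weighting: normalise the magnitude so only phase information remains
        cross /= np.abs(cross) + eps
        cc = np.fft.irfft(cross, n=n_fft)
        # irfft places negative lags at the end; reorder to -max_lag..+max_lag
        out[t] = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    return out


# Usage: white noise where the left channel lags the right by 5 samples
rng = np.random.default_rng(0)
right = rng.standard_normal(48000)
left = np.concatenate([np.zeros(5), right[:-5]])
features = gcc_phat(left, right)
print(features.shape)  # (92, 65)
estimated_lag = int(np.argmax(features.mean(axis=0))) - 32
```

With this sign convention, a peak at a positive lag means the left channel lags the right. Keeping only a small lag range around zero is a reasonable choice for binaural input, since physically plausible interaural delays stay well under a millisecond.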
Findings
Binaural features improve SED and ASC performance.
Joint training benefits from spatial audio features.
Specific binaural features outperform logmel energies.
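This summary does not state how the joint model is trained. Purely as an illustrative assumption, a common setup for joint SED and ASC sums a frame-wise multi-label binary cross-entropy for event activity and a clip-level categorical cross-entropy for the scene label. The NumPy sketch below shows that combination. The function `joint_loss` and its weighting parameter `alpha` are hypothetical and are not the paper's stated method.

```python
import numpy as np


def joint_loss(sed_probs, sed_targets, asc_probs, asc_target, alpha=0.5, eps=1e-7):
    """Weighted sum of an SED loss and an ASC loss (illustrative assumption).

    sed_probs, sed_targets: arrays of shape (n_frames, n_event_classes) holding
        sigmoid outputs and 0/1 activity labels.
    asc_probs: softmax output of shape (n_scene_classes,) for the clip.
    asc_target: integer index of the true scene.
    alpha: hypothetical weight balancing the two tasks.
    """
    p = np.clip(sed_probs, eps, 1.0 - eps)
    # SED: mean binary cross-entropy over frames and event classes
    sed_loss = -np.mean(sed_targets * np.log(p) + (1 - sed_targets) * np.log(1 - p))
    # ASC: categorical cross-entropy for the single scene label of the clip
    asc_loss = -np.log(np.clip(asc_probs[asc_target], eps, 1.0))
    return alpha * sed_loss + (1.0 - alpha) * asc_loss


# Uninformative predictions: SED term is log(2), ASC term over 4 scenes is log(4)
print(joint_loss(np.full((10, 3), 0.5), np.zeros((10, 3)), np.full(4, 0.25), 2))  # ≈ 1.04
```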
Abstract
Sound event detection (SED) and acoustic scene classification (ASC) are two widely researched audio tasks that constitute an important part of research on acoustic scene analysis. Considering shared information between sound events and acoustic scenes, performing both tasks jointly is a natural part of a complex machine listening system. In this paper, we investigate the usefulness of several spatial audio features in training a joint deep neural network (DNN) model performing SED and ASC. Experiments are performed for two different datasets containing binaural recordings and synchronous sound event and acoustic scene labels to analyse the differences between performing SED and ASC separately or jointly. The presented results show that the use of specific binaural features, mainly the Generalized Cross Correlation with Phase Transform (GCC-phat) and sines and cosines of phase…
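The phase-difference features can be sketched as follows: compute the STFT of each ear signal, take the phase of the cross-spectrum as the interchannel phase difference in every time-frequency bin, and encode it with sine and cosine so the feature has no jump at ±π. This minimal NumPy sketch assumes Hann-windowed STFT settings (`n_fft`, `hop`). The helper names `stft` and `phase_difference_features` are hypothetical, and the paper's exact configuration is not given in this summary.

```python
import numpy as np


def stft(x, n_fft=1024, hop=512):
    """Hann-windowed STFT with shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop:t * hop + n_fft] * window for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def phase_difference_features(left, right, n_fft=1024, hop=512):
    """Sine and cosine of the left/right phase difference per time-frequency bin.

    Returns shape (2, n_frames, n_fft // 2 + 1): index 0 is sin, index 1 is cos.
    """
    L = stft(left, n_fft, hop)
    R = stft(right, n_fft, hop)
    # The angle of L * conj(R) equals phase(L) - phase(R), wrapped to [-pi, pi]
    ipd = np.angle(L * np.conj(R))
    # sin/cos encoding removes the 2*pi wrap-around discontinuity of raw phase
    return np.stack([np.sin(ipd), np.cos(ipd)])


# Usage: the right channel is the polarity-inverted left channel (phase difference of pi)
rng = np.random.default_rng(0)
x = rng.standard_normal(48000)
feats = phase_difference_features(x, -x)
print(feats.shape)  # (2, 92, 513)
```

Stacking sin and cos as separate input channels keeps the representation continuous, so a phase difference just below π and one just above -π map to nearly identical feature values.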
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Diverse Musicological Studies
