Binaural Sound Event Localization and Detection Neural Network based on HRTF Localization Cues for Humanoid Robots
Gyeong-Tae Lee

TL;DR
This paper introduces BiSELDnet, a neural network that uses novel binaural features and HRTF cues to improve sound event localization and detection in humanoid robots, especially in challenging environments.
Contribution
It proposes an eight-channel binaural time-frequency feature (BTFF) set and a neural network architecture, BiSELDnet, that together improve sound localization accuracy and robustness over existing models.
Findings
BiSELDnet outperforms state-of-the-art models in urban noise conditions.
BTFF improves elevation estimation and front-back discrimination.
VAM visualization confirms that the network focuses on the N1 notch for elevation estimation.
Abstract
Humanoid robots require simultaneous sound event type and direction estimation for situational awareness, but conventional two-channel input struggles with elevation estimation and front-back confusion. This paper proposes a binaural sound event localization and detection (BiSELD) neural network to address these challenges. BiSELDnet learns time-frequency patterns and head-related transfer function (HRTF) localization cues from binaural input features. A novel eight-channel binaural time-frequency feature (BTFF) is introduced, comprising left/right mel-spectrograms, V-maps, an interaural time difference (ITD) map (below 1.5 kHz), an interaural level difference (ILD) map (above 5 kHz with front-back asymmetry), and spectral cue (SC) maps (above 5 kHz for elevation). The effectiveness of BTFF was confirmed across omnidirectional, horizontal, and median planes. BiSELDnets, particularly one…
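To make the interaural cues in the abstract concrete, the sketch below computes a band-limited ITD map and ILD map from a two-channel signal. It is a generic textbook formulation, not the paper's definition: the abstract does not say how the BTFF channels are computed, so the phase-difference ITD estimate, the linear-frequency STFT (no mel filterbank), the function names `stft` and `interaural_maps`, and all parameter values other than the 1.5 kHz and 5 kHz band edges are assumptions made for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT; returns shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T

def interaural_maps(left, right, sr, n_fft=512, hop=256,
                    itd_max_hz=1500.0, ild_min_hz=5000.0):
    """Hypothetical helper: ITD map (seconds) and ILD map (dB).

    Assumed conventions (not taken from the paper):
    positive ITD means the left ear leads, positive ILD means
    the left ear is louder. Bins outside each band are set to 0.
    """
    L = stft(left, n_fft, hop)
    R = stft(right, n_fft, hop)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    eps = 1e-10  # avoids division by zero and log of zero

    # ILD: level ratio in dB, kept only above ild_min_hz.
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ild[freqs < ild_min_hz, :] = 0.0

    # ITD: interaural phase difference divided by angular frequency,
    # kept only for 0 < f <= itd_max_hz.
    ipd = np.angle(L * np.conj(R))
    itd = np.zeros_like(ipd)
    band = (freqs > 0) & (freqs <= itd_max_hz)
    itd[band, :] = ipd[band, :] / (2.0 * np.pi * freqs[band][:, None])
    return itd, ild, freqs

# Synthetic check: right ear is a delayed, half-amplitude copy of the left.
sr = 16000
t = np.arange(sr) / sr
delay = 2 / sr  # 125 microseconds
left = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 6000 * t)
right = 0.5 * (np.sin(2 * np.pi * 500 * (t - delay))
               + np.sin(2 * np.pi * 6000 * (t - delay)))

itd, ild, freqs = interaural_maps(left, right, sr)
print("ITD at 500 Hz (s):", itd[16].mean())    # bin 16 is 500 Hz
print("ILD at 6 kHz (dB):", ild[192].mean())   # bin 192 is 6000 Hz
```

One limitation of this phase-based estimate: the phase difference is only unambiguous while the true delay stays below half a period, which at 1.5 kHz is about 333 microseconds, so larger delays near the top of the band would wrap.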
