AuralNet: Hierarchical Attention-based 3D Binaural Localization of Overlapping Speakers

Linya Fu; Yu Liu; Zhijie Liu; Zedong Yang; Zhong-Qiu Wang; Youfu Li; and He Kong

arXiv:2506.02773·eess.AS·June 4, 2025

Open Access

TL;DR

AuralNet is a hierarchical attention-based neural network that accurately localizes multiple overlapping sound sources in 3D space using binaural signals, even in noisy and reverberant environments.

Contribution

AuralNet introduces a hierarchical architecture with attention mechanisms for multi-source 3D localization that does not require prior knowledge of the number of sources.

Findings

01 Outperforms recent localization methods in noisy-reverberant settings

02 Effectively localizes overlapping sources in azimuth and elevation

03 Robust to environmental noise and reverberation

Abstract

We propose AuralNet, a novel 3D multi-source binaural sound source localization approach that localizes overlapping sources in both azimuth and elevation without prior knowledge of the number of sources. AuralNet employs a gated coarse-to-fine architecture, combining a coarse classification stage with a fine-grained regression stage, allowing for flexible spatial resolution through sector partitioning. The model incorporates a multi-head self-attention mechanism to capture spatial cues in binaural signals, enhancing robustness in noisy-reverberant environments. A masked multi-task loss function is designed to jointly optimize sound detection, azimuth, and elevation estimation. Extensive experiments in noisy-reverberant conditions demonstrate the superiority of AuralNet over recent methods.
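The abstract does not spell out the loss or the decoding step, so the following is only a minimal sketch of how a masked multi-task loss and a coarse-to-fine readout could fit together. It assumes the azimuth circle is split into equal sectors, that each sector emits a detection probability, a normalized azimuth offset in [0, 1), and an elevation in degrees, and that regression errors are counted only in sectors that actually contain a source. The function names, the squared-error choice, the weights `w_az` and `w_el`, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import numpy as np


def masked_multitask_loss(det_prob, az_pred, el_pred,
                          det_true, az_true, el_true,
                          w_az=1.0, w_el=1.0, eps=1e-7):
    """Illustrative loss; every input is an array of shape (num_sectors,)."""
    p = np.clip(det_prob, eps, 1.0 - eps)
    # Detection term: binary cross-entropy averaged over all sectors.
    bce = -np.mean(det_true * np.log(p) + (1.0 - det_true) * np.log(1.0 - p))
    # Regression terms: squared error, masked so that only sectors with a
    # ground-truth source contribute (assumed form of the "masking").
    mask = det_true.astype(float)
    n_active = max(mask.sum(), 1.0)
    az_loss = np.sum(mask * (az_pred - az_true) ** 2) / n_active
    el_loss = np.sum(mask * (el_pred - el_true) ** 2) / n_active
    return bce + w_az * az_loss + w_el * el_loss


def decode_sources(det_prob, az_offset, el_pred, sector_width_deg, threshold=0.5):
    """Coarse stage picks active sectors; fine stage refines the angle inside each."""
    active = np.flatnonzero(det_prob >= threshold)
    # Sector index gives the coarse angle, the offset places the source within it.
    az = (active + az_offset[active]) * sector_width_deg
    return [(float(a), float(e)) for a, e in zip(az, el_pred[active])]
```

Because the number of sectors above the threshold is read off the detection outputs, a readout of this kind needs no source count as input, which matches the property the abstract claims. Narrowing `sector_width_deg` (more sectors) is one way the "flexible spatial resolution" could be realized under these assumptions.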

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis