TL;DR
This paper introduces a transformer-based framework for sound event localization that captures temporal dependencies and models uncertainty in source positions, outperforming existing methods on multiple datasets.
Contribution
The paper presents a novel transformer architecture for sound localization that incorporates uncertainty modeling via Gaussian representations, surpassing prior recurrent neural network approaches.
Findings
Outperforms state-of-the-art methods on all three publicly available datasets used for evaluation.
Effectively models uncertainty in source localization.
Achieves statistically significant improvements in accuracy.
Abstract
Sound event localization aims to estimate the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain have focused most prominently on deep recurrent neural networks. Inspired by the success of transformer architectures as an alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework in which temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, which provides a notion of uncertainty that many previously proposed deep learning-based systems for this application lack. The framework is evaluated on three publicly available multi-source…
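To make the idea of representing a position estimate as a Gaussian variable concrete, the following is a minimal sketch of a Gaussian negative log-likelihood, a common way to score a predicted mean and variance against a true position. It is not taken from the paper: the function name `diag_gaussian_nll` is hypothetical, and the diagonal covariance is a simplifying assumption, since the excerpt above does not state how the covariance is parameterized or which loss the authors use.

```python
import math


def diag_gaussian_nll(mean, var, target):
    """Negative log-likelihood of `target` under a Gaussian with
    the given per-dimension mean and variance (diagonal covariance).

    Assumption: a diagonal covariance is used here for simplicity;
    the paper's actual parameterization is not given in the excerpt.
    """
    if not (len(mean) == len(var) == len(target)):
        raise ValueError("mean, var and target must have equal length")
    if any(v <= 0 for v in var):
        raise ValueError("variances must be strictly positive")

    nll = 0.0
    for m, v, t in zip(mean, var, target):
        # Per-dimension term: 0.5 * (log(2*pi*v) + (t - m)^2 / v)
        nll += 0.5 * (math.log(2.0 * math.pi * v) + (t - m) ** 2 / v)
    return nll


# Hypothetical example: the same position error scored with a
# small and a large predicted variance.
confident = diag_gaussian_nll([0.0], [0.1], [2.0])
uncertain = diag_gaussian_nll([0.0], [4.0], [2.0])
print(confident > uncertain)
```

Under a loss of this shape, a confident prediction that is far from the target is penalized more heavily than an uncertain one with the same error, which is what lets the predicted variance act as a usable uncertainty estimate.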
