SALADnet: Self-Attentive multisource Localization in the Ambisonics Domain
Pierre-Amaury Grumiaux, Srdan Kitic, Prerak Srivastava, Laurent Girin, Alexandre Guérin

TL;DR
This paper introduces SALADnet, a self-attention neural network for multi-speaker localization from Ambisonics recordings. Most of the proposed variants match or outperform a convolutional recurrent (CRNN) baseline, and they run faster because they can be parallelized.
Contribution
The paper replaces the recurrent layers of a state-of-the-art CRNN with self-attention encoders taken from the Transformer architecture, improving multi-source localization accuracy and computational efficiency.
Findings
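Self-attention models outperform the CRNN baseline, especially in multi-speaker scenarios (up to 3 simultaneous speakers).
Removing the recurrent layers allows parallel processing, which considerably reduces execution time.
Most of the proposed models perform on par with or better than the state-of-the-art CRNN baseline on both synthetic and real-world data.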
Abstract
In this work, we propose a novel self-attention based neural network for robust multi-speaker localization from Ambisonics recordings. Starting from a state-of-the-art convolutional recurrent neural network (CRNN), we investigate the benefit of replacing the recurrent layers with self-attention encoders, inherited from the Transformer architecture. We evaluate these models on synthetic and real-world data, with up to 3 simultaneous speakers. The obtained results indicate that the majority of the proposed architectures either perform on par with or outperform the CRNN baseline, especially in the multisource scenario. Moreover, by avoiding the recurrent layers, the proposed models lend themselves to parallel computing, which is shown to produce considerable savings in execution time.
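The parallelism argument in the abstract can be illustrated with a minimal sketch. It is not the SALADnet architecture: it shows only a single-head scaled dot-product self-attention over a sequence of time frames, next to a plain tanh recurrence for contrast. The function names, the random untrained weights, and the sizes (25 frames, 64 input features, 32 output features) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over time frames.

    frames: (T, d_in) array, one feature vector per time frame.
    w_q, w_k, w_v: (d_in, d_k) projection matrices (hypothetical, untrained).
    Returns the (T, d_k) outputs and the (T, T) attention weights.
    """
    q = frames @ w_q
    k = frames @ w_k
    v = frames @ w_v
    # All frame-to-frame scores come from one matrix product, so every
    # time step is computed at once rather than one after another.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def recurrent(frames, w_x, w_h):
    """Plain tanh recurrence for contrast: frame t needs the state of t-1."""
    h = np.zeros(w_h.shape[0])
    outputs = []
    for x in frames:  # inherently sequential loop over time
        h = np.tanh(x @ w_x + h @ w_h)
        outputs.append(h)
    return np.stack(outputs)


rng = np.random.default_rng(0)
T, d_in, d_k = 25, 64, 32  # illustrative sizes, not from the paper
frames = rng.standard_normal((T, d_in))
w_q, w_k, w_v = (rng.standard_normal((d_in, d_k)) * 0.1 for _ in range(3))

attended, weights = self_attention(frames, w_q, w_k, w_v)
rnn_out = recurrent(
    frames,
    rng.standard_normal((d_in, d_k)) * 0.1,
    rng.standard_normal((d_k, d_k)) * 0.1,
)
```

Both functions map a (T, d_in) sequence to a (T, d_k) sequence, but the attention output for every frame comes from a few matrix products, whereas the recurrence must step through the frames in order. A full Transformer encoder layer would add multiple heads, residual connections, normalization, and a feed-forward block on top of this.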
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
