Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection

Parthasaarathy Sudarsanam; Archontis Politis; Konstantinos Drossos

arXiv:2107.09388·cs.SD·September 28, 2021·5 cites


TL;DR

This paper investigates the impact of replacing RNNs with multi-head self-attention layers in sound event localization and detection models, demonstrating significant performance improvements on benchmark data.

Contribution

It provides a detailed analysis of how self-attention mechanisms enhance SELD models, including effects of stacking, attention heads, and positional encoding, surpassing traditional CRNN approaches.

Findings

01. Self-attention layers improve SELD performance significantly.

02. Stacking multiple attention blocks enhances accuracy.

03. Using multiple attention heads and positional encoding benefits model performance.
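To make the positional-encoding finding concrete, here is a minimal sketch of how position information can be added to a sequence of learned features before attention. This page does not say which encoding scheme the paper uses, so the fixed sinusoidal table below (the form introduced with the original Transformer) is an assumption, as are the function name `sinusoidal_positions` and the 50-frame, 64-feature input shape.

```python
import numpy as np


def sinusoidal_positions(num_frames: int, dim: int) -> np.ndarray:
    """Return a (num_frames, dim) table of fixed sinusoidal position codes.

    Assumption: sinusoidal encoding is used for illustration only; the
    paper's own scheme is not stated on this page.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    positions = np.arange(num_frames)[:, None]                       # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    table = np.zeros((num_frames, dim))
    table[:, 0::2] = np.sin(positions * freqs)   # even feature indices
    table[:, 1::2] = np.cos(positions * freqs)   # odd feature indices
    return table


# Hypothetical CNN output: 50 time frames with 64 features each.
features = np.random.default_rng(0).standard_normal((50, 64))
encoded = features + sinusoidal_positions(50, 64)
```

Self-attention on its own treats the frames as an unordered set, so adding a position-dependent term like this is one way to let the model distinguish early frames from late ones.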

Abstract

Joint sound event localization and detection (SELD) is an emerging audio signal processing task adding spatial dimensions to acoustic scene analysis and sound event detection. A popular approach to modeling SELD jointly is using convolutional recurrent neural network (CRNN) models, where CNNs learn high-level features from multi-channel audio input and the RNNs learn temporal relationships from these high-level features. However, RNNs have some drawbacks, such as a limited capability to model long temporal dependencies and slow training and inference times due to their sequential processing nature. Recently, a few SELD studies used multi-head self-attention (MHSA), among other innovations in their models. MHSA and the related transformer networks have shown state-of-the-art performance in various domains. While they can model long temporal dependencies, they can also be parallelized…
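The parallelism the abstract refers to can be seen in a minimal NumPy sketch of multi-head self-attention over a feature sequence. It is an illustration of the general mechanism, not the authors' model: the function names, the random projection matrices, and the sizes (50 frames, 64 features, 8 heads) are all assumptions made for the example.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with several heads.

    x: (T, D) feature sequence; w_q, w_k, w_v, w_o: (D, D) projections.
    Returns the (T, D) output and the (H, T, T) attention weights.
    """
    num_frames, dim = x.shape
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    d_head = dim // num_heads

    def split(m):
        # (T, D) -> (H, T, d_head): each head sees its own feature slice.
        return m.reshape(num_frames, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Every frame attends to every other frame in one matrix product,
    # with no step-by-step recurrence over time.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (H, T, T)
    weights = softmax(scores, axis=-1)
    context = weights @ v                                  # (H, T, d_head)
    merged = context.transpose(1, 0, 2).reshape(num_frames, dim)
    return merged @ w_o, weights


# Hypothetical sizes: 50 frames, 64 features, 8 heads, random projections.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 64))
w_q, w_k, w_v, w_o = (rng.standard_normal((64, 64)) / 8.0 for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=8)
```

Because the score matrix covers all frame pairs at once, a distant frame is reachable in a single step, whereas an RNN must carry that information through every intermediate time step.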


Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Acoustic Wave Phenomena Research