SwG-former: A Sliding-Window Graph Convolutional Network for Simultaneous Spatial-Temporal Information Extraction in Sound Event Localization and Detection
Weiming Huang, Qinghua Huang, Liyan Ma, Chuan Wang

TL;DR
This paper introduces SwG-former, a novel graph-based neural network that simultaneously captures spatial and temporal features in audio signals, significantly improving sound event localization and detection accuracy.
Contribution
The paper proposes the SwG-former block with sliding-window graph attention and a new Conv2dAgg function, advancing spatial-temporal feature extraction in SELD tasks.
Findings
SwG-former outperforms recent SELD models in accuracy.
SwG-EINV2 surpasses state-of-the-art methods in acoustic environments.
The model effectively captures higher-level spatial correlations.
Abstract
Sound event localization and detection (SELD) involves sound event detection (SED) and direction of arrival (DoA) estimation tasks. SED mainly relies on temporal dependencies to distinguish different sound classes, while DoA estimation depends on spatial correlations to estimate source directions. This paper addresses the need to simultaneously extract spatial-temporal information in audio signals to improve SELD performance. A novel block, the sliding-window graph-former (SwG-former), is designed to learn temporal context information of sound events based on their spatial correlations. The SwG-former block transforms audio signals into a graph representation and constructs graph vertices to capture higher abstraction levels for spatial correlations. It uses different-sized sliding windows to adapt to various sound event durations and aggregates temporal features with similar spatial…
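The sliding-window graph idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the k-nearest-neighbour edge rule, and the max-relative aggregation below are all assumptions chosen for illustration (the paper's own Conv2dAgg function is not reproduced here). The sketch treats each time frame as a graph vertex, restricts candidate neighbours to a temporal window, connects each vertex to the frames in that window with the most similar features, and then aggregates neighbour features.

```python
import numpy as np

def sliding_window_knn_graph(x, window, k):
    """Hypothetical helper: x has shape (T, D), one feature vector per time frame.

    Returns an array of shape (T, k) holding, for each frame, the indices of its
    k most similar frames among those at most window // 2 steps away in time.
    """
    T = x.shape[0]
    half = window // 2
    # Edge frames only have `half` candidates, so k must not exceed that.
    assert 1 <= k <= half < T, "need 1 <= k <= window // 2 < T"
    # Pairwise squared Euclidean distances between frame features.
    sq = (x ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    # Mask out frames outside the sliding window, and the frame itself.
    idx = np.arange(T)
    outside = np.abs(idx[:, None] - idx[None, :]) > half
    dist = np.where(outside, np.inf, dist)
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

def max_relative_aggregate(x, neighbors):
    """Generic stand-in aggregator (assumed, not Conv2dAgg): concatenate each
    vertex feature with the element-wise max of (neighbour - vertex)."""
    diff = x[neighbors] - x[:, None, :]          # shape (T, k, D)
    return np.concatenate([x, diff.max(axis=1)], axis=1)  # shape (T, 2 * D)

def multi_window_features(x, windows=(5, 9), k=2):
    """Run the graph step with several window sizes and average the results,
    loosely mirroring the 'different-sized sliding windows' in the abstract."""
    outs = [max_relative_aggregate(x, sliding_window_knn_graph(x, w, k)) for w in windows]
    return np.mean(outs, axis=0)
```

Smaller windows favour short events and larger windows favour long ones; averaging across window sizes is just one simple way to combine them and is an assumption of this sketch.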
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
