SwG-former: A Sliding-Window Graph Convolutional Network for Simultaneous Spatial-Temporal Information Extraction in Sound Event Localization and Detection

Weiming Huang; Qinghua Huang; Liyan Ma; Chuan Wang

arXiv:2310.14016 · eess.AS · March 21, 2024 · 1 citation

Open Access

TL;DR

This paper introduces SwG-former, a graph-based block that extracts spatial and temporal information from audio signals simultaneously to improve sound event localization and detection (SELD).

Contribution

The paper proposes the SwG-former block with sliding-window graph attention and a new Conv2dAgg function, advancing spatial-temporal feature extraction in SELD tasks.

Findings

01

SwG-former outperforms recent SELD models in accuracy.

02

SwG-EINV2 surpasses state-of-the-art methods under the same acoustic environment.

03

The model effectively captures higher-level spatial correlations.

Abstract

Sound event localization and detection (SELD) involves sound event detection (SED) and direction of arrival (DoA) estimation tasks. SED mainly relies on temporal dependencies to distinguish different sound classes, while DoA estimation depends on spatial correlations to estimate source directions. This paper addresses the need to simultaneously extract spatial-temporal information in audio signals to improve SELD performance. A novel block, the sliding-window graph-former (SwG-former), is designed to learn temporal context information of sound events based on their spatial correlations. The SwG-former block transforms audio signals into a graph representation and constructs graph vertices to capture higher abstraction levels for spatial correlations. It uses different-sized sliding windows to adapt to various sound event durations and aggregates temporal features with similar spatial…
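
The sliding-window idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each time frame is one graph vertex, uses cosine similarity to pick neighbors, and substitutes a plain mean for the paper's Conv2dAgg function, which is not reproduced here. All function names (`window_neighbors`, `aggregate`, `multi_window_features`) and parameters (`windows`, `k`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def window_neighbors(frames, window, k):
    """For each frame t, keep up to k most similar frames inside the
    sliding window centred on t (assumed neighbor rule, not the paper's)."""
    half = window // 2
    neighbors = []
    for t, x in enumerate(frames):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        candidates = [s for s in range(lo, hi) if s != t]
        # Stable sort: ties keep their temporal order.
        candidates.sort(key=lambda s: cosine(x, frames[s]), reverse=True)
        neighbors.append(candidates[:k])
    return neighbors


def aggregate(frames, neighbors):
    """Concatenate each vertex's features with the mean of its neighbors'
    features (a simple stand-in for a learned aggregation function)."""
    out = []
    for x, nbrs in zip(frames, neighbors):
        if nbrs:
            mean = [sum(frames[s][d] for s in nbrs) / len(nbrs) for d in range(len(x))]
        else:
            mean = [0.0] * len(x)
        out.append(list(x) + mean)
    return out


def multi_window_features(frames, windows=(3, 5), k=2):
    """Run the graph step once per window size and concatenate the results,
    so short and long events both get a suitable temporal context."""
    per_window = [aggregate(frames, window_neighbors(frames, w, k)) for w in windows]
    return [sum((feats[t] for feats in per_window), []) for t in range(len(frames))]


# Toy input: five frames with 2-dimensional features.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
neighbors = window_neighbors(frames, window=3, k=1)
features = multi_window_features(frames, windows=(3, 5), k=2)
```

With this toy input, each frame links only to the most similar frame in its local window, so frames with matching features are grouped even when a dissimilar frame sits next to them. A small window keeps the context tight for brief events, while a larger one lets longer events draw on more distant frames.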

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis