State Space and Self-Attention Collaborative Network with Feature Aggregation for DOA Estimation
Qi You, Qinghua Huang, Yi-Cheng Lin

TL;DR
The paper introduces FA-Stateformer, a neural network architecture that combines feature aggregation, state space modeling, and self-attention for more accurate and more efficient sound source DOA estimation.
Contribution
It proposes a new collaborative network integrating feature aggregation, a lightweight Conformer, temporal shift, and bidirectional Mamba modules for enhanced temporal modeling and efficiency.
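The temporal shift component can be illustrated with a minimal sketch. The summary above does not say how the paper configures it, so the code below shows only the generic temporal shift idea: a fraction of channels is moved one frame backward in time, another fraction one frame forward, and the rest is left in place. The function name `temporal_shift`, the `(T, C)` layout, and the `fold_div` parameter are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def temporal_shift(x, fold_div=4):
    """Generic temporal shift on a (T, C) array of T frames and C channels.

    Hypothetical helper: fold_div and the channel split are illustrative
    assumptions, not values taken from the paper.
    """
    T, C = x.shape
    fold = C // fold_div
    out = np.zeros_like(x)
    # First `fold` channels: frame t receives the values of frame t+1.
    out[:-1, :fold] = x[1:, :fold]
    # Next `fold` channels: frame t receives the values of frame t-1.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
    # Remaining channels are copied unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out

# Usage: 3 frames, 4 channels, so one channel is shifted in each direction.
frames = np.arange(12, dtype=float).reshape(3, 4)
shifted = temporal_shift(frames, fold_div=4)
```

The shift itself has no learnable parameters, which is why it is usually described as a cheap way to mix information between neighbouring frames.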
Findings
Outperforms conventional architectures in accuracy.
Achieves better computational efficiency.
Demonstrates robustness across various scenarios.
Abstract
Accurate direction-of-arrival (DOA) estimation for sound sources is challenging due to the continuous changes in acoustic characteristics across time and frequency. In such scenarios, accurate localization relies on the ability to aggregate relevant features and model temporal dependencies effectively. In time series modeling, achieving a balance between model performance and computational efficiency remains a significant challenge. To address this, we propose FA-Stateformer, a state space and self-attention collaborative network with feature aggregation. The proposed network first employs a feature aggregation module to enhance informative features across both temporal and spectral dimensions. This is followed by a lightweight Conformer architecture inspired by the squeeze-and-excitation mechanism, where the feedforward layers are compressed to reduce redundancy and parameter overhead.…
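To make the idea of compressed feedforward layers concrete, here is a rough sketch that compares a standard Conformer-style feedforward block (hidden width four times the model width) with a bottleneck variant whose hidden width is reduced, in the spirit of squeeze-and-excitation. The model width of 256, the reduction ratio of 4, and all function names are hypothetical choices for illustration. The abstract excerpt does not give the paper's actual sizes or layer layout.

```python
import numpy as np

def ffn_params(d_model, d_hidden):
    """Parameter count of two linear layers with biases."""
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model

def make_ffn(d_model, d_hidden, rng):
    """Randomly initialised weights for a two-layer feedforward block."""
    w1 = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
    w2 = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)
    return w1, np.zeros(d_hidden), w2, np.zeros(d_model)

def ffn_forward(x, params):
    """Feedforward block with Swish activation and a half-step residual."""
    w1, b1, w2, b2 = params
    h = x @ w1 + b1
    h = h / (1.0 + np.exp(-h))  # Swish: h * sigmoid(h)
    return x + 0.5 * (h @ w2 + b2)

d_model = 256                                       # assumed model width
standard = ffn_params(d_model, 4 * d_model)         # expansion factor 4
compressed = ffn_params(d_model, d_model // 4)      # assumed reduction ratio 4

rng = np.random.default_rng(0)
x = rng.standard_normal((10, d_model))              # 10 frames of features
y = ffn_forward(x, make_ffn(d_model, d_model // 4, rng))
```

Under these assumed sizes the bottleneck block has 33,088 parameters against 525,568 for the expanded one, which shows where the parameter savings would come from. Whether the paper uses this exact ratio is not stated in the excerpt.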
