Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers
Changsheng Quan, Xiaofei Li

TL;DR
This paper introduces an online multichannel speech enhancement neural network that effectively handles static and moving speakers over long audio streams, improving length extrapolation and maintaining high performance.
Contribution
It develops online variants of SpatialNet with linear inference complexity, enhancing long-term streaming speech enhancement for static and moving speakers.
Findings
Online SpatialNet achieves superior speech enhancement performance.
The proposed methods effectively handle long audio streams.
Length extrapolation is significantly improved with training strategies.
Abstract
In this work, we extend our previously proposed offline SpatialNet for long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, for discriminating between target speech and interferences, and achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamic of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t signal length and meanwhile maintain the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Speech Recognition and Synthesis
