ARiSE: Auto-Regressive Multi-Channel Speech Enhancement

Pengjie Shen; Xueliang Zhang; Zhong-Qiu Wang

arXiv:2505.22051 · eess.AS · June 9, 2025 · Interspeech

TL;DR

ARiSE introduces an auto-regressive multi-channel speech enhancement method that leverages previous estimates to improve current speech estimation, with a novel parallel training mechanism for efficiency, showing promising results in noisy-reverberant environments.

Contribution

The paper presents a novel auto-regressive approach for multi-channel speech enhancement that incorporates previous frame estimates and beamforming, along with a parallel training method to accelerate learning.

Findings

01 Effective in noisy-reverberant conditions

02 Auto-regressive connections improve existing frame-online multi-channel enhancement models

03 Parallel training speeds up auto-regressive training

Abstract

We propose ARiSE, an auto-regressive algorithm for multi-channel speech enhancement. ARiSE improves existing deep neural network (DNN) based frame-online multi-channel speech enhancement models by introducing auto-regressive connections, where the estimated target speech at previous frames is leveraged as extra input features to help the DNN estimate the target speech at the current frame. The extra input features can be derived from (a) the estimated target speech in previous frames; and (b) a beamformed mixture with the beamformer computed based on the previous estimated target speech. On the other hand, naively training the DNN in an auto-regressive manner is very slow. To deal with this, we propose a parallel training mechanism to speed up the training. Evaluation results in noisy-reverberant conditions show the effectiveness and potential of the proposed algorithms.
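The auto-regressive connections described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: `toy_dnn` is a hypothetical stand-in for the trained network, the STFT-domain array layout is chosen for convenience, and the beamformer is a simple matched filter built from running statistics of past estimates, since the abstract does not say which beamformer the paper uses. The sketch only shows the data flow: at each frame the model sees the mixture, (a) the previous frame's estimate, and (b) a mixture beamformed with a filter derived from previous estimates.

```python
import numpy as np

def toy_dnn(features):
    """Hypothetical stand-in for the enhancement DNN (not the paper's model).

    features: complex array of shape (F, C + 2) holding the C mixture channels,
    the previous-frame estimate, and the beamformed mixture (in that order).
    Returns a complex estimate of shape (F,). This toy rule ignores the
    previous-estimate column; a trained network would learn how to use it.
    """
    num_channels = features.shape[1] - 2
    mixture_mean = features[:, :num_channels].mean(axis=1)
    beamformed = features[:, -1]
    return 0.5 * mixture_mean + 0.5 * beamformed

def enhance_autoregressive(Y, dnn, eps=1e-8):
    """Frame-online enhancement with auto-regressive input features.

    Y: complex mixture STFT of shape (T, F, C) = (frames, frequencies, channels).
    Returns the estimated target speech of shape (T, F).
    """
    T, F, C = Y.shape
    S_hat = np.zeros((T, F), dtype=complex)
    # Running statistics computed from past estimates only (assumed design).
    cross = np.zeros((F, C), dtype=complex)  # sum over past t of Y[t] * conj(S_hat[t])
    power = np.zeros(F)                      # sum over past t of |S_hat[t]|^2
    for t in range(T):
        # (a) estimated target speech at the previous frame (zeros at t = 0)
        prev = S_hat[t - 1] if t > 0 else np.zeros(F, dtype=complex)
        # (b) mixture beamformed with a filter derived from previous estimates
        h = cross / (power[:, None] + eps)   # crude per-frequency steering estimate
        w = h / (np.sum(np.abs(h) ** 2, axis=1, keepdims=True) + eps)
        beamformed = np.sum(np.conj(w) * Y[t], axis=1)
        features = np.concatenate(
            [Y[t], prev[:, None], beamformed[:, None]], axis=1
        )
        S_hat[t] = dnn(features)
        # Update the statistics only after the current frame has been estimated.
        cross += Y[t] * np.conj(S_hat[t])[:, None]
        power += np.abs(S_hat[t]) ** 2
    return S_hat

# Usage on random data: 6 frames, 4 frequency bins, 3 microphones.
rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 4, 3)) + 1j * rng.standard_normal((6, 4, 3))
S_hat = enhance_autoregressive(Y, toy_dnn)
```

Because each frame depends on the estimate from the frame before it, this loop has to run sequentially, which is why naive auto-regressive training is slow. The paper's parallel training mechanism addresses that cost; its details are not given in the abstract and are not sketched here.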

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Hearing Loss and Rehabilitation
