MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation

Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, and Dong Yu

arXiv:2104.08450 · cs.SD · April 27, 2021 · 1 citation

Open Access

TL;DR

This paper introduces a self-attentive RNN beamformer that leverages temporal and spatial self-attention modules to enhance multi-speaker speech separation, improving ASR accuracy and speech quality over previous methods.

Contribution

It proposes a novel self-attentive RNN beamformer with temporal and spatial attention modules, and a multi-channel MIMO model for more efficient multi-speaker speech separation.

Findings

1. Improved ASR accuracy compared to prior methods.
2. Enhanced speech quality as measured by PESQ.
3. Better modeling of covariance matrices through self-attention.
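
To make the idea of self-attention over covariance matrices concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's architecture: the query, key, and value projections are identity maps (a real model would learn them), the token layouts are guesses at what "temporal" and "spatial" attention could mean, and all function names (`self_attention`, `temporal_attention`, `spatial_attention`, `to_real`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    ASSUMPTION: no learned weights; this only shows the data flow.
    tokens: real array of shape (..., N, D); the output has the same shape.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def to_real(cov):
    # Stack real and imaginary parts: (..., M, M) -> (..., M, 2M).
    return np.concatenate([cov.real, cov.imag], axis=-1)

def temporal_attention(cov_seq):
    # One token per frame: every frame attends to all T frames,
    # so each output mixes statistics from the whole utterance.
    t = cov_seq.shape[0]
    tokens = to_real(cov_seq).reshape(t, -1)  # (T, 2*M*M)
    return self_attention(tokens)

def spatial_attention(cov_seq):
    # One token per channel row: within each frame, the M rows of the
    # covariance matrix attend to each other (cross-channel correlation).
    tokens = to_real(cov_seq)  # (T, M, 2M)
    return self_attention(tokens)

# Toy input: T frames of instantaneous M x M covariance matrices.
rng = np.random.default_rng(0)
T, M = 50, 4
x = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
cov_seq = x[:, :, None] * x[:, None, :].conj()  # (T, M, M)

temporal_out = temporal_attention(cov_seq)  # shape (T, 2*M*M)
spatial_out = spatial_attention(cov_seq)    # shape (T, M, 2M)
```

In this sketch the temporal variant attends across frames and the spatial variant attends across microphone channels within a frame; how the paper actually combines the two modules with the RNNs is not specified in the summary above.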

Abstract

Recently, our proposed recurrent neural network (RNN) based all deep learning minimum variance distortionless response (ADL-MVDR) beamformer yielded superior performance over the conventional MVDR beamformer by replacing the matrix inversion and eigenvalue decomposition with two recurrent neural networks. In this work, we present a self-attentive RNN beamformer that further improves our previous RNN-based beamformer by leveraging the powerful modeling capability of self-attention. A temporal-spatial self-attention module is proposed to better learn the beamforming weights from the speech and noise spatial covariance matrices. The temporal self-attention module helps the RNN learn global statistics of the covariance matrices. The spatial self-attention module is designed to attend to the cross-channel correlation in the covariance matrices. Furthermore, a multi-channel input with multi-speaker…
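
For context, the two operations that the abstract says the RNNs replace can be seen in a conventional MVDR solution. The sketch below is a textbook-style baseline for a single frequency bin, not code from the paper: it assumes the steering vector is estimated as the principal eigenvector of the speech covariance matrix, and the function name `mvdr_weights`, the diagonal-loading parameter `eps`, and the synthetic data are all illustrative choices.

```python
import numpy as np

def mvdr_weights(phi_ss, phi_nn, eps=1e-6):
    """Conventional MVDR weights for one frequency bin.

    phi_ss, phi_nn: (M, M) Hermitian speech and noise spatial covariances.
    Returns w of shape (M,) with w^H v = 1 for the steering vector v.
    """
    m = phi_nn.shape[0]
    # Eigenvalue decomposition: eigh returns eigenvalues in ascending
    # order, so the last column is the principal eigenvector.
    _, eigvecs = np.linalg.eigh(phi_ss)
    v = eigvecs[:, -1]
    # Matrix inversion, with diagonal loading (an assumption here) for
    # numerical stability.
    phi_nn_inv = np.linalg.inv(phi_nn + eps * np.eye(m))
    numerator = phi_nn_inv @ v
    return numerator / (v.conj() @ numerator)

# Synthetic single-bin example: M microphones, T frames.
rng = np.random.default_rng(0)
M, T = 4, 200
steer = rng.standard_normal(M) + 1j * rng.standard_normal(M)
source = rng.standard_normal(T) + 1j * rng.standard_normal(T)
speech = np.outer(steer, source)                               # (M, T)
noise = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))

phi_ss = speech @ speech.conj().T / T
phi_nn = noise @ noise.conj().T / T

w = mvdr_weights(phi_ss, phi_nn)
enhanced = w.conj() @ (speech + noise)                         # (T,)
```

Both the eigendecomposition and the inverse are computed per frequency bin from the estimated covariance matrices; these are the steps the ADL-MVDR line of work is described as replacing with recurrent networks.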

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing