All-neural beamformer for continuous speech separation

Zhuohuang Zhang; Takuya Yoshioka; Naoyuki Kanda; Zhuo Chen; Xiaofei Wang; Dongmei Wang; Sefik Emre Eskimez

arXiv:2110.06428·eess.AS·October 14, 2021

TL;DR

This paper introduces an end-to-end neural beamformer for continuous speech separation that is simpler to implement than mask-based MVDR pipelines. It significantly reduces word error rate on LibriCSS and performs comparably to a state-of-the-art MVDR system on real meeting data.

Contribution

It adapts the all deep learning MVDR (ADL-MVDR) model for continuous speech separation, enabling neural beamforming with end-to-end training and real-time processing capabilities.

Findings

01. Significant word error rate reduction on the LibriCSS dataset.

02. Performance comparable to state-of-the-art MVDR systems on real meeting data.

03. Potential for reduced latency and a simpler runtime implementation.

Abstract

Continuous speech separation (CSS) aims to separate overlapping voices from a continuous influx of conversational audio containing an unknown number of utterances spoken by an unknown number of speakers. A common application scenario is transcribing a meeting conversation recorded by a microphone array. Prior studies explored various deep learning models for time-frequency mask estimation, followed by a minimum variance distortionless response (MVDR) filter to improve the automatic speech recognition (ASR) accuracy. The performance of these methods is fundamentally upper-bounded by MVDR's spatial selectivity. Recently, the all deep learning MVDR (ADL-MVDR) model was proposed for neural beamforming and demonstrated superior performance in a target speech extraction task using pre-segmented input. In this paper, we further adapt ADL-MVDR to the CSS task with several enhancements to enable…
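For orientation, the conventional pipeline the abstract contrasts against (time-frequency mask estimation followed by an MVDR filter) can be sketched as below. This is a minimal NumPy illustration, not the paper's ADL-MVDR model and not the authors' code. It assumes the common reference-channel MVDR formulation, in which the weights are the reference column of Φ_n⁻¹Φ_s divided by its trace. The function names (`mvdr_weights`, `mask_based_mvdr`), the array layouts, and the `eps` regularizer are all hypothetical choices made for the example.

```python
import numpy as np


def mvdr_weights(phi_s, phi_n, ref_mic=0, eps=1e-8):
    """Reference-channel MVDR weights (assumed formulation, see lead-in).

    phi_s, phi_n: (freqs, mics, mics) speech / noise spatial covariances.
    Returns weights of shape (freqs, mics).
    """
    n_mics = phi_n.shape[-1]
    # Light diagonal loading keeps the noise covariance invertible.
    phi_n = phi_n + eps * np.eye(n_mics)
    # Solve phi_n @ X = phi_s per frequency, i.e. X = phi_n^{-1} phi_s.
    numer = np.linalg.solve(phi_n, phi_s)
    trace = np.trace(numer, axis1=1, axis2=2)[:, None]
    # Pick the column belonging to the reference microphone.
    return numer[:, :, ref_mic] / (trace + eps)


def mask_based_mvdr(stft, speech_mask, noise_mask, ref_mic=0, eps=1e-8):
    """Apply mask-based MVDR to a multichannel STFT.

    stft: complex array, shape (mics, frames, freqs).
    speech_mask, noise_mask: real arrays in [0, 1], shape (frames, freqs).
    Returns the beamformed STFT, shape (frames, freqs).
    """
    y = stft.transpose(2, 1, 0)  # (freqs, frames, mics)

    def masked_cov(mask):
        m = mask.T[:, :, None]  # (freqs, frames, 1)
        # Mask-weighted sum over frames of the outer product y y^H.
        cov = np.einsum("ftm,ftn->fmn", m * y, y.conj())
        return cov / (mask.T.sum(axis=1)[:, None, None] + eps)

    w = mvdr_weights(masked_cov(speech_mask), masked_cov(noise_mask), ref_mic, eps)
    # Beamformer output: w^H y for every frame and frequency.
    return np.einsum("fm,ftm->tf", w.conj(), y)
```

In a real system the masks would come from a trained separation network. Here they are simply inputs, so the sketch covers only the statistics and filtering steps.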

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques