All-neural beamformer for continuous speech separation
Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez

TL;DR
This paper introduces an end-to-end neural beamformer for continuous speech separation that improves transcription accuracy and simplifies implementation compared to traditional methods, demonstrating significant performance gains on multiple datasets.
Contribution
It adapts the all deep learning MVDR (ADL-MVDR) model for continuous speech separation, enabling neural beamforming with end-to-end training and real-time processing capabilities.
Findings
Significant word error rate reduction on the LibriCSS dataset.
Comparable performance to state-of-the-art MVDR systems on real meeting data.
Potential for reduced latency and simplified runtime implementation.
Abstract
Continuous speech separation (CSS) aims to separate overlapping voices from a continuous influx of conversational audio containing an unknown number of utterances spoken by an unknown number of speakers. A common application scenario is transcribing a meeting conversation recorded by a microphone array. Prior studies explored various deep learning models for time-frequency mask estimation, followed by a minimum variance distortionless response (MVDR) filter to improve the automatic speech recognition (ASR) accuracy. The performance of these methods is fundamentally upper-bounded by MVDR's spatial selectivity. Recently, the all deep learning MVDR (ADL-MVDR) model was proposed for neural beamforming and demonstrated superior performance in a target speech extraction task using pre-segmented input. In this paper, we further adapt ADL-MVDR to the CSS task with several enhancements to enable…
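For context, the conventional pipeline that the abstract describes (time-frequency mask estimation followed by an MVDR filter) can be sketched as below. This is a minimal NumPy illustration of a standard mask-based MVDR formulation with a reference channel. It is not the paper's ADL-MVDR model, which replaces these closed-form steps with learned components. The function names, array layouts, and the diagonal-loading constant are assumptions made for this sketch.

```python
import numpy as np


def masked_covariance(stft, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency bin.

    stft: complex array, shape (channels, frames, freqs)  [assumed layout]
    mask: real array in [0, 1], shape (frames, freqs)
    returns: complex array, shape (freqs, channels, channels)
    """
    weighted = stft * mask[None, :, :]
    cov = np.einsum("ctf,dtf->fcd", weighted, stft.conj())
    return cov / (mask.sum(axis=0)[:, None, None] + eps)


def mvdr_weights(cov_speech, cov_noise, ref_channel=0, diag_load=1e-6):
    """MVDR weights w = (Phi_n^-1 Phi_s) u / trace(Phi_n^-1 Phi_s).

    u is a one-hot vector selecting the reference channel.
    diag_load is a hypothetical regularization constant for this sketch.
    returns: complex array, shape (freqs, channels)
    """
    n_ch = cov_noise.shape[-1]
    # Diagonal loading keeps the noise covariance well conditioned.
    scale = np.trace(cov_noise, axis1=-2, axis2=-1).real / n_ch
    loaded = cov_noise + diag_load * (scale[:, None, None] + 1e-12) * np.eye(n_ch)
    # Solve Phi_n X = Phi_s for every frequency bin at once.
    numerator = np.linalg.solve(loaded, cov_speech)
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    return numerator[..., ref_channel] / (trace[:, None] + 1e-12)


def apply_beamformer(weights, stft):
    """Apply w^H y per bin; returns shape (frames, freqs)."""
    return np.einsum("fc,ctf->tf", weights.conj(), stft)
```

In this formulation the separation quality depends on the estimated masks only through the two covariance matrices, and the filter itself is a fixed linear function of them per frequency bin. That is the sense in which the abstract says such systems are bounded by MVDR's spatial selectivity.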
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
