VarArray: Array-Geometry-Agnostic Continuous Speech Separation
Takuya Yoshioka, Xiaofei Wang, Dongmei Wang, Min Tang, Zirun Zhu, Zhuo Chen, Naoyuki Kanda

TL;DR
VarArray introduces an array-geometry-agnostic neural network for continuous speech separation that adapts to any number of microphones, improving real-world transcription accuracy without retraining.
Contribution
The paper presents a neural network model that applies to any microphone array configuration without retraining. It combines three previously separate techniques: transform-average-concatenate, conformer speech separation, and inter-channel phase differences.
Findings
Outperforms previous array-geometry-agnostic models across configurations.
Achieves speaker-agnostic word error rates of 17.5% and 20.4%.
Effective in realistic meeting transcription scenarios.
Abstract
Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription. This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model. The proposed model is applicable to any number of microphones without retraining while leveraging the nonlinear correlation between the input channels. The proposed method adapts different elements that were proposed before separately, including transform-average-concatenate, conformer speech separation, and inter-channel phase differences, and combines them in an efficient and cohesive way. Large-scale evaluation was performed with two real meeting transcription tasks by using a fully developed transcription system requiring no prior knowledge such as reference segmentations, which allowed us to measure the impact that the…
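To illustrate why a transform-average-concatenate (TAC) style layer can accept any number of microphones, here is a minimal NumPy sketch. It is not the VarArray implementation: the layer sizes, the random weights, the ReLU activations, the residual connection, and the names `tac_layer`, `w_transform`, `w_average`, and `w_concat` are all assumptions chosen for illustration. The point it shows is that every weight matrix is shared across channels and the only cross-channel operation is a mean, so the same parameters work for 2, 3, or 7 channels.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tac_layer(x, w_transform, w_average, w_concat):
    """Illustrative TAC-style layer (hypothetical sizes and weights).

    x has shape (channels, frames, features). No weight depends on the
    number of channels, so the layer is channel-count agnostic.
    """
    # Transform: the same projection is applied to every channel.
    h = relu(x @ w_transform)                      # (C, T, H)
    # Average: pool across channels, then project the pooled features.
    pooled = relu(h.mean(axis=0) @ w_average)      # (T, H)
    pooled = np.broadcast_to(pooled, h.shape)      # (C, T, H)
    # Concatenate: join per-channel and pooled features, project back.
    out = relu(np.concatenate([h, pooled], axis=-1) @ w_concat)  # (C, T, F)
    # Residual connection (an assumption of this sketch).
    return x + out


rng = np.random.default_rng(0)
feat, hidden = 8, 16
w_t = rng.standard_normal((feat, hidden)) * 0.1
w_a = rng.standard_normal((hidden, hidden)) * 0.1
w_c = rng.standard_normal((2 * hidden, feat)) * 0.1

# The same weights handle arrays with different microphone counts.
for n_ch in (2, 3, 7):
    x = rng.standard_normal((n_ch, 5, feat))
    y = tac_layer(x, w_t, w_a, w_c)
    print(n_ch, "channels ->", y.shape)
```

Because the pooling step is a mean, reordering the input channels simply reorders the output channels in the same way, which is one reason this kind of layer does not need to know the array geometry.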
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
