Continuous Speech Separation with Ad Hoc Microphone Arrays
Dongmei Wang, Takuya Yoshioka, Zhuo Chen, Xiaofei Wang, Tianyan Zhou, Zhong Meng

TL;DR
This paper introduces a transformer-based continuous speech separation method tailored for ad hoc microphone arrays, effectively handling real recordings and reducing speech duplication issues, thereby improving multi-talker speech recognition accuracy.
Contribution
It extends neural network-based speech separation to continuous recordings with ad hoc arrays, proposing techniques for device mismatch mitigation and speaker counting.
Findings
Significant improvement in ASR accuracy for overlapped speech.
Effective handling of real continuous recordings with ad hoc arrays.
Minimal performance loss on single-talker segments.
Abstract
Speech separation has been shown effective for multi-talker speech recognition. Under the ad hoc microphone array setup where the array consists of spatially distributed asynchronous microphones, additional challenges must be overcome as the geometry and number of microphones are unknown beforehand. Prior studies show, with a spatial-temporal interleaving structure, neural networks can efficiently utilize the multi-channel signals of the ad hoc array. In this paper, we further extend this approach to continuous speech separation. Several techniques are introduced to enable speech separation for real continuous recordings. First, we apply a transformer-based network for spatio-temporal modeling of the ad hoc array signals. In addition, two methods are proposed to mitigate a speech duplication problem during single talker segments, which seems more severe in the ad hoc array scenarios. One…
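To illustrate why an interleaved spatial/temporal structure can cope with an unknown number of microphones, here is a minimal NumPy sketch. It is not the paper's transformer network: `cross_channel_step`, `temporal_step`, and `interleaved_stack` are hypothetical stand-ins (channel averaging and a moving average over frames) chosen only to show the alternation between the channel axis and the time axis.

```python
import numpy as np


def cross_channel_step(x):
    """Stand-in for a cross-channel (spatial) layer. Assumption: not the paper's layer.

    x has shape (channels, frames, features). Each channel is blended with the
    mean over all channels, so the step works for any channel count and does
    not depend on channel order.
    """
    channel_mean = x.mean(axis=0, keepdims=True)
    return 0.5 * (x + channel_mean)


def temporal_step(x, width=3):
    """Stand-in for a temporal layer: moving average over frames, per channel."""
    pad = width // 2
    # Repeat edge frames so the output keeps the same number of frames.
    padded = np.pad(x, ((0, 0), (pad, pad), (0, 0)), mode="edge")
    frames = x.shape[1]
    shifted = [padded[:, i:i + frames] for i in range(width)]
    return np.stack(shifted).mean(axis=0)


def interleaved_stack(x, num_blocks=2):
    """Alternate the cross-channel and temporal steps num_blocks times."""
    for _ in range(num_blocks):
        x = cross_channel_step(x)
        x = temporal_step(x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for num_channels in (2, 5):  # the same code handles any array size
        signals = rng.standard_normal((num_channels, 10, 4))
        print(num_channels, interleaved_stack(signals).shape)
```

Because the cross-channel step only uses an average over channels, the same stack accepts any number of channels, and reordering the input channels simply reorders the outputs. A learned model would replace both stand-ins with trainable layers.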
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: High-Order Consensuses
