Handling Trade-Offs in Speech Separation with Sparsely-Gated Mixture of Experts

Xiaofei Wang; Zhuo Chen; Yu Shi; Jian Wu; Naoyuki Kanda; Takuya Yoshioka

arXiv:2211.06493·eess.AS·June 1, 2023·1 cites

Open Access

TL;DR

This paper introduces a sparsely-gated mixture-of-experts (MoE) architecture for monaural speech separation. It balances separation quality against computational cost, and the handling of overlapped speech against processing artifacts in non-overlapped regions.

Contribution

The paper proposes a novel sparsely-gated MoE architecture that improves speech separation performance while reducing artifacts and maintaining low computational overhead.

Findings

01. Achieves superior separation with less distortion
02. Maintains low computational cost with marginal runtime increase
03. Effective on both simulated and real recordings

Abstract

Employing a monaural speech separation (SS) model as a front-end for automatic speech recognition (ASR) involves balancing two kinds of trade-offs. First, while a larger model improves the SS performance, it also requires a higher computational cost. Second, an SS model that is more optimized for handling overlapped speech is likely to introduce more processing artifacts in non-overlapped-speech regions. In this paper, we address these trade-offs with a sparsely-gated mixture-of-experts (MoE) architecture. Comprehensive evaluation results obtained using both simulated and real meeting recordings show that our proposed sparsely-gated MoE SS model achieves superior separation capabilities with less speech distortion, while involving only a marginal run-time cost increase.
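
The abstract does not spell out the layer design, so the following is only a minimal sketch of the general sparsely-gated MoE idea, not the authors' model. It illustrates why run-time cost can stay nearly flat as experts are added: a small gate scores all experts for each input frame, but only the top-scoring ones are actually evaluated. The class name `SparseMoE`, the linear experts, the top-1 routing, and all sizes are assumptions chosen for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class SparseMoE:
    """Hypothetical sparsely-gated MoE layer with linear experts."""

    def __init__(self, dim, num_experts, top_k=1, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.gate = rand_matrix(num_experts, dim)  # one gate row per expert
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, x):
        """Return the indices of the top-k experts and their mixing weights."""
        logits = matvec(self.gate, x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        # Renormalize over the selected experts only.
        weights = softmax([logits[i] for i in chosen])
        return chosen, weights

    def forward(self, x):
        """Evaluate only the selected experts and mix their outputs."""
        chosen, weights = self.route(x)
        out = [0.0] * len(x)
        for idx, w in zip(chosen, weights):
            y = matvec(self.experts[idx], x)  # unselected experts are never run
            out = [o + w * v for o, v in zip(out, y)]
        return out, chosen


# Usage: one 4-dimensional feature frame routed through 3 experts.
layer = SparseMoE(dim=4, num_experts=3, top_k=1)
frame = [0.5, -1.0, 0.25, 2.0]
output, chosen = layer.forward(frame)
```

With `top_k=1`, each frame costs one gate evaluation plus one expert evaluation regardless of `num_experts`, so capacity grows with the number of experts while per-frame compute stays roughly constant. A gate of this kind could, in principle, also send overlapped and non-overlapped frames to different experts, which is one way to read the second trade-off in the abstract; whether the paper's model routes this way is not stated here.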

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing