Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning
Zhaoxi Mu, Xinyu Yang, Wenjing Zhu

TL;DR
This paper introduces ISCIT, a multi-dimensional, multi-scale speech separation model that uses discriminative learning to improve separation quality, especially for similar-sounding speakers, and reports state-of-the-art results on WSJ0-2mix and WHAM!.
Contribution
The paper proposes SE-Conformer, a network that models audio sequences in multiple dimensions and scales within a dual-path separation framework, together with Multi-Block Feature Aggregation and a speaker similarity discriminative loss.
Findings
Achieves state-of-the-art results on WSJ0-2mix and WHAM! datasets.
Effectively models multi-dimensional and multi-scale audio features.
Improves separation performance for speakers with similar voices.
Abstract
The Transformer has shown strong performance in speech separation, benefiting from its ability to capture global features. However, capturing local features and channel information of audio sequences is equally important in speech separation. In this paper, we present a novel approach named Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation. Specifically, we design a new network, SE-Conformer, that can model audio sequences in multiple dimensions and scales, and apply it to the dual-path speech separation framework. Furthermore, we propose Multi-Block Feature Aggregation to improve separation by selectively utilizing information from the intermediate blocks of the separation network. Meanwhile, we propose a speaker similarity discriminative loss to optimize the speech separation model to address the problem of poor performance when speakers have similar…
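The dual-path framework mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how ISCIT segments its input, so the code below assumes the convention common in dual-path separation models: the encoded feature sequence is cut into 50%-overlapping chunks, an intra-chunk module runs along the frames inside each chunk (local context), and an inter-chunk module runs across chunks at a fixed position (global context). The function names segment and dual_path_views, the zero-padding scheme, and the chunk length are hypothetical choices for illustration; the SE-Conformer and Transformer modules themselves are not shown.

```python
import math

import numpy as np


def segment(features: np.ndarray, chunk_len: int) -> np.ndarray:
    """Split [channels, time] features into 50%-overlapping chunks.

    Returns an array of shape [channels, chunk_len, num_chunks].
    The sequence is zero-padded at the end so the last chunk is full.
    (Assumed convention, not taken from the paper.)
    """
    if chunk_len < 2 or chunk_len % 2:
        raise ValueError("chunk_len must be an even number >= 2")
    hop = chunk_len // 2
    _, time = features.shape
    # Number of chunks needed to cover the whole sequence with this hop.
    num_chunks = max(1, math.ceil((time - chunk_len) / hop) + 1)
    padded_len = (num_chunks - 1) * hop + chunk_len
    padded = np.pad(features, ((0, 0), (0, padded_len - time)))
    chunks = [padded[:, i * hop : i * hop + chunk_len] for i in range(num_chunks)]
    return np.stack(chunks, axis=2)


def dual_path_views(chunks: np.ndarray):
    """Return the two sequence layouts a dual-path block alternates between.

    intra: [num_chunks, chunk_len, channels], one short sequence per chunk.
    inter: [chunk_len, num_chunks, channels], one sequence across chunks
           for each within-chunk position.
    """
    intra = chunks.transpose(2, 1, 0)
    inter = chunks.transpose(1, 2, 0)
    return intra, inter


if __name__ == "__main__":
    x = np.arange(22, dtype=float).reshape(2, 11)  # 2 channels, 11 frames
    c = segment(x, chunk_len=4)
    intra, inter = dual_path_views(c)
    print(c.shape, intra.shape, inter.shape)
```

In this layout an intra-chunk module only ever sees chunk_len frames at a time, while an inter-chunk module sees one frame per chunk, which is how a dual-path design keeps both sequence lengths short while still covering the full utterance.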
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
