Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

Zhaoxi Mu; Xinyu Yang; Wenjing Zhu

arXiv:2303.03737 · cs.SD · March 8, 2023 · 1 citation

TL;DR

This paper introduces ISCIT (Intra-SE-Conformer and Inter-Transformer), a speech separation model that captures audio features in multiple dimensions and at multiple scales, and is trained with a discriminative loss to improve separation when speakers sound alike. It reports state-of-the-art results on WSJ0-2mix and WHAM!.

Contribution

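The paper proposes three components: SE-Conformer, a network that models audio sequences in multiple dimensions and scales inside a dual-path separation framework; Multi-Block Feature Aggregation, which selectively uses information from intermediate blocks of the separation network; and a speaker similarity discriminative loss aimed at speakers with similar voices.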
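
One way to picture "selectively utilizing information from the intermediate blocks" is a weighted sum of the block outputs, with weights obtained from a softmax over per-block scores. The sketch below is only an illustration of that idea under this assumption; the summary does not state how the paper's aggregation is actually computed, and the names `softmax`, `aggregate_blocks`, and `scores` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_blocks(block_outputs, scores):
    """Weighted sum of per-block feature vectors (hypothetical aggregation rule).

    block_outputs: list of equal-length feature vectors, one per block.
    scores: one real-valued score per block; in a trained model these
            would be learned, here they are plain inputs.
    """
    if len(block_outputs) != len(scores):
        raise ValueError("need exactly one score per block")
    weights = softmax(scores)
    dim = len(block_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, block_outputs))
        for i in range(dim)
    ]


if __name__ == "__main__":
    outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # A larger score pulls the aggregate toward that block's output.
    print(aggregate_blocks(outputs, [0.0, 0.0, 2.0]))
```

With equal scores the result is the plain average of the block outputs; raising one score shifts the aggregate toward that block, which is the "selective" behavior this sketch assumes.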

Findings

01. Achieves state-of-the-art results on the WSJ0-2mix and WHAM! datasets.

02. Effectively models multi-dimensional and multi-scale audio features.

03. Improves separation performance for speakers with similar voices.

Abstract

The Transformer has shown strong performance in speech separation, benefiting from its ability to capture global features. However, capturing local features and channel information of audio sequences is equally important in speech separation. In this paper, we present a novel approach named Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation. Specifically, we design a new network, SE-Conformer, that can model audio sequences in multiple dimensions and scales, and apply it to the dual-path speech separation framework. Furthermore, we propose Multi-Block Feature Aggregation to improve separation quality by selectively utilizing information from the intermediate blocks of the separation network. Meanwhile, we propose a speaker similarity discriminative loss to optimize the speech separation model to address the problem of poor performance when speakers have similar…
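
In dual-path separation models generally, a long feature sequence is cut into short overlapping chunks; one module then works along each chunk (intra-chunk, local context) and another works across chunks at a fixed position (inter-chunk, global context). The name ISCIT suggests the SE-Conformer takes the intra role and a Transformer the inter role. The following is a minimal sketch of the segmentation step only, assuming 50% overlap and zero padding as is common in dual-path models; the chunk size, padding rule, and the function names `segment` and `inter_view` are illustrative and not taken from the paper.

```python
def segment(seq, chunk_size):
    """Split seq into chunks of chunk_size with 50% overlap.

    The tail is zero-padded so that the final chunk is full length.
    """
    if chunk_size < 2 or chunk_size % 2:
        raise ValueError("chunk_size must be an even integer >= 2")
    hop = chunk_size // 2
    n = len(seq)
    if n <= chunk_size:
        padded_len = chunk_size
    else:
        remainder = (n - chunk_size) % hop
        padded_len = n + (hop - remainder) % hop
    padded = list(seq) + [0.0] * (padded_len - n)
    return [
        padded[start:start + chunk_size]
        for start in range(0, padded_len - chunk_size + 1, hop)
    ]


def inter_view(chunks):
    """Transpose the chunk grid: one sequence per within-chunk position.

    Each returned list runs across chunks, which is the axis an
    inter-chunk module would process.
    """
    return [list(column) for column in zip(*chunks)]


if __name__ == "__main__":
    chunks = segment(list(range(1, 11)), chunk_size=4)
    print(chunks)              # intra-chunk view: rows are short local windows
    print(inter_view(chunks))  # inter-chunk view: rows span the whole sequence
```

Each row of `chunks` is a short window that a local module can process cheaply, while each row of `inter_view(chunks)` strides across the full sequence, so alternating the two views lets a model cover long inputs without attending over every frame at once.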

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing