Multi-Scale Feature Fusion Transformer Network for End-to-End Single Channel Speech Separation
Yinhao Xu, Jian Zhou, Liang Tao, and Hon Keung Kwan

TL;DR
This paper introduces MSFFT-Net, a multi-scale feature fusion transformer for single-channel speech separation, outperforming previous dual-path models by capturing local and global context through parallel processing paths.
Contribution
The paper proposes a novel multi-scale feature fusion transformer network with parallel paths and iterative intra- and inter-chunk operations for improved speech separation.
Findings
Achieved state-of-the-art SI-SNRi scores on the WSJ0-2mix dataset.
Outperformed original dual-path models without data augmentation.
Demonstrated better results with multiple parallel processing paths.
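The findings above are reported in SI-SNRi (scale-invariant signal-to-noise ratio improvement), which this summary does not define. The sketch below follows the commonly used definition of that metric and is not taken from the paper's code. The function names, the `eps` stabilizer, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (common definition)."""
    # Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)

# Synthetic stand-ins for speech (assumption: white noise, not real audio).
rng = np.random.default_rng(0)
reference = rng.standard_normal(8000)
interferer = rng.standard_normal(8000)
mixture = reference + interferer           # what the separator receives
estimate = reference + 0.1 * interferer    # a hypothetical separator output
improvement = si_snr_improvement(estimate, mixture, reference)
print(f"SI-SNRi: {improvement:.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric suits separation models whose output gain is arbitrary.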
Abstract
Recent studies on time-domain audio separation networks (TasNets) have made great strides in speech separation. One of the most representative TasNets is a network with a dual-path segmentation approach. However, the original model, called DPRNN, used a fixed feature dimension and an unchanged segment size throughout all layers of the network. In this paper, we propose a multi-scale feature fusion transformer network (MSFFT-Net) based on the conventional dual-path structure for single-channel speech separation. The conventional dual-path structure has only one processing path, which adopts several iterative blocks with alternating intra-chunk and inter-chunk operations to capture local and global context information. In contrast, the proposed MSFFT-Net has multiple parallel processing paths, and feature information can be exchanged between them.…
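To make the dual-path segmentation mentioned in the abstract concrete, the following is a minimal sketch of how a feature sequence can be cut into overlapping chunks so that one axis carries local (intra-chunk) context and another carries global (inter-chunk) context. It is an illustration under stated assumptions, not the MSFFT-Net implementation: the function name `segment`, the 50% overlap, and the zero-padding at the end are choices made here for the example.

```python
import numpy as np

def segment(features, chunk_size):
    """Split [N, L] features into half-overlapping chunks of shape [N, K, S].

    N: feature dimension, L: sequence length,
    K: chunk size, S: number of chunks.
    Assumes chunk_size >= 2, 50% overlap, and zero-padding at the end.
    """
    n_feat, length = features.shape
    hop = chunk_size // 2
    # Number of chunks needed to cover the sequence (ceiling division).
    n_chunks = max(1, -(-(length - chunk_size) // hop) + 1)
    padded_len = (n_chunks - 1) * hop + chunk_size
    padded = np.pad(features, ((0, 0), (0, padded_len - length)))
    # Stack chunks along a new last axis -> [N, K, S].
    return np.stack(
        [padded[:, i * hop : i * hop + chunk_size] for i in range(n_chunks)],
        axis=2,
    )

features = np.arange(20, dtype=float).reshape(2, 10)   # N=2, L=10
chunks = segment(features, chunk_size=4)                # -> [2, 4, 4]

# Intra-chunk view: S sequences of length K (local context inside a chunk).
intra_view = chunks.transpose(2, 1, 0)                  # [S, K, N]
# Inter-chunk view: K sequences of length S (global context across chunks).
inter_view = chunks.transpose(1, 2, 0)                  # [K, S, N]
print(chunks.shape, intra_view.shape, inter_view.shape)
```

A dual-path block would apply one sequence model along the K axis of `intra_view` and another along the S axis of `inter_view`, alternating between the two. In this sketch `chunk_size` is a single fixed value, which corresponds to the unchanged segment size that the abstract attributes to the original DPRNN.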
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
