TL;DR
This paper introduces an asymmetric encoder-decoder architecture with global and local Transformer blocks for speech separation, achieving state-of-the-art results efficiently without chunking.
Contribution
It proposes a novel asymmetric separation framework with global and local Transformer blocks that handle long sequences more efficiently, surpassing traditional dual-path models.
Findings
Achieves state-of-the-art performance on benchmark datasets.
Reduces computational complexity compared to dual-path models.
Handles long sequences without chunking.
Abstract
In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence features from a learnable encoder. Conventionally, the features are separated into speaker-specific ones at the final stage of the network. Instead, we propose a more intuitive strategy that separates features earlier by expanding the feature sequence to the number of speakers as an extra dimension. To achieve this, an asymmetric strategy is presented in which the encoder and decoder are partitioned to perform distinct processing in separation tasks. The encoder analyzes features, and the output of the encoder is split into the number of speakers to be separated. The separated sequences are then reconstructed by the weight-shared decoder, which also performs cross-speaker processing. Without relying on speaker information, the weight-shared network in the decoder directly…
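The early-split idea described in the abstract can be illustrated with a toy sketch. It is not the paper's implementation: the per-speaker linear projections, the single-matrix "decoder", and the names `split_by_speaker` and `shared_decoder` are all assumptions made for illustration, and the cross-speaker processing and Transformer blocks are omitted. It only shows the shape bookkeeping: an encoder output of shape [T][D] is expanded to [J][T][D] for J speakers, and one set of decoder weights is reused for every speaker stream.

```python
import random

def linear(x, w):
    """Apply a weight matrix w (out_dim x in_dim) to a feature vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def split_by_speaker(features, split_weights):
    """Expand a [T][D] sequence into [J][T][D].

    Assumption: one linear projection per speaker; the abstract does not
    specify the form of the split.
    """
    return [[linear(frame, w) for frame in features] for w in split_weights]

def shared_decoder(stream, decoder_weight):
    """Toy stand-in for the decoder: the same weights serve every speaker."""
    return [linear(frame, decoder_weight) for frame in stream]

rng = random.Random(0)
T, D, J = 5, 4, 2  # frames, feature size, speakers (illustrative values)

encoder_out = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(T)]
split_weights = [rand_matrix(D, D, rng) for _ in range(J)]
decoder_weight = rand_matrix(D, D, rng)  # shared across all speakers

streams = split_by_speaker(encoder_out, split_weights)          # [J][T][D]
outputs = [shared_decoder(s, decoder_weight) for s in streams]  # [J][T][D]

print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # prints "2 5 4"
```

Because the decoder weights are shared, the number of decoder parameters in this sketch does not grow with the number of speakers; only the split step is speaker-specific.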