TL;DR
This paper introduces DPCCN, a robust time-frequency domain speech separation and extraction network that outperforms existing time-domain methods, especially in cross-domain and noisy environments, by using a densely-connected pyramid structure and a novel speaker encoder.
Contribution
The paper presents a novel densely-connected pyramid complex convolutional network (DPCCN) for robust speech separation and extraction, including a new speaker encoder and a Mixture-Remix adaptation method for cross-domain tasks.
Findings
DPCCN outperforms time-domain methods in robustness and accuracy.
Mixture-Remix fine-tuning significantly improves cross-domain speech extraction.
DPCCN achieves around 3.5 dB SISNR improvement in cross-domain tests.
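For readers unfamiliar with the metric behind the last finding, here is a minimal NumPy sketch of scale-invariant signal-to-noise ratio (SISNR), using the standard definition from the speech separation literature. It is not taken from the DPCCN code; the function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are assumptions made for illustration.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length.

    Hypothetical helper for illustration; not part of any DPCCN release.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Remove the DC offset so the measure ignores constant shifts.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection counts as noise.
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, mixture, target):
    """Gain in dB of the estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is preferred over plain SNR when a model's output gain is arbitrary.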
Abstract
In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the acoustic environment and to tasks with wide domain coverage. In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated conditions. Furthermore, we generalize DPCCN to target speech extraction (TSE) by integrating a new, specially designed speaker encoder. We also investigate the robustness of DPCCN on unsupervised cross-domain TSE tasks. A Mixture-Remix approach is proposed to adapt to the target domain acoustic characteristics when fine-tuning the source model. We evaluate the proposed methods not only under noisy and reverberant in-domain conditions, but also under clean, cross-domain conditions. Results show…
