TL;DR
Conv-TasNet is an end-to-end time-domain speech separation network that outperforms time-frequency masking methods, and even ideal time-frequency magnitude masks, while using a smaller model and shorter latency, which makes it suitable for real-time applications.
Contribution
The paper introduces Conv-TasNet, a fully-convolutional time-domain audio separation network that separates speakers more accurately than previous time-frequency masking methods and ideal time-frequency magnitude masks, with a smaller model and shorter latency.
Findings
Outperforms previous time-frequency masking methods in separation quality.
Surpasses ideal time-frequency magnitude masks in objective and subjective evaluations.
Has smaller model size and shorter latency, enabling real-time processing.
Abstract
Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is…
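The linear encoder described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the frame length, hop, basis count, and the random matrices `U` and `V` are hypothetical stand-ins for learned parameters, and the random softmax masks are only a placeholder for the separation module. The sketch also assumes that the encoder output is masked once per speaker and mapped back to a waveform by a linear decoder with overlap-add, a step the truncated abstract above does not show.

```python
import numpy as np

def frame(x, L, hop):
    """Split a 1-D waveform into overlapping frames of length L."""
    n_frames = 1 + (len(x) - L) // hop
    idx = np.arange(L)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]  # shape: (n_frames, L)

def overlap_add(frames, hop):
    """Reassemble overlapping frames into one waveform by summation."""
    n_frames, L = frames.shape
    out = np.zeros(hop * (n_frames - 1) + L)
    for k in range(n_frames):
        out[k * hop:k * hop + L] += frames[k]
    return out

rng = np.random.default_rng(0)

# Hypothetical sizes: frame length, hop, number of basis functions, speakers.
L, hop, N, C = 16, 8, 64, 2

# Random stand-ins for the learned encoder and decoder bases.
U = 0.1 * rng.standard_normal((L, N))   # encoder: frame -> representation
V = 0.1 * rng.standard_normal((N, L))   # decoder: representation -> frame

x = rng.standard_normal(160)            # stand-in for a mixture waveform
w = frame(x, L, hop) @ U                # linear encoder output, (n_frames, N)

# Placeholder for the separation module: random masks that sum to 1
# across speakers (an assumption made for this sketch only).
logits = rng.standard_normal((C,) + w.shape)
masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# Mask the shared representation per speaker, decode, and overlap-add.
estimates = [overlap_add((m * w) @ V, hop) for m in masks]
```

Because the placeholder masks sum to one across speakers and every step after masking is linear, the per-speaker estimates in this sketch add up to the decoded unmasked representation. In a trained model the encoder, masks, and decoder are learned jointly, so that each estimate approximates one speaker.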
Taxonomy
Methods: Convolutional time-domain audio separation network
