An enhanced Conv-TasNet model for speech separation using a speaker distance-based loss function

Jose A. Arango-Sánchez; Julián D. Arias-Londoño

arXiv:2205.13657 · eess.AS · June 20, 2022

TL;DR

This paper improves Spanish speech separation by adding a speaker distance-based loss term to Conv-TasNet, raising the SI-SDR from 9.9 dB to 10.6 dB, and examines the challenges of real-time deployment.

Contribution

It introduces an enhanced Conv-TasNet whose cost function adds a between-speakers cosine similarity term computed from pre-trained speech embeddings, applied to Spanish speech separation.

Findings

01 Best SI-SDR of 10.6 dB with the enhanced model

02 Inverse relationship between speaker similarity and performance

03 Real-time deployment issues with speaker channel synchronization
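For readers unfamiliar with the metric quoted in these findings, the following is a minimal NumPy sketch of the commonly used SI-SDR definition. It is not taken from the paper or its repository; the function name `si_sdr` and the stabilizing constant `eps` are illustrative assumptions.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch).

    `estimate` and `target` are 1-D arrays of equal length.
    `eps` is an assumed small constant that avoids division by zero.
    """
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target to obtain the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target

    # Everything not explained by the scaled target counts as distortion.
    distortion = estimate - projection

    ratio = (np.dot(projection, projection) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)
```

Because the target is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is why the metric is called scale-invariant.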

Abstract

This work addresses the problem of speech separation in the Spanish language using pre-trained deep learning models. As with many speech processing tasks, large databases in languages other than English are scarce. Therefore, this work explores different training strategies using the Conv-TasNet model as a benchmark. A scale-invariant signal-to-distortion ratio (SI-SDR) of 9.9 dB was achieved with the best training strategy. We then experimentally identified an inverse relationship between the speakers' similarity and the model's performance, so an improved Conv-TasNet architecture was proposed. The enhanced Conv-TasNet model uses pre-trained speech embeddings to add a between-speakers cosine similarity term to the cost function, yielding an SI-SDR of 10.6 dB. Lastly, final experiments regarding real-time deployment show some drawbacks in the speakers' channel…
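The abstract does not spell out how the cosine similarity term enters the cost function, so the sketch below is only one plausible reading: a separation loss (for example, a negative SI-SDR value) plus a weighted cosine similarity between the speaker embeddings of the two separated outputs. The function names, the `weight` parameter, and the additive form are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two embedding vectors (eps is an assumed guard)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def similarity_penalized_loss(separation_loss, emb_1, emb_2, weight=1.0):
    """Hypothetical combined loss: separation loss plus a similarity penalty.

    `separation_loss` is a scalar such as a negative SI-SDR (lower is better).
    `emb_1` and `emb_2` stand for embeddings of the two separated signals,
    assumed to come from some pre-trained speaker encoder.
    `weight` is an assumed hyperparameter balancing the two terms.
    """
    # Similar-sounding outputs raise the loss, pushing the model to keep
    # the separated speakers far apart in embedding space.
    return separation_loss + weight * cosine_similarity(emb_1, emb_2)
```

Under this reading, two outputs with identical embeddings add the full `weight` to the loss, while orthogonal embeddings add nothing.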

Code & Models

Repositories

dw-speech-separation/train-test-convtasnet
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Convolutional time-domain audio separation network