Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization

Federico Landini; Mireia Diez; Alicia Lozano-Diez; Lukáš Burget

arXiv:2211.06750 · eess.AS · February 27, 2023 · 1 cites

TL;DR

This paper extends simulated conversations (SC) to multiple speakers per conversation and to wide-band audio as training data for end-to-end neural diarization, yielding substantially better performance than simulated mixtures (SM) and reducing dependence on a fine-tuning stage.

Contribution

It builds on the recently proposed simulated conversations (SC) by generating them with multiple speakers per conversation and from wide-band public audio sources, and it releases the recipes for generating such data.

Findings

01. Multi-speaker simulated conversations (SC) outperform the original simulated mixtures (SM).

02. Wide-band simulated data improves diarization accuracy across various datasets.

03. Training on the new simulated data reduces dependence on a fine-tuning stage.

Abstract

End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once. Many flavors of end-to-end models have been proposed but all of them require (so far non-existing) large amounts of annotated data for training. The compromise solution consists in generating synthetic data and the recently proposed simulated conversations (SC) have shown remarkable improvements over the original simulated mixtures (SM). In this work, we create SC with multiple speakers per conversation and show that they allow for substantially better performance than SM, also reducing the dependence on a fine-tuning stage. We also create SC with wide-band public audio sources and present an analysis on several evaluation sets. Together with this publication, we release the recipes for generating such data and models…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling