Implicit spoken language diarization

Jagabandhu Mishra; Amartya Chowdhury; S. R. Mahadeva Prasanna

arXiv:2306.12913·eess.AS·June 23, 2023

TL;DR

This paper investigates implicit spoken language diarization with deep embedding vectors, adapting speaker diarization frameworks to the task. An end-to-end x-vector system performs well on synthetic code-switched data but degrades on practical data, and pre-trained wav2vec embeddings recover part of that loss.

Contribution

It explores implicit language modeling with deep embedding vectors for language diarization, avoiding the intermediate phoneme modeling and transcribed data that phonotactic approaches require, and reports improved results on practical data with pre-trained wav2vec embeddings.

Findings

01

The end-to-end x-vector approach achieves a 6.78% diarization error rate (DER) and a 7.06% Jaccard error rate (JER) on synthetic code-switched data.

02

On practical data, performance drops to 22.50% DER and 60.38% JER, which the authors attribute to data imbalance.

03

Pre-trained wav2vec embeddings yield a 30.74% relative improvement in JER.

Abstract

Spoken language diarization (LD) and related tasks are mostly explored using the phonotactic approach. Phonotactic approaches mostly use an explicit way of language modeling, hence requiring intermediate phoneme modeling and transcribed data. Alternatively, the ability of deep learning approaches to model temporal dynamics may help the implicit modeling of language information through deep embedding vectors. Hence, this work initially explores the available speaker diarization frameworks that capture speaker information implicitly to perform LD tasks. The performance of the LD system on synthetic code-switched data using the end-to-end x-vector approach is 6.78% and 7.06%, and for practical data is 22.50% and 60.38%, in terms of diarization error rate and Jaccard error rate (JER), respectively. The performance degradation is due to the data imbalance and is resolved to some extent by using…
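
The two metrics quoted in the abstract can be illustrated with a minimal frame-level sketch in Python. This is not the paper's scoring code: it assumes the reference and hypothesis are already aligned frame by frame, that both use the same language names (so no label mapping is needed), and that no forgiveness collar is applied. The function names `frame_der` and `frame_jer` and the `NON_SPEECH` label are hypothetical.

```python
from typing import Sequence

NON_SPEECH = "sil"  # hypothetical label for frames without speech


def frame_der(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Simplified diarization error rate over aligned frames.

    Any mismatching frame is a miss, a false alarm, or a language
    confusion; the sum is divided by the reference speech frames.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    ref_speech = sum(1 for r in ref if r != NON_SPEECH)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")
    errors = sum(1 for r, h in zip(ref, hyp) if r != h)
    return errors / ref_speech


def frame_jer(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Simplified Jaccard error rate: mean over reference languages of
    1 - |ref AND hyp| / |ref OR hyp|, counted in frames."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    languages = sorted(set(ref) - {NON_SPEECH})
    if not languages:
        raise ValueError("reference contains no speech frames")
    per_language = []
    for lang in languages:
        inter = sum(1 for r, h in zip(ref, hyp) if r == lang and h == lang)
        union = sum(1 for r, h in zip(ref, hyp) if r == lang or h == lang)
        per_language.append(1.0 - inter / union)
    return sum(per_language) / len(per_language)


# Toy code-switched utterance: six frames of one language, four of another.
ref = ["hi"] * 6 + ["en"] * 4
hyp = ["hi"] * 5 + ["en"] * 5  # language change detected one frame early
print(f"DER = {frame_der(ref, hyp):.4f}, JER = {frame_jer(ref, hyp):.4f}")
```

Because this JER averages per language rather than per frame, errors on a rarely occurring language count as much as errors on the dominant one. That may help explain why JER can sit far above DER when the language distribution is imbalanced, although the paper's own analysis should be consulted for the actual cause.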

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing