Implicit Self-supervised Language Representation for Spoken Language Diarization

Jagabandhu Mishra; S. R. Mahadeva Prasanna

arXiv:2308.10470·eess.AS·July 16, 2024

TL;DR

This paper introduces a self-supervised implicit language representation for spoken language diarization in code-switched scenarios, outperforming traditional x-vector methods especially in low-resource settings.

Contribution

It proposes a novel self-supervised implicit language representation that improves diarization accuracy over x-vector based methods in code-switched speech.

Findings

01

The self-supervised representation reduces JER by 63.9% (relative) compared to the x-vector.

02

The E2E framework with the new representation achieves a JER of 21.8.

03

Performance varies with dataset and segment duration, highlighting the importance of representation choice.

Abstract

In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-processing system is essential. Further, implicit frameworks are preferable to explicit frameworks, as they can be easily adapted to low- and zero-resource languages. Inspired by the speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmentation, and (3) end-to-end (E2E) modeling are proposed to perform LD. Initial exploration with the synthetic TTSF-LD dataset shows that using the x-vector as an implicit language representation with an appropriate analysis window length ( $N$ ) can achieve performance on par with explicit LD. The best implicit LD performance, a Jaccard error rate (JER) of $6.38$, is achieved with the E2E framework. However, with the E2E framework, the performance of implicit LD degrades to $60.4$ when used with practical…
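The abstract reports results as Jaccard error rate (JER). As a rough illustration only, the sketch below computes a simplified, frame-level JER: for each reference language, one minus the intersection-over-union of reference and hypothesis frames, averaged over languages and scaled to a percentage. It assumes both sequences use the same language labels, so no label mapping step is needed; the function names and the toy labels are hypothetical and are not taken from the paper. The second helper shows how a relative reduction is computed, and the figures quoted on this page (60.4 and 21.8) appear consistent with the reported 63.9%.

```python
def jaccard_error_rate(ref, hyp):
    """Simplified frame-level JER in percent (illustrative, not the paper's scorer).

    ref, hyp: equal-length sequences of per-frame language labels.
    Assumes ref and hyp share one label set, so no label mapping is performed.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    languages = sorted(set(ref))
    if not languages:
        raise ValueError("ref must contain at least one frame")

    errors = []
    for lang in languages:
        ref_frames = {i for i, label in enumerate(ref) if label == lang}
        hyp_frames = {i for i, label in enumerate(hyp) if label == lang}
        union = ref_frames | hyp_frames          # never empty: lang occurs in ref
        intersection = ref_frames & hyp_frames
        errors.append(1.0 - len(intersection) / len(union))

    # Average the per-language errors, then express as a percentage.
    return 100.0 * sum(errors) / len(errors)


def relative_reduction(baseline, improved):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy example with hypothetical labels: one frame near the switch is wrong.
    ref = ["hi"] * 4 + ["en"] * 4
    hyp = ["hi"] * 3 + ["en"] * 5
    print(f"JER: {jaccard_error_rate(ref, hyp):.1f}")
    print(f"Relative reduction: {relative_reduction(60.4, 21.8):.1f}%")
```

Because the score is averaged per language rather than per frame, a language that occupies only a few frames contributes as much to the result as the dominant one, which matters in code-switched speech where the secondary language is often short.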

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems