Implicit Self-supervised Language Representation for Spoken Language Diarization
Jagabandhu Mishra, S. R. Mahadeva Prasanna

TL;DR
This paper introduces a self-supervised implicit language representation for spoken language diarization in code-switched scenarios, outperforming traditional x-vector methods especially in low-resource settings.
Contribution
It proposes a novel self-supervised implicit language representation that improves diarization accuracy over x-vector based methods in code-switched speech.
Findings
Self-supervised representation reduces JER by 63.9% compared to x-vector.
E2E framework with the new representation achieves a JER of 21.8.
Performance varies with dataset and segment duration, highlighting the importance of representation choice.
Abstract
In a code-switched (CS) scenario, the use of spoken language diarization (LD) as a pre-processing system is essential. Further, implicit frameworks are preferable to explicit frameworks, as they can be easily adapted to deal with low/zero-resource languages. Inspired by the speaker diarization (SD) literature, three frameworks based on (1) fixed segmentation, (2) change point-based segmentation and (3) E2E are proposed to perform LD. The initial exploration with the synthetic TTSF-LD dataset shows that using x-vector as the implicit language representation with an appropriate analysis window length can achieve performance on par with explicit LD. The best implicit LD performance in terms of Jaccard error rate (JER) is achieved by using the E2E framework. However, with the E2E framework, the performance of implicit LD degrades when used with practical…
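As a rough illustration of the Jaccard error rate (JER) used to report results here, the sketch below scores frame-level language labels. It is not the authors' scoring code. It assumes that reference and hypothesis share the same language label set (so no label mapping is needed, unlike in speaker diarization), that both sequences are aligned frame by frame, and that the score is averaged over the languages present in the reference and reported as a percentage. The function name `jaccard_error_rate` is hypothetical.

```python
from typing import Sequence


def jaccard_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Frame-level JER in percent, averaged over reference languages.

    Assumption: ref and hyp are frame-aligned label sequences that use
    the same language names, so no optimal label mapping is required.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    if not ref:
        raise ValueError("ref must contain at least one frame")

    errors = []
    for lang in sorted(set(ref)):
        # Frame indices assigned to this language in each sequence.
        ref_frames = {i for i, label in enumerate(ref) if label == lang}
        hyp_frames = {i for i, label in enumerate(hyp) if label == lang}
        intersection = ref_frames & hyp_frames
        union = ref_frames | hyp_frames  # non-empty: lang occurs in ref
        # Per-language error is one minus the Jaccard index.
        errors.append(1.0 - len(intersection) / len(union))

    return 100.0 * sum(errors) / len(errors)
```

For example, with a reference of four frames of one language followed by four of another, a hypothesis that places the language change one frame too early gives per-language errors of 1/4 and 1/5, so the averaged JER is about 22.5. Lower is better, and a perfect hypothesis scores 0.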
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
