Unsupervised Speech Segmentation: A General Approach Using Speech Language Models
Avishai Elmakies, Omri Abend, Yossi Adi

TL;DR
This paper presents a novel unsupervised speech segmentation method leveraging speech language models to identify multiple acoustic-semantic style changes in speech, outperforming traditional spectral change-based methods.
Contribution
It introduces a general unsupervised approach for speech segmentation that captures diverse acoustic-semantic distinctions using speech language models, extending beyond single-style change detection.
Findings
Superior boundary detection compared to baselines
Higher segment purity achieved
Reduced over-segmentation
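For readers unfamiliar with how findings like these are usually quantified, the sketch below scores predicted boundaries against reference boundaries using a time tolerance. It is an illustration only: the function name `boundary_scores`, the greedy matching rule, the tolerance value, and the over-segmentation formula (predicted count divided by reference count, minus one) are assumptions based on common practice in the segmentation literature, not the evaluation protocol confirmed by this page.

```python
from typing import Dict, List


def boundary_scores(ref: List[float], pred: List[float], tol: float = 0.5) -> Dict[str, float]:
    """Score predicted boundaries (seconds) against reference boundaries.

    Hypothetical helper: each predicted boundary is greedily matched to the
    closest still-unmatched reference boundary within `tol` seconds.
    """
    ref = sorted(ref)
    pred = sorted(pred)
    used = [False] * len(ref)  # marks reference boundaries already matched
    hits = 0

    for p in pred:
        best_idx, best_dist = None, tol
        for i, r in enumerate(ref):
            dist = abs(p - r)
            if not used[i] and dist <= best_dist:
                best_idx, best_dist = i, dist
        if best_idx is not None:
            used[best_idx] = True
            hits += 1

    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    # Assumed definition: positive values mean more predicted than reference boundaries.
    over_seg = len(pred) / len(ref) - 1.0 if ref else 0.0

    return {
        "hits": hits,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "over_segmentation": over_seg,
    }


if __name__ == "__main__":
    # Toy boundaries in seconds, invented for illustration.
    scores = boundary_scores([1.0, 2.0, 3.0], [1.1, 2.6, 3.2, 4.0], tol=0.25)
    for name, value in scores.items():
        print(f"{name}: {value:.3f}")
```

Under this assumed definition, an over-segmentation value near zero means the method proposes roughly as many boundaries as the reference annotation, while a large positive value means it splits the audio too finely.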
Abstract
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a…
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
