Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks
Herman Kamper, Benjamin van Niekerk

TL;DR
This paper constrains pretrained self-supervised vector-quantized (VQ) neural networks to segment speech into phone-like units without supervision, reaching competitive results at lower bitrates across several speech tasks.
Contribution
It proposes two VQ-based segmentation methods, greedy merging and penalized dynamic programming, that can be applied without alteration or supervision across multiple speech tasks, outperforming some state-of-the-art approaches at lower bitrates.
Findings
Penalized dynamic programming yields the best segmentation results.
The same method performs well across diverse speech tasks without alteration.
It operates at a lower bitrate than some state-of-the-art methods.
Abstract
We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the first, features are greedily merged until a prespecified number of segments are reached. The second uses dynamic programming to optimize a squared error with a penalty term to encourage fewer but longer segments. We show that these VQ segmentation methods can be used without alteration across a wide range of tasks: unsupervised phone segmentation, ABX phone discrimination, same-different word discrimination, and as inputs to a symbolic word segmentation algorithm. The penalized dynamic…
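The penalized dynamic programming idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `segment_dp`, the constant per-segment `penalty`, and the optional `max_len` cap are hypothetical choices made here. Each candidate segment is assigned the single code that minimizes its summed squared error, and every segment adds a fixed penalty, so larger penalties favor fewer but longer segments.

```python
import numpy as np

def segment_dp(features, codebook, penalty, max_len=None):
    """Split frames into contiguous segments, one code per segment.

    Minimizes the summed squared error to the assigned codes plus
    `penalty` per segment (an assumed, constant penalty form).
    Returns a list of (start, end, code) tuples with `end` exclusive.
    """
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    T, K = len(features), len(codebook)
    max_len = max_len or T

    # dists[t, k]: squared distance between frame t and code k.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # cum[t] holds the sum of dists[:t], so the cost of assigning code k
    # to the span [start, end) is cum[end, k] - cum[start, k].
    cum = np.vstack([np.zeros((1, K)), np.cumsum(dists, axis=0)])

    best = np.full(T + 1, np.inf)   # best[t]: minimal cost of frames [0, t)
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)     # start of the last segment
    code_at = np.zeros(T + 1, dtype=int)  # code of the last segment

    for end in range(1, T + 1):
        for start in range(max(0, end - max_len), end):
            span_costs = cum[end] - cum[start]
            k = int(np.argmin(span_costs))
            total = best[start] + span_costs[k] + penalty
            if total < best[end]:
                best[end], back[end], code_at[end] = total, start, k

    # Follow the back-pointers from the final frame to recover segments.
    segments = []
    end = T
    while end > 0:
        start = int(back[end])
        segments.append((start, end, int(code_at[end])))
        end = start
    return segments[::-1]
```

With a small penalty the segmentation follows the frame-level nearest codes closely; with a large penalty, short deviating runs are absorbed into their neighbors, which lowers the number of units emitted and therefore the bitrate. The double loop is quadratic in the number of frames unless `max_len` is set.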
