Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

Herman Kamper; Benjamin van Niekerk

arXiv:2012.07551·cs.CL·June 14, 2021

TL;DR

This paper constrains pretrained self-supervised vector-quantized neural networks so that they segment speech into phone-like units without supervision, achieving competitive results at lower bitrates across several speech tasks.

Contribution

It proposes a new VQ-based segmentation method that works across multiple speech tasks without supervision, outperforming some state-of-the-art approaches at lower bitrates.

Findings

01

Penalized dynamic programming yields the best segmentation results.

02

Method performs well across diverse speech tasks.

03

Remains competitive with some state-of-the-art methods while operating at a lower bitrate.

Abstract

We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the first, features are greedily merged until a prespecified number of segments is reached. The second uses dynamic programming to optimize a squared error with a penalty term to encourage fewer but longer segments. We show that these VQ segmentation methods can be used without alteration across a wide range of tasks: unsupervised phone segmentation, ABX phone discrimination, same-different word discrimination, and as inputs to a symbolic word segmentation algorithm. The penalized dynamic…
