# An embedded segmental K-means model for unsupervised segmentation and clustering of speech

**Authors:** Herman Kamper, Karen Livescu, Sharon Goldwater

arXiv: 1703.08135 · 2017-09-06

## TL;DR

This paper introduces ES-KMeans, an approximation to a recent Bayesian model for unsupervised speech segmentation and clustering. It replaces full Bayesian inference with hard clustering and segmentation, runs about 5 times faster at similar word segmentation scores, and scales to corpora of up to 45 hours.

## Contribution

The paper presents ES-KMeans, an embedded segmental K-means model that keeps a clear objective function but uses hard clustering and segmentation instead of full Bayesian inference, which makes it more efficient while giving similar word segmentation scores.

## Key findings

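- On English and Xitsonga data sets (5 and 2.5 hours of speech), ES-KMeans outperforms a leading heuristic method in word segmentation.
- Its word segmentation scores are similar to those of the Bayesian model, while it is 5 times faster and has fewer hyperparameters. Its clusters, however, are less pure than those of the other models.
- It scales to the 5 languages of the Zero Resource Speech Challenge 2017 (up to 45 hours), where it performs competitively compared to the challenge baseline.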

## Abstract

Unsupervised segmentation and clustering of unlabelled speech are core problems in zero-resource speech processing. Most approaches lie at methodological extremes: some use probabilistic Bayesian models with convergence guarantees, while others opt for more efficient heuristic techniques. Despite competitive performance in previous work, the full Bayesian approach is difficult to scale to large speech corpora. We introduce an approximation to a recent Bayesian model that still has a clear objective function but improves efficiency by using hard clustering and segmentation rather than full Bayesian inference. Like its Bayesian counterpart, this embedded segmental K-means model (ES-KMeans) represents arbitrary-length word segments as fixed-dimensional acoustic word embeddings. We first compare ES-KMeans to previous approaches on common English and Xitsonga data sets (5 and 2.5 hours of speech): ES-KMeans outperforms a leading heuristic method in word segmentation, giving similar scores to the Bayesian model while being 5 times faster with fewer hyperparameters. However, its clusters are less pure than those of the other models. We then show that ES-KMeans scales to larger corpora by applying it to the 5 languages of the Zero Resource Speech Challenge 2017 (up to 45 hours), where it performs competitively compared to the challenge baseline.
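The alternation between hard segmentation and hard clustering described in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and the abstract does not specify these details, so the following are assumptions made for illustration: segments are embedded by uniform downsampling of their frames, the objective is a duration-weighted squared Euclidean distance to the nearest cluster mean, segmentation uses dynamic programming with a maximum segment length, and means are updated as duration-weighted averages. The names `embed`, `segment`, and `es_kmeans_sketch` are hypothetical.

```python
import numpy as np


def embed(frames, n_samples=3):
    """Map a (T, D) segment to a fixed-dimensional vector.

    Assumption: uniform downsampling to n_samples frames, then flattening.
    """
    idx = np.linspace(0, len(frames) - 1, n_samples).round().astype(int)
    return frames[idx].ravel()


def segment(utt, means, max_len, n_samples=3):
    """Dynamic programming over boundaries for fixed cluster means.

    Each candidate segment costs (duration) * (squared distance to the
    nearest mean). Returns [(start, end, cluster), ...] and the total cost.
    """
    T = len(utt)
    best = np.full(T + 1, np.inf)
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    label = np.zeros(T + 1, dtype=int)
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            x = embed(utt[s:t], n_samples)
            d = ((means - x) ** 2).sum(axis=1)
            k = int(d.argmin())
            cost = best[s] + (t - s) * d[k]
            if cost < best[t]:
                best[t], back[t], label[t] = cost, s, k
    segs, t = [], T
    while t > 0:  # trace the best boundaries back from the end
        segs.append((int(back[t]), t, int(label[t])))
        t = int(back[t])
    return segs[::-1], float(best[T])


def es_kmeans_sketch(utts, n_clusters, max_len=6, n_iters=5, n_samples=3, seed=0):
    """Alternate hard segmentation and duration-weighted mean updates."""
    rng = np.random.default_rng(seed)
    # Initialise each mean from the embedding of a random segment.
    init = []
    for _ in range(n_clusters):
        utt = utts[rng.integers(len(utts))]
        length = int(rng.integers(1, min(max_len, len(utt)) + 1))
        start = int(rng.integers(0, len(utt) - length + 1))
        init.append(embed(utt[start:start + length], n_samples))
    means = np.stack(init).astype(float)

    history, all_segs = [], []
    for _ in range(n_iters):
        sums = np.zeros_like(means)
        weights = np.zeros(n_clusters)
        all_segs, total = [], 0.0
        for utt in utts:
            segs, cost = segment(utt, means, max_len, n_samples)
            all_segs.append(segs)
            total += cost
            for s, t, k in segs:
                sums[k] += (t - s) * embed(utt[s:t], n_samples)
                weights[k] += t - s
        history.append(total)
        used = weights > 0  # empty clusters keep their previous mean
        means[used] = sums[used] / weights[used, None]
    return all_segs, means, history
```

Calling `es_kmeans_sketch(utts, n_clusters=4)` on a list of `(T, D)` float arrays returns the per-utterance segments, the cluster means, and the objective value after each segmentation pass. Under the assumptions above, both steps can only lower the objective, so that recorded value should not increase from one iteration to the next.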

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.08135/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1703.08135/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/1703.08135/full.md

---
Source: https://tomesphere.com/paper/1703.08135