Unsupervised word segmentation and lexicon discovery using acoustic word embeddings
Herman Kamper, Aren Jansen, Sharon Goldwater

TL;DR
This paper introduces an unsupervised Bayesian model that segments speech and discovers word groupings directly from audio, enabling tokenization without transcriptions or predefined vocabularies.
Contribution
The paper presents a Bayesian approach, based on acoustic word embeddings, to unsupervised speech segmentation and lexicon discovery; it outperforms a previous HMM-based system.
Findings
Achieves a word error rate of around 20% on a digit recognition task
Outperforms a previous HMM-based system by about 10% absolute (percentage points of word error rate)
Does not require a pre-specified vocabulary size
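As a reminder of what the figures above measure, word error rate is the word-level edit distance between a reference transcription and the system output, divided by the number of reference words. The sketch below is a generic illustration, not the paper's evaluation code; the function name `word_error_rate` and the example digit strings are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over five words.
wer = word_error_rate("one two three four five", "one two tree four")
```

Read this way, an improvement of "about 10% absolute" at around 20% suggests a baseline near 30%, which would be a relative reduction of roughly one third.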
Abstract
In settings where only unlabelled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text. A similar problem is faced when modelling infant language acquisition. In these cases, categorical linguistic structure needs to be discovered directly from speech audio. We present a novel unsupervised Bayesian model that segments unlabelled speech and clusters the segments into hypothesized word groupings. The result is a complete unsupervised tokenization of the input speech in terms of discovered word types. In our approach, a potential word segment (of arbitrary length) is embedded in a fixed-dimensional acoustic vector space. The model, implemented as a Gibbs sampler, then builds a whole-word acoustic model in this space while jointly performing segmentation. We report word error rates in a…
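To illustrate what it means to embed a segment of arbitrary length in a fixed-dimensional space, the sketch below uses uniform downsampling: it picks a fixed number of evenly spaced frames and concatenates them. This is an assumed stand-in chosen for simplicity, not necessarily the embedding method used in the paper; `embed_segment`, `n_samples`, and the toy frames are hypothetical.

```python
def embed_segment(frames, n_samples=10):
    """Map a variable-length list of feature frames to one fixed-length vector.

    frames: list of equal-length lists of floats (hypothetical acoustic frames).
    Picks n_samples frames at evenly spaced positions and concatenates them.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    last = len(frames) - 1
    embedding = []
    for k in range(n_samples):
        # Evenly spaced position in [0, last], rounded to the nearest frame.
        idx = round(k * last / (n_samples - 1)) if n_samples > 1 else 0
        embedding.extend(frames[idx])
    return embedding


# Two toy segments of different durations, 2 features per frame.
short_seg = [[float(t), 0.5] for t in range(4)]
long_seg = [[float(t), 0.5] for t in range(37)]

# Both map to vectors of the same dimensionality (10 frames x 2 features).
assert len(embed_segment(short_seg)) == len(embed_segment(long_seg)) == 20
```

Because every candidate segment ends up with the same dimensionality, segments of different durations can be compared and clustered directly, which is what allows a whole-word acoustic model to be built in the embedding space.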
