TL;DR
This paper introduces a segmental Bayesian framework for fully unsupervised large-vocabulary speech recognition, applies it to multi-speaker data, and compares it to state-of-the-art baselines. Error rates remain high.
Contribution
It presents what the authors describe as the first attempt they are aware of to apply a full-coverage unsupervised speech recognition system to large-vocabulary multi-speaker data. The system combines a Bayesian model with fixed-dimensional segmental acoustic embeddings of word segments.
Findings
Outperforms bottom-up syllable-based approaches in segmentation and clustering.
Word error rates remain high (~70-95%), highlighting the difficulty of the task.
Discovered clusters have greater coverage but lower purity than term discovery systems.
Abstract
Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early term discovery systems focused on identifying isolated recurring patterns in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units---effectively performing unsupervised speech recognition. This article presents the first attempt we are aware of to apply such a system to large-vocabulary multi-speaker data. Our system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a…
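The abstract describes mapping a variable-length sequence of feature frames to a single fixed-dimensional embedding vector. As a rough illustration only, the sketch below does this by picking a fixed number of evenly spaced frames and concatenating them. The function name `downsample_embed`, the parameter `n_samples`, and the choice of uniform downsampling are assumptions made for this example; the abstract does not specify the embedding method.

```python
from typing import List, Sequence


def downsample_embed(frames: Sequence[Sequence[float]], n_samples: int = 10) -> List[float]:
    """Map a variable-length frame sequence to a fixed-dimensional vector.

    Hypothetical helper (not taken from the paper): picks n_samples frames at
    evenly spaced positions and concatenates them, so the output always has
    n_samples * frame_dim values regardless of the segment's duration.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")

    n_frames = len(frames)
    embedding: List[float] = []
    for k in range(n_samples):
        # Evenly spaced index in [0, n_frames - 1]; short segments repeat frames.
        if n_samples > 1:
            idx = round(k * (n_frames - 1) / (n_samples - 1))
        else:
            idx = 0
        embedding.extend(frames[idx])
    return embedding


if __name__ == "__main__":
    # Two toy segments with different durations but the same frame dimension (13).
    short_segment = [[0.1 * t] * 13 for t in range(4)]
    long_segment = [[0.1 * t] * 13 for t in range(57)]
    print(len(downsample_embed(short_segment)), len(downsample_embed(long_segment)))
```

Because every segment ends up in the same vector space, segments of different durations can be compared and clustered directly, which is the property a segmental model of this kind relies on.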
