TL;DR
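This paper introduces a two-stage approach to unsupervised acoustic unit discovery (AUD): it first learns a subword-discriminative feature representation with the help of a multilingual out-of-domain ASR system, then clusters speech segments with k-means to obtain phone-like units. It outperforms previous AUD methods on a very low-resource Mboshi corpus.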
Contribution
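It replaces the monolingual out-of-domain (OOD) ASR system used in a recent unsupervised subword modeling method with a multilingual one, which makes the learned representation more language-independent. It also compares two ways of representing variable-length speech segments as fixed-dimension vectors for segment-level k-means.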
Findings
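The proposed approach outperforms state-of-the-art AUD methods in normalized mutual information (NMI) and F-score
The multilingual ASR system improves phone boundary estimation compared with the monolingual one
A significant performance gap remains between systems that estimate phone boundaries and systems given the ground-truth boundaries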
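For readers unfamiliar with the NMI metric mentioned above, the following is a minimal sketch of how it can be computed between reference phone labels and discovered unit labels. The summary does not state which normalization the paper uses; the arithmetic-mean form 2·I(X;Y) / (H(X) + H(Y)) below is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(ref: Sequence[Hashable], hyp: Sequence[Hashable]) -> float:
    """Mutual information (in nats) between two aligned label sequences."""
    n = len(ref)
    joint = Counter(zip(ref, hyp))
    ref_counts = Counter(ref)
    hyp_counts = Counter(hyp)
    mi = 0.0
    for (r, h), c in joint.items():
        # p(r, h) * log( p(r, h) / (p(r) * p(h)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (ref_counts[r] * hyp_counts[h]))
    return mi


def nmi(ref: Sequence[Hashable], hyp: Sequence[Hashable]) -> float:
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    if len(ref) != len(hyp):
        raise ValueError("label sequences must have the same length")
    denom = entropy(ref) + entropy(hyp)
    if denom == 0.0:
        # Both labelings are constant; NMI is undefined, so return 0 here.
        return 0.0
    return 2.0 * mutual_information(ref, hyp) / denom
```

NMI is invariant to how clusters are numbered, so a discovered unit inventory does not need to share identifiers with the reference phone set.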
Abstract
This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. In the first stage, a recently proposed method in the task of unsupervised subword modeling is improved by replacing a monolingual out-of-domain (OOD) ASR system with a multilingual one to create a subword-discriminative representation that is more language-independent. In the second stage, segment-level k-means is adopted, and two methods to represent the variable-length speech segments as fixed-dimension feature vectors are compared. Experiments on a very low-resource Mboshi language corpus…
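The second stage described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the visible text does not name the two segment representations that were compared, so mean pooling and evenly spaced frame sampling are assumed here as two plausible choices, and all function names are hypothetical. Each segment is a list of frame-level feature vectors with boundaries already given.

```python
import random
from typing import Callable, List, Sequence

Frame = List[float]


def mean_pool(segment: Sequence[Frame]) -> Frame:
    """Average all frames of a segment into one fixed-dimension vector."""
    dim = len(segment[0])
    return [sum(f[d] for f in segment) / len(segment) for d in range(dim)]


def sample_concat(segment: Sequence[Frame], n_points: int = 3) -> Frame:
    """Concatenate n_points frames taken at evenly spaced positions.

    Requires n_points >= 2; short segments simply repeat frames.
    """
    last = len(segment) - 1
    idx = [round(i * last / (n_points - 1)) for i in range(n_points)]
    return [x for i in idx for x in segment[i]]


def sq_dist(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Frame], k: int, n_iter: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster index per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_pool(members)
    return labels


def cluster_segments(
    segments: Sequence[Sequence[Frame]],
    k: int,
    embed: Callable[[Sequence[Frame]], Frame] = mean_pool,
) -> List[int]:
    """Embed each variable-length segment, then cluster the embeddings."""
    return kmeans([embed(seg) for seg in segments], k)
```

Calling `cluster_segments(segments, k, embed=sample_concat)` switches the representation while keeping the clustering step unchanged, which is the kind of comparison the abstract describes. Mean pooling discards the temporal order of frames within a segment, whereas sampling and concatenating preserves a coarse version of it at the cost of a larger vector.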
