Unsupervised Discovery of Structured Acoustic Tokens with Applications to Spoken Term Detection
Cheng-Tao Chung, Lin-Shan Lee

TL;DR
This paper compares two unsupervised methods for discovering structured acoustic tokens from speech data, unifies them theoretically, and demonstrates their effectiveness in spoken term detection tasks.
Contribution
It unifies the Multigranular and Hierarchical paradigms for acoustic token discovery within a single theoretical framework and proposes the Enhanced Relevance Score (ERS), which improves both paradigms on query-by-example spoken term detection.
Findings
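Acoustic tokens from both paradigms are competitive on Query-by-Example Spoken Term Detection (QbE-STD) on the QUESST dataset of MediaEval 2015.
The Enhanced Relevance Score (ERS) improves both paradigms on QbE-STD.
ABX evaluation results from the Zero Resource Challenge 2015 are reported to compare the two paradigms.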
Abstract
In this paper, we compare two paradigms for unsupervised discovery of structured acoustic tokens directly from speech corpora without any human annotation. The Multigranular Paradigm seeks to capture all available information in the corpora with multiple sets of tokens for different model granularities. The Hierarchical Paradigm attempts to jointly learn several levels of signal representations in a hierarchical structure. The two paradigms are unified within a theoretical framework in this paper. Query-by-Example Spoken Term Detection (QbE-STD) experiments on the QUESST dataset of MediaEval 2015 verify the competitiveness of the acoustic tokens. The Enhanced Relevance Score (ERS) proposed in this work improves both paradigms for the task of QbE-STD. We also list results on the ABX evaluation task of the Zero Resource Challenge 2015 for comparison of the two paradigms.
