Unsupervised Spoken Term Discovery Based on Re-clustering of Hypothesized Speech Segments with Siamese and Triplet Networks
Man-Ling Sung, Tan Lee

TL;DR
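This paper introduces an unsupervised spoken term discovery method that re-clusters hypothesized speech segments using Siamese and Triplet networks. The networks are trained on examples generated from initial hypothesized sequence clusters, and the method is reported to improve discovery accuracy over baseline methods.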
Contribution
The paper proposes generating training examples from initial hypothesized sequence clusters, which makes it possible to train deep similarity networks and re-cluster segments without any labelled data.
Findings
Improved spoken term discovery accuracy over baseline methods.
Effective training of Siamese/Triplet networks using hypothesized segment pairs.
Re-clustering enhances the quality of discovered speech units.
Abstract
Spoken term discovery from untranscribed speech audio can be achieved via a two-stage process. In the first stage, the unlabelled speech is decoded into a sequence of subword units that are learned and modelled in an unsupervised manner. In the second stage, partial sequence matching and clustering are performed on the decoded subword sequences, resulting in a set of discovered words or phrases. A limitation of this approach is that the results of subword decoding may be erroneous, and the errors propagate to the subsequent steps. While a Siamese/Triplet network is one approach to learning segment representations that can improve the discovery process, the challenge in spoken term discovery under a completely unsupervised scenario is that training examples are unavailable. In this paper, we propose to generate training examples from initial hypothesized sequence clusters. The…
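The idea of drawing training examples from hypothesized clusters can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how pairs or triplets are selected or which loss is used, so the function names (`make_triplets`, `triplet_loss`), the random negative sampling, the Euclidean distance, and the margin value are all hypothetical choices. Segments in the same hypothesized cluster are treated as anchor/positive pairs, and segments from other clusters as negatives.

```python
import itertools
import math
import random


def make_triplets(clusters, rng, negatives_per_pair=1):
    """Build (anchor, positive, negative) triplets from hypothesized clusters.

    clusters: dict mapping a (hypothetical) cluster id to a list of segment ids.
    rng: a random.Random instance used to sample negatives.
    """
    triplets = []
    for cid, members in clusters.items():
        # Segments from every other cluster are candidate negatives.
        others = [s for oid, segs in clusters.items() if oid != cid for s in segs]
        if len(members) < 2 or not others:
            continue  # no positive pair or no negative available
        for anchor, positive in itertools.permutations(members, 2):
            k = min(negatives_per_pair, len(others))
            for negative in rng.sample(others, k):
                triplets.append((anchor, positive, negative))
    return triplets


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss on three embedding vectors."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy usage: cluster "c0" holds two segments, cluster "c1" holds one.
clusters = {"c0": ["s1", "s2"], "c1": ["s3"]}
triplets = make_triplets(clusters, random.Random(0))
```

Because the cluster labels are themselves hypotheses, some triplets built this way will be wrong; the sketch makes no attempt to filter them.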
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
