Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection
Sung-Feng Huang, Yi-Chen Chen, Hung-yi Lee, Lin-shan Lee

TL;DR
This paper introduces an adjacency-based clustering method for audio embeddings, improving the grouping of similar linguistic units in unsupervised settings, with applications demonstrated in spoken term detection.
Contribution
The paper proposes adjacency-based clustering approaches, inspired by Siamese networks, that make audio embeddings more compact without labeled data.
Findings
Improved clustering of audio embeddings on the LibriSpeech dataset
Enhanced spoken term detection performance
Effective disentangling of speaker characteristics from embeddings
Abstract
Embedding audio signal segments into vectors with fixed dimensionality is attractive because all subsequent processing, such as modeling, classifying or indexing, becomes easier and more efficient. The previously proposed Audio Word2Vec was shown to be able to represent audio segments for spoken words as such vectors, carrying information about the phonetic structures of the signal segments. However, each linguistic unit (word, syllable, phoneme in text form) corresponds to an unlimited number of audio segments, with vector representations inevitably spread over the embedding space, which causes some confusion. It is therefore desirable to better cluster the audio embeddings so that those corresponding to the same linguistic unit are more compactly distributed. In this paper, inspired by Siamese networks, we propose some approaches to achieve the above goal. This includes identifying…
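The abstract is truncated before it describes the actual procedure, so the following is only a generic illustration of the Siamese-style idea it mentions, not the authors' method. It shows a standard contrastive loss that pulls paired embeddings together and pushes unpaired ones apart, plus a nearest-neighbor pairing heuristic. The function names `contrastive_loss` and `nearest_neighbor_pairs`, the `margin` parameter, and the use of Euclidean nearest neighbors as the "adjacency" criterion are all assumptions made for this sketch.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Generic Siamese contrastive loss (an illustrative stand-in, not the paper's loss).

    emb_a, emb_b: arrays of shape (n, d), one embedding pair per row.
    same: array of shape (n,), 1 if the pair is treated as the same unit, else 0.
    """
    dist = np.linalg.norm(emb_a - emb_b, axis=1)
    # Pairs treated as the same unit are penalized by their squared distance.
    pull = same * dist ** 2
    # Pairs treated as different are penalized only when closer than the margin.
    push = (1 - same) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pull + push))


def nearest_neighbor_pairs(embeddings):
    """Hypothetical adjacency heuristic: index of each vector's nearest neighbor."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Exclude each vector from being its own neighbor.
    np.fill_diagonal(dists, np.inf)
    return np.argmin(dists, axis=1)
```

In an unsupervised setting, one plausible use would be to treat each embedding and its nearest neighbor as a "same unit" pair and minimize the loss over those pairs; whether the paper selects pairs this way cannot be confirmed from the truncated abstract.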
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
