Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

Sung-Feng Huang; Yi-Chen Chen; Hung-yi Lee; Lin-shan Lee

arXiv:1811.02775 · cs.CL · November 8, 2018 · 5 cites

Open Access

TL;DR

This paper introduces an adjacency-based clustering method for audio embeddings, improving the grouping of similar linguistic units in unsupervised settings, with applications demonstrated in spoken term detection.

Contribution

It proposes novel adjacency-based clustering approaches inspired by Siamese networks to enhance audio embedding compactness without labeled data.

Findings

01 Improved clustering of audio embeddings on the LibriSpeech dataset

02 Enhanced spoken term detection performance

03 Effective disentangling of speaker characteristics from embeddings

Abstract

Embedding audio signal segments into fixed-dimensional vectors is attractive because it makes all subsequent processing, such as modeling, classifying, or indexing, easier and more efficient. The previously proposed Audio Word2Vec was shown to represent audio segments of spoken words as such vectors, which carry information about the phonetic structure of the segments. However, each linguistic unit (a word, syllable, or phoneme in text form) corresponds to an unlimited number of audio segments, whose vector representations are inevitably spread over the embedding space, which causes some confusion. It is therefore desirable to cluster the audio embeddings better, so that those corresponding to the same linguistic unit are more compactly distributed. In this paper, inspired by Siamese networks, we propose several approaches to achieve this goal. This includes identifying…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing