Asymmetric Proxy Loss for Multi-View Acoustic Word Embeddings
Myunghun Jung, Hoirin Kim

TL;DR
This paper introduces an asymmetric-proxy loss for multi-view acoustic word embeddings, improving word discrimination by leveraging a proxy-based deep metric learning framework that considers asymmetric relationships.
Contribution
It proposes a novel asymmetric-proxy loss within a proxy-based framework for multi-view acoustic word embeddings, enhancing discriminative power in speech representation learning.
Findings
The proposed asymmetric-proxy loss outperforms existing proxy-based losses.
The method improves word discrimination accuracy on the WSJ corpus.
Experimental results validate the effectiveness of the new loss function.
Abstract
Acoustic word embeddings (AWEs) are discriminative representations of speech segments, and the learned embedding space reflects the phonetic similarity between words. With multi-view learning, where text labels are considered as supplementary input, AWEs are jointly trained with acoustically grounded word embeddings (AGWEs). In this paper, we expand the multi-view approach into a proxy-based framework for deep metric learning by equating AGWEs with proxies. A simple modification in computing the similarity matrix allows the general pair weighting to formulate the data-to-proxy relationship. Under the systematized framework, we propose an asymmetric-proxy loss that combines different parts of loss functions asymmetrically while keeping their merits. It follows the assumptions that the optimal function for anchor-positive pairs may differ from the one for anchor-negative pairs, and a proxy may…
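To make the data-to-proxy idea in the abstract concrete, the following is a minimal sketch, not the paper's actual loss, whose exact form is not given in this excerpt. It assumes cosine similarity between each AWE and every proxy (the AGWE of each word class), and it illustrates asymmetry by using a per-pair softplus term for anchor-positive pairs and a log-sum-exp term for anchor-negative pairs. The function names, the scale `alpha`, and the `margin` are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(awes, proxies):
    """Data-to-proxy similarities: rows are AWEs, columns are proxies (AGWEs)."""
    return [[cosine(x, p) for p in proxies] for x in awes]

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def asymmetric_proxy_loss_sketch(awes, labels, proxies, alpha=16.0, margin=0.5):
    """Illustrative asymmetric loss (assumed form, not the paper's).

    Positive part: softplus on the single anchor-positive similarity.
    Negative part: log(1 + sum of exponentials) over all negative proxies.
    """
    sims = similarity_matrix(awes, proxies)
    pos_total, neg_total = 0.0, 0.0
    for row, label in zip(sims, labels):
        # Pull the AWE toward the proxy of its own word class.
        pos_total += softplus(-alpha * (row[label] - margin))
        # Push the AWE away from every other proxy, pooled together.
        neg_sum = sum(math.exp(alpha * (s - margin))
                      for j, s in enumerate(row) if j != label)
        neg_total += math.log1p(neg_sum)
    n = len(awes)
    return pos_total / n + neg_total / n

# Toy example: two word classes with orthogonal proxies.
proxies = [[1.0, 0.0], [0.0, 1.0]]
awes = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = asymmetric_proxy_loss_sketch(awes, [0, 1], proxies)
loss_swapped = asymmetric_proxy_loss_sketch(awes, [1, 0], proxies)
```

In this toy setup the loss should be small when each AWE sits on its own class proxy and large when the labels are swapped, which is the behavior any proxy-based objective of this kind is meant to have.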
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
