Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings

Myunghun Jung, Hyungjun Lim, Jahyun Goo, Youngmoon Jung, and Hoirin Kim

arXiv:1910.00341 · eess.AS · October 2, 2019 · 1 cites

TL;DR

This paper adds a shared decoder to Siamese multi-view encoders for learning acoustic word embeddings, reporting an 11.1% relative improvement in average precision on word discrimination with the WSJ dataset.

Contribution

It proposes a shared decoder network integrated with Siamese multi-view encoders, enhancing the relationship between acoustic and text embeddings for better discrimination.

Findings

01. 11.1% relative improvement in average precision on the WSJ dataset

02. Effective in cross-view word discrimination tasks

03. Improved performance in word-level speech recognition

Abstract

Acoustic word embeddings (fixed-dimensional vector representations of arbitrary-length words) have attracted increasing interest in query-by-example spoken term detection. Recently, based on the fact that the orthography of text labels partly reflects the phonetic similarity between the words' pronunciations, a multi-view approach has been introduced that jointly learns acoustic and text embeddings. It showed that discriminative embeddings can be learned by designing an objective that takes text labels as well as word segments. In this paper, we propose a network architecture that expands the multi-view approach by combining the Siamese multi-view encoders with a shared decoder network to maximize the effect of the relationship between acoustic and text embeddings in embedding space. Discriminatively trained with multi-view triplet loss and decoding loss, our proposed approach…
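The training objective named in the abstract, a multi-view triplet loss combined with a decoding loss, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the use of cosine distance, the margin value `0.4`, the `weight` that balances the two terms, and the form of the decoding loss (a mean negative log-likelihood over target characters) are not given in the text above.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def multiview_triplet_loss(acoustic, text_pos, text_neg, margin=0.4):
    """Hinge loss over one cross-view triplet.

    The acoustic embedding should be closer to the text embedding of its own
    label (text_pos) than to that of a different label (text_neg) by at least
    `margin`. The margin value is an assumption.
    """
    d_pos = cosine_distance(acoustic, text_pos)
    d_neg = cosine_distance(acoustic, text_neg)
    return max(0.0, margin + d_pos - d_neg)


def decoding_loss(char_probs, target_indices):
    """Mean negative log-likelihood of the target characters.

    `char_probs[i]` is a hypothetical decoder's probability distribution at
    step i; this form of the decoding loss is an assumption.
    """
    nll = -sum(math.log(p[t]) for p, t in zip(char_probs, target_indices))
    return nll / len(target_indices)


def total_loss(acoustic, text_pos, text_neg, char_probs, target_indices,
               margin=0.4, weight=1.0):
    """Triplet term plus a weighted decoding term (the weight is assumed)."""
    triplet = multiview_triplet_loss(acoustic, text_pos, text_neg, margin)
    return triplet + weight * decoding_loss(char_probs, target_indices)
```

In this sketch a triplet that is already separated by more than the margin contributes nothing, so only the decoding term remains; a violated triplet contributes the margin plus the distance gap.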

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques

Methods: Triplet Loss