Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

Yi-Chen Chen; Sung-Feng Huang; Chia-Hao Shen; Hung-yi Lee; Lin-shan Lee

arXiv:1807.08089 · cs.CL · January 23, 2019

TL;DR

This paper introduces a two-stage phonetic-and-semantic embedding framework for spoken words, enhancing spoken content retrieval by capturing both phonetic and semantic features while disentangling speaker characteristics.

Contribution

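It proposes a two-stage embedding method for spoken words: Stage 1 learns phonetic embeddings with speaker characteristics disentangled, and Stage 2 adds semantic embedding on top. The resulting audio embeddings are evaluated by parallelizing them with text embeddings.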

Findings

01

Improved spoken document retrieval using combined phonetic and semantic embeddings.

02

Demonstrated that embeddings can retrieve semantically related spoken content.

03

Phonetic and semantic features can be disentangled and effectively used in retrieval tasks.
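
To make the retrieval idea in the findings concrete, here is a minimal sketch of ranking spoken-word embeddings against a query by cosine similarity. It is not the paper's implementation: the random vectors stand in for real embeddings, the 8-dimensional size is arbitrary, concatenating a phonetic and a semantic vector is only one assumed way of combining them, and `cosine_rank` is a hypothetical helper name.

```python
import numpy as np


def cosine_rank(query, candidates):
    """Rank candidate rows by descending cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity of each candidate to the query
    order = np.argsort(-scores)    # indices of candidates, best match first
    return order, scores


# Placeholder data (assumption): 5 spoken words, each with an 8-dim phonetic
# vector and an 8-dim semantic vector. Real embeddings would come from a model.
rng = np.random.default_rng(0)
phonetic = rng.normal(size=(5, 8))
semantic = rng.normal(size=(5, 8))

# Assumed combination scheme: plain concatenation of the two vectors.
combined = np.concatenate([phonetic, semantic], axis=1)

# Query: a slightly perturbed copy of word 2, imitating another utterance of it.
query = combined[2] + 0.01 * rng.normal(size=combined.shape[1])

order, scores = cosine_rank(query, combined)
print(order[0])
```

With these placeholder vectors the top-ranked candidate is word 2, the one the query was derived from. Weighting the phonetic and semantic halves differently before concatenation would shift the ranking toward words that sound alike or words that mean similar things.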

Abstract

Word embedding or Word2Vec has been successful in offering semantics for text words learned from the context of words. Audio Word2Vec was shown to offer phonetic structures for spoken words (signal segments for words) learned from signals within spoken words. This paper proposes a two-stage framework to perform phonetic-and-semantic embedding on spoken words considering the context of the spoken words. Stage 1 performs phonetic embedding with speaker characteristics disentangled. Stage 2 then performs semantic embedding in addition. We further propose to evaluate the phonetic-and-semantic nature of the audio embeddings obtained in Stage 2 by parallelizing with text embeddings. In general, phonetic structure and semantics inevitably disturb each other. For example, the words "brother" and "sister" are close in semantics but very different in phonetic structure, while the words "brother"…

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Music and Audio Processing