Neural approaches to spoken content embedding

Shane Settle

arXiv:2308.14905 · cs.CL · August 30, 2023 · 1 citation

Open Access

TL;DR

This paper introduces novel neural acoustic word embedding methods, including multilingual training, that improve speech segment comparison and downstream task performance, surpassing traditional algorithms and complementing self-supervised models.

Contribution

It presents new RNN-based discriminative and acoustically grounded embedding models, enhancing training efficiency and performance across multiple languages and tasks.

Findings

1. Multilingual training improves embedding quality with limited labeled data.

2. Proposed models outperform traditional dynamic programming methods.

3. Embeddings effectively support speech search and recognition tasks.

Abstract

Comparing spoken segments is a central operation in speech processing. Traditional approaches in this area have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they are limited in performance and efficiency. As an alternative, acoustic word embeddings -- fixed-dimensional vector representations of variable-length spoken word segments -- have begun to be considered for such tasks as well. However, the current space of such discriminative embedding models, training approaches, and their application to real-world downstream tasks is limited. We start by considering "single-view" training losses where the goal is to learn an acoustic word embedding model that separates same-word and different-word spoken segment pairs. Then, we consider "multi-view" contrastive losses. In this setting, acoustic word embeddings are…
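
To make the contrast in the abstract concrete, here is a minimal sketch of the two ways of comparing segments. `dtw_distance` is a textbook dynamic time warping recurrence over frame-level features. `triplet_hinge_loss` is one common way to write a "single-view" objective that pulls same-word embeddings together and pushes different-word embeddings apart. The cosine distance, the margin value, and all function names are assumptions for illustration; the abstract does not specify the paper's exact loss or embedding model.

```python
import numpy as np


def dtw_distance(x, y):
    """Accumulated DTW alignment cost between sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    # Euclidean distance between every frame of x and every frame of y.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor alignments.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m])


def cosine_distance(u, v):
    """One minus the cosine similarity of two fixed-dimensional embeddings."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def triplet_hinge_loss(anchor, same, diff, margin=0.5):
    """Hypothetical single-view loss: zero once the different-word embedding
    is at least `margin` farther from the anchor than the same-word one."""
    return max(
        0.0, margin + cosine_distance(anchor, same) - cosine_distance(anchor, diff)
    )
```

DTW costs time proportional to the product of the two segment lengths for every pair compared, whereas embeddings are computed once per segment and then compared with a single vector operation, which is the efficiency argument the abstract alludes to.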

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques