Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings
Badr M. Abdullah, Bernd M\"obius, Dietrich Klakow

TL;DR
This paper introduces a multi-task learning model for acoustic word embeddings that integrates high-level lexical knowledge, improving the discriminability and lexical category separation of the embeddings across three languages.
Contribution
The paper presents a novel multi-task learning approach that combines acoustic cues with lexical knowledge during training of AWEs, which was not previously explored.
Findings
Enhanced discriminability of embeddings
Better separation of lexical categories
Improved performance across three languages
Abstract
Models of acoustic word embeddings (AWEs) learn to map variable-length spoken word segments onto fixed-dimensionality vector representations such that different acoustic exemplars of the same word are projected nearby in the embedding space. In addition to their speech technology applications, AWE models have been shown to predict human performance on a variety of auditory lexical processing tasks. Current AWE models are based on neural networks and trained in a bottom-up approach that integrates acoustic cues to build up a word representation given an acoustic or symbolic supervision signal. Therefore, these models do not leverage or capture high-level lexical knowledge during the learning process. In this paper, we propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of AWEs. Our model learns a mapping between the acoustic input…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
