Open vocabulary keyword spotting through transfer learning from speech synthesis

Kesavaraj V; Anil Kumar Vuppala

arXiv:2404.03914·cs.HC·April 19, 2024·1 cites

TL;DR

This paper introduces a transfer learning framework from text-to-speech systems to improve open-vocabulary keyword spotting, addressing modality mismatch issues and enhancing robustness across datasets and OOV scenarios.

Contribution

The novel approach leverages TTS knowledge transfer to align audio and text representations, significantly improving keyword spotting performance over existing methods.

Findings

01 Outperforms baseline methods on multiple datasets.

02 Achieves 8.22% higher AUC on LibriPhrase Hard.

03 Demonstrates robustness across different word lengths and OOV scenarios.
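
The AUC figure quoted above is the area under the ROC curve. As a rough illustration of what that metric measures, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the matching pair gets the higher score. The function name `roc_auc` and the toy scores are assumptions made for illustration; the paper's actual evaluation protocol is not described on this page.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: similarity scores, higher means "keyword present".
    labels: 1 for a matching audio/text pair, 0 for a non-matching pair.
    Hypothetical helper for illustration, not the paper's evaluation code.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: two matching pairs and two non-matching pairs.
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = [1, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

An AUC of 1.0 means every matching pair outscores every non-matching pair, and 0.5 corresponds to chance-level ranking.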

Abstract

Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open vocabulary keyword spotting depend on a shared embedding space created by audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages knowledge acquired from a pre-trained text-to-speech (TTS) system. This knowledge transfer allows for the incorporation of awareness of audio projections into the text representations derived from the text encoder. The performance of the proposed approach is compared with various baseline methods across four different datasets. The robustness of our proposed model is evaluated by assessing its performance across different word lengths and in an Out-of-Vocabulary (OOV) scenario. Additionally,…
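
The shared embedding space mentioned in the abstract can be pictured with a minimal sketch: an audio encoder and a text encoder each produce a vector, and a keyword is accepted when the two vectors are close. The code below assumes cosine similarity with a fixed threshold and uses hand-written toy vectors in place of real encoder outputs. The names `cosine_similarity` and `detect_keyword`, the threshold value, and the choice of similarity measure are all assumptions for illustration; the abstract does not specify how the paper scores a match.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def detect_keyword(audio_embedding, text_embedding, threshold=0.5):
    """Return (score, decision) for one audio/text pair.

    The threshold is a hypothetical operating point, not a value from the paper.
    """
    score = cosine_similarity(audio_embedding, text_embedding)
    return score, score >= threshold


# Toy stand-ins for encoder outputs (assumed, not real embeddings).
audio_vec = [0.9, 0.1, 0.4]            # spoken utterance
matching_text_vec = [0.8, 0.2, 0.5]    # enrolled keyword that was spoken
other_text_vec = [-0.7, 0.6, 0.1]      # a different enrolled keyword

match_score, match_hit = detect_keyword(audio_vec, matching_text_vec)
other_score, other_hit = detect_keyword(audio_vec, other_text_vec)
```

In this picture, the audio-text mismatch the abstract describes would show up as matching pairs landing far apart in the space, which is the gap the TTS knowledge transfer is meant to narrow.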

Taxonomy

Topics: Speech Recognition and Synthesis