MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting

Zhiqi Ai; Zhiyong Chen; Shugong Xu

arXiv:2406.07310 · eess.AS · June 12, 2024

TL;DR

MM-KWS is a multilingual user-defined keyword spotting system that enrolls keywords with both text and speech templates and compares their embeddings against the query speech, significantly improving detection accuracy on Mandarin and English tasks.

Contribution

The paper presents a novel multi-modal approach to user-defined keyword spotting that combines text and speech enrollments, a feature extractor built on several multilingual pre-trained models, and data augmentation tools used for hard case mining of confusable words.

Findings

1. Outperforms previous methods on the LibriPhrase and WenetPhrase datasets.
2. Effective on both Mandarin and English tasks.
3. Improved ability to distinguish confusable words.

Abstract

In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models. Subsequently, we validate its effectiveness on Mandarin and English tasks. In addition, we have integrated advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS outperforms prior methods significantly.
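The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how embeddings are compared, so cosine similarity, score averaging, the `threshold` value, and the toy three-dimensional vectors are all assumptions made for illustration. The function names `cosine`, `keyword_score`, and `detect` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, enrollment):
    """Average the query's similarity to each enrolled embedding.

    Assumption: averaging is a stand-in for whatever matching the real model uses.
    """
    sims = [cosine(query, emb) for emb in enrollment.values()]
    return sum(sims) / len(sims)


def detect(query, enrollment, threshold=0.5):
    """Report a keyword hit when the averaged score reaches the (assumed) threshold."""
    return keyword_score(query, enrollment) >= threshold


# Toy enrollment: one embedding per modality named in the abstract.
# Real embeddings would come from pre-trained encoders; these values are made up.
enrollment = {
    "phoneme": [1.0, 0.0, 0.0],
    "text": [1.0, 0.0, 0.0],
    "speech": [0.0, 1.0, 0.0],
}
matching_query = [1.0, 0.0, 0.0]
other_query = [0.0, 0.0, 1.0]

print(detect(matching_query, enrollment))
print(detect(other_query, enrollment))
```

The point of the sketch is only the data flow: several enrollment embeddings derived from text and speech are each scored against a single query speech embedding, and the scores are fused into one decision.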

Code & Models

Repositories

aizhiqi-work/MM-KWS (PyTorch, official)

Taxonomy

Topics: Advanced Text Analysis Techniques · Digital Communication and Language · Speech and dialogue systems

Methods: Focus