MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting
Zhiqi Ai, Zhiyong Chen, Shugong Xu

TL;DR
MM-KWS is a multilingual user-defined keyword spotting system that enrolls keywords from both text and speech templates and compares the resulting phoneme, text, and speech embeddings with the query speech. It outperforms prior methods on Mandarin and English benchmarks.
Contribution
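The paper presents a multi-modal approach to user-defined keyword spotting that enrolls keywords from both text and speech templates, builds its feature extractor from multilingual pre-trained models, and uses data augmentation tools to mine hard cases of confusable words.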
Findings
Outperforms previous methods on LibriPhrase and WenetPhrase datasets.
Effective across Mandarin and English tasks.
Enhanced ability to distinguish confusable words.
Abstract
In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models. Subsequently, we validate its effectiveness on Mandarin and English tasks. In addition, we have integrated advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS outperforms prior methods significantly.
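The comparison step described in the abstract (enrolled embeddings matched against a query speech embedding) can be illustrated with a minimal sketch. This is not the paper's architecture: the abstract does not say how embeddings are compared or fused, so cosine similarity, score averaging, the `threshold` value, and the names `cosine` and `detect_keyword` are all assumptions made for illustration, and the 3-dimensional vectors are made-up toy data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_keyword(enrollment, query, threshold=0.5):
    """Score a query embedding against each enrolled embedding and average the scores.

    enrollment: dict mapping a modality name to its embedding vector.
    Cosine similarity and mean fusion are assumed stand-ins, not the paper's method.
    Returns (is_keyword, fused_score).
    """
    scores = [cosine(vec, query) for vec in enrollment.values()]
    fused = sum(scores) / len(scores)
    return fused >= threshold, fused


# Toy enrollment: one hypothetical embedding per modality.
enrollment = {
    "phoneme": [1.0, 0.0, 0.0],
    "text": [1.0, 0.0, 0.0],
    "speech": [0.0, 1.0, 0.0],
}

match, score = detect_keyword(enrollment, [1.0, 0.0, 0.0])
miss, miss_score = detect_keyword(enrollment, [0.0, 0.0, 1.0])
```

In this toy setup, a query aligned with two of the three enrolled embeddings clears the threshold, while a query orthogonal to all of them does not. A real system would learn the comparison function rather than use a fixed similarity and threshold.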
Taxonomy
Topics: Advanced Text Analysis Techniques · Digital Communication and Language · Speech and dialogue systems
