MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting
Lo-Ya Li, Tien-Hong Lo, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen

TL;DR
MALEFA is a lightweight zero-shot keyword spotting framework that improves accuracy and reduces false alarms by multi-granularity learning, suitable for real-time use on resource-limited devices.
Contribution
It introduces a novel joint learning approach with cross-attention and contrastive objectives for zero-shot KWS, enhancing accuracy and efficiency.
Findings
Achieves 90% accuracy on benchmark datasets.
Reduces false alarm rate to 0.007% on AMI dataset.
Supports real-time deployment on resource-constrained devices.
Abstract
User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems are still faced with arduous challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, often leading to a pesky false alarm rate (FAR) in real-world deployments. To mitigate these limitations, we put forward MALEFA, a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective. Evaluations on four public benchmark datasets show that MALEFA achieves a high accuracy of 90%, significantly reducing FAR to 0.007% on the AMI dataset. Beyond its strong…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
