GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting
Pai Zhu, Jacob W. Bartel, Dhruuv Agarwal, Kurt Partridge, Hyun Jin Park, Quan Wang

TL;DR
This paper introduces GE2E-KWS, a generalized end-to-end framework for zero-shot keyword spotting that improves training stability, evaluation reliability, and achieves superior accuracy with small, streamable models suitable for on-device deployment.
Contribution
The paper presents a novel GE2E-based training and evaluation framework for zero-shot keyword spotting, enhancing stability and accuracy over existing methods.
Findings
GE2E-KWS outperforms large ASR encoders by 23.6% relative AUC.
It surpasses triplet loss models of the same size by 60.7% AUC.
The models are small, streamable, and require no retraining for new keywords.
Abstract
We propose GE2E-KWS -- a generalized end-to-end training and evaluation framework for customized keyword spotting. Specifically, enrollment utterances are separated and grouped by keywords from the training batch and their embedding centroids are compared to all other test utterance embeddings to compute the loss. This simulates runtime enrollment and verification stages, and improves convergence stability and training speed by optimizing matrix operations compared to SOTA triplet loss approaches. To benchmark different models reliably, we propose an evaluation process that mimics the production environment and compute metrics that directly measure keyword matching accuracy. Trained with GE2E loss, our 419KB quantized conformer model beats a 7.5GB ASR encoder by 23.6% relative AUC, and beats a same size triplet loss model by 60.7% AUC. Our KWS models are natively streamable with low…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Text Analysis Techniques · Topic Modeling · AI in Service Interactions
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Triplet Loss
