Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices
Yael Segal-Feldman, Ann R. Bradlow, Matthew Goldrick, and Joseph Keshet

TL;DR
This paper presents a novel open-vocabulary keyword spotting model optimized for small devices, leveraging hyper-matched filters and a Perceiver-based detection network to achieve high accuracy and robustness, even in out-of-domain conditions.
Contribution
Introduces a keyword spotting model with hyper-network generated filters and a Perceiver-based detection mechanism, achieving state-of-the-art accuracy on small-footprint devices.
Findings
Achieves state-of-the-art detection accuracy.
Generalizes well to out-of-domain conditions and second-language (L2) speech.
Smallest model (4.2M parameters) matches larger models.
Abstract
Open-vocabulary keyword spotting (KWS) refers to the task of detecting words or terms within speech recordings, regardless of whether they were included in the training data. This paper introduces an open-vocabulary keyword spotting model with state-of-the-art detection accuracy for small-footprint devices. The model is composed of a speech encoder, a target keyword encoder, and a detection network. The speech encoder is either a tiny Whisper or a tiny Conformer. The target keyword encoder is implemented as a hyper-network that takes the desired keyword as a character string and generates a unique set of weights for a convolutional layer, which can be considered as a keyword-specific matched filter. The detection network uses the matched-filter weights to perform a keyword-specific convolution, which guides the cross-attention mechanism of a Perceiver module in determining whether the…
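The keyword-specific matched filter described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the vocabulary, all dimensions, the mean-pooled character embedding, the single linear hyper-network layer, and the names `keyword_to_filter` and `keyword_conv` are assumptions chosen for illustration, and the random features stand in for the output of a speech encoder. The Perceiver cross-attention stage is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
VOCAB = "abcdefghijklmnopqrstuvwxyz' "
EMB_DIM = 16    # character embedding size
FEAT_DIM = 8    # speech-encoder feature size
OUT_CH = 4      # output channels of the keyword-specific convolution
KERNEL = 3      # kernel width in frames

# Hypothetical hyper-network parameters: an embedding table and one linear layer.
char_emb = rng.normal(size=(len(VOCAB), EMB_DIM))
w_hyper = rng.normal(size=(EMB_DIM, OUT_CH * FEAT_DIM * KERNEL)) * 0.1


def keyword_to_filter(keyword):
    """Map a character string to conv weights of shape (OUT_CH, FEAT_DIM, KERNEL)."""
    idx = [VOCAB.index(c) for c in keyword.lower()]
    # Mean pooling is a simplification that ignores character order.
    pooled = char_emb[idx].mean(axis=0)
    return (pooled @ w_hyper).reshape(OUT_CH, FEAT_DIM, KERNEL)


def keyword_conv(features, weights):
    """Valid 1-D cross-correlation over time.

    features: (T, FEAT_DIM); weights: (OUT_CH, FEAT_DIM, KERNEL).
    Returns an array of shape (T - KERNEL + 1, OUT_CH).
    """
    num_frames = features.shape[0]
    k = weights.shape[2]
    # Sliding windows over time: (T - k + 1, k, FEAT_DIM).
    windows = np.stack([features[t:t + k] for t in range(num_frames - k + 1)])
    return np.einsum("tkf,ofk->to", windows, weights)


# Random stand-in for speech-encoder output: 50 frames of FEAT_DIM features.
features = rng.normal(size=(50, FEAT_DIM))
w_a = keyword_to_filter("hello")
w_b = keyword_to_filter("world")
out = keyword_conv(features, w_a)
```

The point of the sketch is that the convolution weights are an output of the keyword encoder rather than fixed learned parameters, so a new keyword yields a new filter without retraining.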
