EdgeSpot: Efficient and High-Performance Few-Shot Model for Keyword Spotting
Oguzhan Buyuksolak, Alican Gok, Osman Erman Okman

TL;DR
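EdgeSpot is an efficient few-shot keyword spotting model for edge devices. It combines an optimized BC-ResNet-based backbone, a trainable Per-Channel Energy Normalization (PCEN) frontend, lightweight temporal self-attention, and knowledge distillation from a self-supervised teacher, and it achieves higher accuracy at a fixed false-alarm rate than strong BC-ResNet baselines.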
Contribution
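The paper introduces EdgeSpot, a lightweight few-shot keyword spotting model for edge devices. Architecturally, it pairs an optimized BC-ResNet-based acoustic backbone with a trainable PCEN frontend and temporal self-attention; for training, it uses knowledge distillation from a self-supervised teacher together with Sub-center ArcFace loss.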
Findings
EdgeSpot-4, the largest variant, improves 10-shot accuracy from 73.7% to 82.0% at a 1% false-alarm rate (FAR).
EdgeSpot-4 requires only 29.4M MACs and 128k parameters.
EdgeSpot consistently achieves higher accuracy than strong BC-ResNet baselines at a fixed FAR.
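The "accuracy at a fixed FAR" metric in the findings above can be illustrated with a minimal sketch. The summary does not give the paper's exact evaluation protocol, so the function name `accuracy_at_far`, its inputs, and the thresholding rule are all assumptions: the detection threshold is chosen on non-keyword (negative) scores so that at most the target fraction of them is accepted, and a keyword sample counts as correct only if it is both accepted and assigned the right class.

```python
import numpy as np

def accuracy_at_far(pos_scores, pos_correct, neg_scores, target_far=0.01):
    """Hypothetical helper: accuracy on keyword samples at a fixed false-alarm rate.

    pos_scores  : confidence score of each keyword (positive) sample
    pos_correct : True where the predicted keyword class is the right one
    neg_scores  : confidence score of each non-keyword (negative) sample
    """
    pos_scores = np.asarray(pos_scores, dtype=float)
    pos_correct = np.asarray(pos_correct, dtype=bool)
    neg_scores = np.asarray(neg_scores, dtype=float)

    # Number of negatives that may be accepted without exceeding the target FAR.
    # The small epsilon guards against floating-point error in the product.
    allowed = int(np.floor(target_far * len(neg_scores) + 1e-9))

    # Accept only scores strictly above the (allowed + 1)-th highest negative,
    # so that at most `allowed` negatives pass the threshold.
    sorted_neg = np.sort(neg_scores)[::-1]
    threshold = sorted_neg[allowed] if allowed < len(sorted_neg) else -np.inf

    accepted = pos_scores > threshold
    accuracy = float(np.mean(accepted & pos_correct))
    measured_far = float(np.mean(neg_scores > threshold))
    return accuracy, measured_far, threshold
```

Under this definition a rejected keyword sample counts as an error, so accuracy drops as the FAR target is tightened. Other protocols, such as per-keyword thresholds, are equally plausible.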
Abstract
We introduce EdgeSpot, an efficient few-shot keyword spotting model for edge devices that pairs an optimized version of a BC-ResNet-based acoustic backbone with a trainable Per-Channel Energy Normalization frontend and lightweight temporal self-attention. Knowledge distillation is utilized during training by employing a self-supervised teacher model, optimized with Sub-center ArcFace loss. This study demonstrates that the EdgeSpot model consistently provides better accuracy at a fixed false-alarm rate (FAR) than strong BC-ResNet baselines. The largest variant, EdgeSpot-4, improves the 10-shot accuracy at 1% FAR from 73.7% to 82.0% while requiring only 29.4M MACs and 128k parameters.
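As an illustration of the Per-Channel Energy Normalization frontend named in the abstract, the sketch below implements the standard PCEN formulation: each filterbank energy E(t, f) is divided by a smoothed version of itself, M(t, f) = (1 - s) M(t-1, f) + s E(t, f), and then compressed as (E / (eps + M)^alpha + delta)^r - delta^r. The function name `pcen` and the default parameter values are illustrative assumptions, not EdgeSpot's settings. In a trainable frontend, alpha, delta, and r would be learned per channel rather than fixed as here.

```python
import numpy as np

def pcen(energy, alpha=0.98, delta=2.0, r=0.5, s=0.025, eps=1e-6):
    """Fixed-parameter PCEN over a (frames, channels) array of non-negative energies.

    alpha, delta, and r may be scalars or per-channel arrays of shape (channels,);
    the defaults are illustrative and are not taken from the EdgeSpot paper.
    """
    energy = np.asarray(energy, dtype=float)

    # First-order IIR smoother along time, run independently for each channel.
    smoothed = np.empty_like(energy)
    smoothed[0] = energy[0]  # assumption: initialise the smoother with the first frame
    for t in range(1, len(energy)):
        smoothed[t] = (1.0 - s) * smoothed[t - 1] + s * energy[t]

    # Adaptive gain control followed by root compression with an offset.
    gain = (eps + smoothed) ** (-alpha)
    return (energy * gain + delta) ** r - delta ** r
```

With alpha close to 1, the gain stage largely cancels slow changes in loudness, which is why PCEN is often used in place of a fixed log compression in keyword spotting frontends.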
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
