TL;DR
This paper introduces a novel ResNet-based keyword spotting model using temporal Lambda networks, achieving state-of-the-art accuracy with significantly reduced complexity and faster inference compared to attention-based models.
Contribution
It pioneers the application of Lambda networks in speech, creating a lightweight, efficient architecture that outperforms existing models in accuracy and speed.
Findings
Achieves state-of-the-art accuracy on the Google Speech Commands dataset.
Is 85% lighter than the Transformer-based KWT and 65% lighter than the convolutional Res15.
Runs inference up to 100 times faster than those counterparts.
Abstract
Models based on attention mechanisms have shown unprecedented speech recognition performance. However, they are computationally expensive and unnecessarily complex for keyword spotting, a task targeted at small-footprint devices. This work explores the application of Lambda networks, an alternative framework for capturing long-range interactions without attention, to the keyword spotting task. We propose a novel ResNet-based model in which the residual blocks are swapped for temporal Lambda layers. Furthermore, the proposed architecture is built upon one-dimensional temporal convolutions that further reduce its complexity. The presented model not only reaches state-of-the-art accuracy on the Google Speech Commands dataset, but is also 85% and 65% lighter than its Transformer-based (KWT) and convolutional (Res15) counterparts while being up to 100 times faster. To the best of our…
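To make "capturing long-range interactions without attention" concrete, here is a minimal NumPy sketch of a content-only lambda applied along the time axis. It follows the general Lambda networks formulation, not necessarily the exact layer used in this paper: the positional term and the one-dimensional convolutions are omitted, and the function name, the random weights, and the dimensions (98 frames, 40 features, key depth 16) are hypothetical values chosen only for illustration.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def content_lambda_layer(x, w_q, w_k, w_v):
    """Content-only lambda over time (illustrative sketch, not the paper's code).

    x: (T, d) sequence of T feature frames.
    Returns an array of shape (T, v_dim).
    """
    q = x @ w_q                       # queries, (T, k_dim)
    k_bar = softmax(x @ w_k, axis=0)  # keys normalized across time, (T, k_dim)
    v = x @ w_v                       # values, (T, v_dim)
    lam = k_bar.T @ v                 # one (k_dim, v_dim) summary of the whole sequence
    return q @ lam                    # each frame applies the same linear map

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
T, d, k_dim, v_dim = 98, 40, 16, 40
x = rng.standard_normal((T, d))
w_q = rng.standard_normal((d, k_dim))
w_k = rng.standard_normal((d, k_dim))
w_v = rng.standard_normal((d, v_dim))

y = content_lambda_layer(x, w_q, w_k, w_v)
print(y.shape)
```

The point of the sketch is the order of operations: the keys and values are contracted first into a small `(k_dim, v_dim)` matrix, so no `T x T` attention map is ever formed, which is where the savings over attention-based models would come from under these assumptions.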
