End-to-end Keyword Spotting using Xception-1d
Iván Vallés-Pérez, Juan Gómez-Sanchis, Marcelino Martínez-Sober, Joan Vila-Francés, Antonio J. Serrano-López, Emilio Soria-Olivas

TL;DR
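This paper presents an end-to-end keyword spotting system based on Xception-1d, an adaptation of the Xception architecture to audio. It reaches about 96% accuracy when classifying audio clips into 35 categories, a state-of-the-art result, and beats human annotation on the most complex tasks proposed.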
Contribution
The work adapts and tweaks the Xception architecture, originally successful in computer vision, for audio keyword spotting, and reports state-of-the-art accuracy on the task.
Findings
Achieved about 96% accuracy on 35-category audio clip classification
Outperformed human annotation on the most complex tasks proposed
Showed that an adapted Xception model (Xception-1d) is effective for keyword spotting
Abstract
The field of conversational agents is growing fast and there is an increasing need for algorithms that enhance natural interaction. In this work we show how we achieved state-of-the-art results in the Keyword Spotting field by adapting and tweaking the Xception algorithm, which achieved outstanding results in several computer vision tasks. We obtained about 96% accuracy when classifying audio clips belonging to 35 different categories, beating human annotation at the most complex tasks proposed.
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution · Average Pooling · 1x1 Convolution · Dense Connections · Max Pooling · Softmax · Global Average Pooling · Convolution
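
The depthwise separable convolution listed among the methods is the building block of Xception, and it carries over to one-dimensional signals such as audio. The following is a minimal NumPy sketch of that operation in 1D, not the authors' implementation: the function name, the tensor layout (channels, time), the zero "same" padding, and the absence of biases are all assumptions made for illustration, and the page does not describe the actual layer configuration of Xception-1d.

```python
import numpy as np


def depthwise_separable_conv1d(x, depthwise_kernels, pointwise_weights):
    """Hypothetical helper illustrating a 1D depthwise separable convolution.

    x:                 (channels_in, length) input signal
    depthwise_kernels: (channels_in, kernel_size), one filter per input channel
    pointwise_weights: (channels_out, channels_in), the 1x1 convolution
    Assumes stride 1, zero "same" padding, and no bias terms.
    """
    channels_in, _ = x.shape
    kernel_size = depthwise_kernels.shape[1]
    pad_left = (kernel_size - 1) // 2
    pad_right = kernel_size - 1 - pad_left
    padded = np.pad(x, ((0, 0), (pad_left, pad_right)))

    # Depthwise step: each channel is filtered independently by its own kernel
    # (cross-correlation, as is conventional in deep learning).
    depthwise = np.stack([
        np.correlate(padded[c], depthwise_kernels[c], mode="valid")
        for c in range(channels_in)
    ])

    # Pointwise step: a 1x1 convolution mixes channels at every time step.
    return pointwise_weights @ depthwise


# Usage: 2 input channels, 5 time steps, identity depthwise kernels,
# and a pointwise layer that sums the channels into 4 output channels.
x = np.arange(10, dtype=float).reshape(2, 5)
dw = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
pw = np.ones((4, 2))
y = depthwise_separable_conv1d(x, dw, pw)
print(y.shape)
```

Splitting the operation this way needs `channels_in * kernel_size + channels_in * channels_out` weights, compared with `channels_in * channels_out * kernel_size` for a standard convolution with the same kernel size.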
