Speech Recognition: Keyword Spotting Through Image Recognition
Sanjay Krishna Gouda, Salil Kanetkar, David Harrison, Manfred K. Warmuth

TL;DR
This paper explores converting speech recognition into an image classification problem using CNNs, comparing different architectures and applying Virtual Adversarial Training to improve robustness.
Contribution
It frames speech command recognition as an image classification task and demonstrates that Virtual Adversarial Training (VAT) can be applied as a regularizer in this setting.
Findings
CNN models can effectively classify speech commands.
Virtual Adversarial Training enhances model robustness.
Converting audio to the image domain lets the task draw on well-developed CNN techniques.
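As a minimal sketch of the audio-to-image conversion, a one-second clip can be turned into a log-power spectrogram that a CNN then treats as a single-channel image. The 16 kHz sample rate, the 30 ms window, and the 10 ms hop are assumptions (this summary does not state them), the function name `log_spectrogram` is hypothetical, and this is not necessarily the authors' preprocessing.

```python
import numpy as np

def log_spectrogram(wave, sample_rate=16000, win_ms=30, hop_ms=10):
    """Return a (frames, frequency_bins) log-power spectrogram.

    sample_rate, win_ms and hop_ms are illustrative assumptions.
    """
    win = int(sample_rate * win_ms / 1000)   # samples per analysis window
    hop = int(sample_rate * hop_ms / 1000)   # samples between window starts
    n_frames = 1 + (len(wave) - win) // hop
    window = np.hanning(win)                 # taper each frame to reduce leakage
    frames = np.stack(
        [wave[i * hop: i * hop + win] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)             # small offset avoids log(0)

# Example: a synthetic one-second 1 kHz tone stands in for a spoken command.
t = np.arange(16000) / 16000.0
image = log_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
```

With these settings a 16,000-sample clip yields a 98 x 241 array, which can be fed to an image classifier like any other grayscale input.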
Abstract
The problem of identifying voice commands has always been a challenge due to the presence of noise and variability in speed, pitch, etc. We will compare the efficacies of several neural network architectures for the speech recognition problem. In particular, we will build a model to determine whether a one-second audio clip contains a particular word (out of a set of 10), an unknown word, or silence. The models to be implemented are a CNN recommended by the TensorFlow Speech Recognition tutorial, a low-latency CNN, and an adversarially trained CNN. The result is a demonstration of how to convert a problem in audio recognition to the better-studied domain of image classification, where the powerful techniques of convolutional neural networks are fully developed. Additionally, we demonstrate the applicability of the technique of Virtual Adversarial Training (VAT) to this problem domain,…
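The idea behind VAT can be sketched as follows: find a small input perturbation that most changes the model's predicted distribution, then penalize that change. The sketch below is an illustration under stated assumptions, not the paper's implementation. It swaps the CNN for a linear softmax classifier so that the gradient has a closed form, and the names `vat_perturbation` and `vat_loss` and the values of `eps`, `xi`, and `n_power` are all hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, tiny=1e-12):
    """KL(p || q) per example."""
    return np.sum(p * (np.log(p + tiny) - np.log(q + tiny)), axis=-1)

def vat_perturbation(W, b, x, eps=1.0, xi=1e-2, n_power=1, rng=None):
    """Approximate the norm-eps perturbation that maximizes KL(p(x) || p(x + r)).

    Assumes a linear softmax model with logits = x @ W.T + b (a stand-in
    for the CNN); x has shape (N, D), W has shape (C, D), b has shape (C,).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    p = softmax(x @ W.T + b)                 # predictions on clean inputs
    d = rng.standard_normal(x.shape)         # random starting direction
    d /= np.linalg.norm(d, axis=-1, keepdims=True) + 1e-12
    for _ in range(n_power):                 # power iteration
        q = softmax((x + xi * d) @ W.T + b)
        g = (q - p) @ W                      # gradient of the KL w.r.t. the perturbation
        d = g / (np.linalg.norm(g, axis=-1, keepdims=True) + 1e-12)
    return eps * d

def vat_loss(W, b, x, **kwargs):
    """Mean KL between clean and adversarially perturbed predictions."""
    r = vat_perturbation(W, b, x, **kwargs)
    p = softmax(x @ W.T + b)
    q = softmax((x + r) @ W.T + b)
    return kl(p, q).mean()
```

In training, a term like `vat_loss` would be added to the usual classification loss, with `p` treated as a constant, so the model is pushed toward predictions that stay stable under small input changes. Because it needs no labels, the term acts as a regularizer.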
Taxonomy
Topics: Handwritten Text Recognition Techniques · Speech Recognition and Synthesis · Music and Audio Processing
