Speech Recognition: Keyword Spotting Through Image Recognition

Sanjay Krishna Gouda; Salil Kanetkar; David Harrison; Manfred K. Warmuth

arXiv:1803.03759·stat.ML·November 25, 2020·19 cites

TL;DR

This paper explores converting speech recognition into an image classification problem using CNNs, comparing different architectures and applying Virtual Adversarial Training to improve robustness.

Contribution

It frames keyword spotting as an image classification task and demonstrates that Virtual Adversarial Training (VAT) can be applied as a regularizer in this domain.
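To make the regularizer concrete, here is a minimal NumPy sketch of how a virtual adversarial perturbation is commonly approximated with a power iteration. It is an illustration under stated assumptions, not the paper's implementation: a linear softmax classifier stands in for the CNN so that the gradient can be written analytically, and the names `vat_perturbation`, `vat_loss`, `eps`, `xi`, and `n_power` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, tiny=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + tiny) - np.log(q + tiny))))

def vat_perturbation(W, b, x, eps=1.0, xi=1e-3, n_power=1, rng=None):
    """Approximate the perturbation of norm eps that most changes the prediction.

    Assumption: the model is logits = W @ x + b (a stand-in for a CNN).
    For this model the gradient of KL(p || q) with respect to the
    perturbation is W.T @ (q - p), so no autodiff library is needed.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    p = softmax(W @ x + b)                 # prediction on the clean input
    d = rng.standard_normal(x.shape)       # random starting direction
    for _ in range(n_power):
        d = d / np.linalg.norm(d)
        q = softmax(W @ (x + xi * d) + b)  # prediction on a slightly nudged input
        d = W.T @ (q - p)                  # gradient of the KL term at that point
    return eps * d / (np.linalg.norm(d) + 1e-12)

def vat_loss(W, b, x, **kwargs):
    """Regularization term: KL between clean and adversarially perturbed predictions."""
    p = softmax(W @ x + b)
    r_adv = vat_perturbation(W, b, x, **kwargs)
    return kl(p, softmax(W @ (x + r_adv) + b))

# Toy usage: 12 output classes (10 words, unknown, silence) and a 64-dim input.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((12, 64))
b = np.zeros(12)
x = rng.standard_normal(64)
r_adv = vat_perturbation(W, b, x, eps=0.5, rng=rng)
loss = vat_loss(W, b, x, eps=0.5)
```

In training, a term like `vat_loss` would be added to the usual classification loss. It needs no labels, because it only compares the model's own predictions before and after the perturbation.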

Findings

1. CNN models can effectively classify speech commands.

2. Virtual Adversarial Training enhances model robustness.

3. Conversion to image domain leverages advanced CNN techniques.

Abstract

The problem of identifying voice commands has always been a challenge due to the presence of noise and variability in speed, pitch, etc. We will compare the efficacies of several neural network architectures for the speech recognition problem. In particular, we will build a model to determine whether a one-second audio clip contains a particular word (out of a set of 10), an unknown word, or silence. The models to be implemented are a CNN recommended by the TensorFlow Speech Recognition tutorial, a low-latency CNN, and an adversarially trained CNN. The result is a demonstration of how to convert a problem in audio recognition to the better-studied domain of image classification, where the powerful techniques of convolutional neural networks are fully developed. Additionally, we demonstrate the applicability of the technique of Virtual Adversarial Training (VAT) to this problem domain,…
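The audio-to-image conversion mentioned in the abstract can be sketched as a log-magnitude spectrogram, which turns a one-second clip into a 2-D array that a CNN can treat like a grayscale image. This is a minimal NumPy sketch, not the authors' preprocessing pipeline: the 16 kHz sample rate, 25 ms frames, 10 ms hop, and the function name `log_spectrogram` are all assumptions chosen for illustration.

```python
import numpy as np

def log_spectrogram(clip, frame_len=400, hop=160, eps=1e-10):
    """Convert a 1-D audio clip into a 2-D log-magnitude spectrogram.

    frame_len and hop are in samples (assumed: 25 ms and 10 ms at 16 kHz).
    Returns an array of shape (frequency_bins, time_frames).
    """
    n_frames = 1 + (len(clip) - frame_len) // hop
    # Index matrix: row i selects the samples belonging to frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = clip[idx] * np.hanning(frame_len)   # window each frame
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps).T

# Toy input: one second of a 440 Hz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440 * t)

image = log_spectrogram(clip)
peak_bin = int(image.mean(axis=1).argmax())      # strongest frequency bin
```

With these assumed settings each frequency bin spans 40 Hz, so the 440 Hz tone shows up as a bright horizontal band at bin 11. A classifier would then map such an array to one of 12 outputs: the 10 target words, "unknown", or "silence".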

Taxonomy

Topics: Handwritten Text Recognition Techniques · Speech Recognition and Synthesis · Music and Audio Processing