Effects of Number of Filters of Convolutional Layers on Speech Recognition Model Accuracy

James Mou; Jun Li

arXiv:2102.02326·cs.LG·February 5, 2021

TL;DR

This paper investigates how the number of filters in convolutional layers affects speech recognition accuracy, revealing a threshold effect and proposing a lightweight, high-accuracy end-to-end model with potential applications in mobile and embedded systems.

Contribution

The paper systematically studies how the number of convolutional filters affects CNN+RNN speech recognition models, and develops a lightweight end-to-end system that achieves high accuracy with significantly fewer parameters.

Findings

01

Adding a CNN to the RNN improves performance only when the number of CNN filters exceeds a threshold.

02

Proposed model achieves 90.2% accuracy with 4.4 million parameters.

03

Model size is about 10% of DeepSpeech2, with comparable accuracy.
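
To make the link between filter count and model size concrete, here is a minimal sketch of how the parameter count of a small convolutional front end grows with the number of filters. The layer count, the 3x3 kernel, and the single input channel are illustrative assumptions, not the paper's architecture; only the per-layer formula (kernel height x kernel width x input channels x filters, plus one bias per filter) is standard.

```python
def conv2d_params(in_channels, n_filters, kernel_h, kernel_w, bias=True):
    """Trainable parameters in one 2-D convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * n_filters
    return weights + (n_filters if bias else 0)


def stack_params(n_filters, n_layers=2, kernel=(3, 3), in_channels=1):
    """Parameters in a stack of conv layers that all use n_filters filters.

    n_layers, kernel and in_channels are hypothetical defaults chosen
    for illustration; the paper's actual values are not given here.
    """
    total = 0
    channels = in_channels
    for _ in range(n_layers):
        total += conv2d_params(channels, n_filters, *kernel)
        channels = n_filters  # the next layer sees n_filters input channels
    return total


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, stack_params(n))
```

Because every layer after the first has `n_filters` input channels, the count grows roughly quadratically in the number of filters, so a filter sweep changes the size of the convolutional front end quickly.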

Abstract

Inspired by the progress of the end-to-end approach [1], this paper systematically studies how the number of filters in the convolutional layers affects the prediction accuracy of CNN+RNN (a Convolutional Neural Network added to a Recurrent Neural Network) models for ASR (Automatic Speech Recognition). Experimental results show that adding a CNN to an RNN improves the performance of the CNN+RNN speech recognition model only when the CNN's number of filters exceeds a certain threshold; otherwise, some CNN parameter ranges can make adding the CNN to the RNN useless. The results show a strong dependency of word accuracy on the number of filters in the convolutional layers. Based on the experimental results, the paper suggests a possible hypothesis, Sound-2-Vector Embedding (Convolutional Embedding), to explain these observations. Based on this Embedding…
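
One way to picture why the filter count matters in a CNN+RNN model is to look at the size of the per-frame vector the convolutional layers hand to the recurrent layers. The sketch below uses the standard convolution output-size formula; the 80 frequency bins, the kernel size, stride, padding, and layer count are all assumptions for illustration, and treating this vector as the "embedding" is a reading of the abstract rather than something it states.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def rnn_input_dim(n_freq_bins, n_filters, kernel=3, stride=2, padding=1, n_layers=2):
    """Per-frame feature size after the conv stack is flattened for the RNN.

    All defaults are hypothetical; the paper's configuration is not
    given in the abstract.
    """
    freq = n_freq_bins
    for _ in range(n_layers):
        freq = conv_out_size(freq, kernel, stride, padding)
    # Each time step carries n_filters channels times the remaining frequency bins.
    return n_filters * freq


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, rnn_input_dim(80, n))
```

Under these assumptions the vector fed to the RNN scales linearly with the number of filters, so a small filter count gives the recurrent layers a much narrower representation of each frame.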
