Effects of Number of Filters of Convolutional Layers on Speech Recognition Model Accuracy
James Mou, Jun Li

TL;DR
This paper investigates how the number of filters in convolutional layers affects speech recognition accuracy, revealing a threshold effect and proposing a lightweight, high-accuracy end-to-end model with potential applications in mobile and embedded systems.
Contribution
It systematically studies filter number effects on CNN+RNN speech models and develops a lightweight end-to-end system achieving high accuracy with significantly fewer parameters.
Findings
Adding a CNN to the RNN improves accuracy only when the number of CNN filters exceeds a threshold.
The proposed model achieves 90.2% accuracy with 4.4 million parameters.
The model is about 10% of the size of DeepSpeech2, with comparable accuracy.
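To make the link between filter count and model size concrete, the following minimal sketch counts the trainable parameters of a small stack of 2-D convolutional layers as the number of filters varies. The 3x3 kernel, the two-layer depth, the single input channel, and the helper names `conv2d_params` and `conv_stack_params` are illustrative assumptions, not the configuration used in the paper.

```python
def conv2d_params(in_channels, num_filters, kernel_h, kernel_w, bias=True):
    """Trainable parameters in one 2-D convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * num_filters
    return weights + (num_filters if bias else 0)


def conv_stack_params(num_filters, kernel=(3, 3), layers=2, in_channels=1):
    """Total parameters of a stack of identical-width conv layers.

    Assumed, hypothetical setup: every layer uses the same number of
    filters, and the first layer sees a single-channel spectrogram.
    """
    total = 0
    channels = in_channels
    for _ in range(layers):
        total += conv2d_params(channels, num_filters, kernel[0], kernel[1])
        channels = num_filters  # the next layer consumes this layer's filters
    return total


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, conv_stack_params(n))
```

Under these assumptions the count grows roughly quadratically with the number of filters once there is more than one layer, because each layer's filters become the next layer's input channels. This is why the filter count is a natural knob when trading accuracy against model size.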
Abstract
Inspired by the progress of the end-to-end approach [1], this paper systematically studies how the number of filters in the convolutional layers affects the prediction accuracy of CNN+RNN models (convolutional neural networks combined with recurrent neural networks) for automatic speech recognition (ASR). Experimental results show that adding a CNN to the RNN improves the performance of the CNN+RNN speech recognition model only when the number of CNN filters exceeds a certain threshold; otherwise, for some CNN parameter ranges, adding the CNN to the RNN model is useless. Our results show a strong dependency of word accuracy on the number of filters in the convolutional layers. Based on the experimental results, the paper suggests a possible hypothesis, Sound-2-Vector Embedding (Convolutional Embedding), to explain the above observations. Based on this Embedding…
