CarneliNet: Neural Mixture Model for Automatic Speech Recognition

Aleksei Kalinov; Somshubra Majumdar; Jagadeesh Balam; Boris Ginsburg

arXiv:2107.10708·eess.AS·July 23, 2021

TL;DR

CarneliNet is a neural mixture model for speech recognition that replaces a very deep network with parallel shallow networks. It achieves close to state-of-the-art results for CTC-based models, and the number of parallel sub-networks can be reconfigured without retraining.

Contribution

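The paper proposes a neural mixture architecture built from parallel shallow networks rather than a single very deep network. The design targets streaming scenarios, where the large receptive field of a deep model can hurt performance, and allows the number of sub-networks to be reconfigured dynamically.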

Findings

1. Achieved near state-of-the-art results on LibriSpeech, MLS, and AISHELL-2 datasets.
2. Demonstrated dynamic reconfiguration of sub-networks without retraining.
3. Validated the effectiveness of parallel shallow networks over deep models.

Abstract

End-to-end automatic speech recognition systems have achieved great accuracy by using deeper and deeper models. However, the increased depth comes with a larger receptive field that can negatively impact model performance in streaming scenarios. We propose an alternative approach that we call Neural Mixture Model. The basic idea is to introduce a parallel mixture of shallow networks instead of a very deep network. To validate this idea we design CarneliNet, a CTC-based neural network composed of three mega-blocks. Each mega-block consists of multiple parallel shallow sub-networks based on 1D depthwise-separable convolutions. We evaluate the model on LibriSpeech, MLS and AISHELL-2 datasets and achieve close to state-of-the-art results for CTC-based models. Finally, we demonstrate that one can dynamically reconfigure the number of parallel sub-networks to accommodate the computational…
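To make the mega-block idea from the abstract concrete, here is a minimal pure-Python sketch, not the authors' implementation. The names `ShallowSubNetwork`, `MegaBlock`, `receptive_field`, and the `active` argument are hypothetical. The ReLU activation, the random weights, and combining branch outputs by averaging are assumptions, since the abstract does not say how sub-network outputs are merged. Normalization and any residual connections are left out.

```python
import random


def depthwise_separable_conv1d(x, depthwise, pointwise):
    """Apply a depthwise conv (one kernel per channel), then a 1x1 pointwise conv.

    x:         list of C_in channels, each a list of T samples
    depthwise: list of C_in odd-length kernels
    pointwise: list of C_out rows, each holding C_in mixing weights
    """
    T = len(x[0])
    filtered = []
    for channel, kernel in zip(x, depthwise):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad  # zero "same" padding
        filtered.append([
            sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
            for t in range(T)
        ])
    # The pointwise step mixes channels independently at every time step.
    return [
        [sum(w * filtered[c][t] for c, w in enumerate(row)) for t in range(T)]
        for row in pointwise
    ]


def relu(x):
    return [[max(0.0, v) for v in channel] for channel in x]


class ShallowSubNetwork:
    """A short stack of depthwise-separable conv layers (hypothetical layout)."""

    def __init__(self, channels, kernel_size, depth, rng):
        self.layers = []
        for _ in range(depth):
            dw = [[rng.uniform(-0.5, 0.5) for _ in range(kernel_size)]
                  for _ in range(channels)]
            pw = [[rng.uniform(-0.5, 0.5) for _ in range(channels)]
                  for _ in range(channels)]
            self.layers.append((dw, pw))

    def __call__(self, x):
        for dw, pw in self.layers:
            x = relu(depthwise_separable_conv1d(x, dw, pw))
        return x


class MegaBlock:
    """Parallel shallow sub-networks whose outputs are averaged (an assumption)."""

    def __init__(self, channels, kernel_size, depth, num_branches, seed=0):
        rng = random.Random(seed)
        self.branches = [ShallowSubNetwork(channels, kernel_size, depth, rng)
                         for _ in range(num_branches)]

    def __call__(self, x, active=None):
        # `active` selects how many branches to run, with no change to the weights.
        count = len(self.branches) if active is None else active
        if not 1 <= count <= len(self.branches):
            raise ValueError("active must be between 1 and the number of branches")
        outputs = [branch(x) for branch in self.branches[:count]]
        return [
            [sum(out[c][t] for out in outputs) / count for t in range(len(x[0]))]
            for c in range(len(x))
        ]


def receptive_field(kernel_size, depth):
    """Receptive field (in frames) of `depth` stacked stride-1 convolutions."""
    return 1 + depth * (kernel_size - 1)


block = MegaBlock(channels=2, kernel_size=3, depth=2, num_branches=4)
x = [[1.0, 0.0, 2.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5, 0.5]]
full = block(x)               # all four branches
reduced = block(x, active=2)  # fewer branches, same weights, less compute
```

In this sketch, adding parallel branches leaves the receptive field unchanged, because it depends only on the depth and kernel size of a single branch: `receptive_field(3, 2)` is 5 frames however many branches run, while a sequential stack of eight such layers would cover 17 frames. Reducing `active` lowers the cost of a forward pass without touching the weights, which is one simple way to read the reconfiguration claim. Whether accuracy holds up after such a reduction depends on how the model was trained, and the sketch does not address that.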

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing