Multi-QuartzNet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion
Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao

TL;DR
This paper introduces Multi-QuartzNet, an enhanced speech recognition model that employs multi-resolution convolutions, channel-wise attention, and multi-layer feature fusion to improve performance over the original QuartzNet.
Contribution
The paper presents a novel multi-resolution convolution module, a channel-wise attention mechanism, and a multi-layer feature fusion approach for end-to-end speech recognition.
Findings
Achieves a character error rate (CER) of 6.77% on the AISHELL-1 dataset.
Outperforms the original QuartzNet model.
Comes close to state-of-the-art results.
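For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference transcript and the hypothesis, divided by the number of reference characters. The sketch below shows that standard definition only. It is not the paper's scoring script, and the function names edit_distance and cer are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters here)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, one wrong character in a six-character reference gives a CER of 1/6. Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.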
Abstract
In this paper, we propose an end-to-end speech recognition network based on Nvidia's previous QuartzNet model. To improve model performance, we design three components: (1) a Multi-Resolution Convolution Module, which replaces the original 1D time-channel separable convolution with multi-stream convolutions, where each stream uses a unique dilated stride in its convolutional operations; (2) a Channel-Wise Attention Module, which calculates the attention weight of each convolutional stream by spatial channel-wise pooling; and (3) a Multi-Layer Feature Fusion Module, which reweights each convolutional block by global multi-layer feature maps. Our experiments demonstrate that the Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 dataset, which outperforms the original QuartzNet and is close to the state-of-the-art result.
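The first two components can be illustrated with a toy, dependency-free sketch. It rests on strong assumptions that do not come from the paper: a single input channel, one odd-length kernel shared across streams, and attention weights taken as a softmax over each stream's time-averaged output with no learned parameters. The actual modules presumably use learned, multi-channel layers, and all function names here are hypothetical.

```python
import math


def dilated_conv1d(x, kernel, dilation):
    """Zero-padded, same-length 1D cross-correlation with a given dilation.

    Assumes an odd-length kernel so the window is centered on each step.
    """
    half = (len(kernel) // 2) * dilation
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = t + i * dilation - half
            if 0 <= idx < len(x):  # positions outside the input count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_resolution_block(x, kernel, dilations):
    """Run one stream per dilation, then fuse streams with attention weights."""
    # One convolutional stream per dilation (the "multi-resolution" part).
    streams = [dilated_conv1d(x, kernel, d) for d in dilations]
    # Assumed attention: global average pooling over time, then softmax.
    pooled = [sum(s) / len(s) for s in streams]
    weights = softmax(pooled)
    # Weighted sum of the streams at every time step.
    fused = [
        sum(w * s[t] for w, s in zip(weights, streams))
        for t in range(len(x))
    ]
    return fused, weights
```

Calling multi_resolution_block(signal, [1.0, 1.0, 1.0], [1, 2, 3]) returns a fused sequence of the same length as the input plus one attention weight per stream. Larger dilations widen the receptive field without adding kernel parameters, which is the usual motivation for mixing several dilations in parallel.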
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
