Multi-QuartzNet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion

Jian Luo; Jianzong Wang; Ning Cheng; Guilin Jiang; Jing Xiao

arXiv:2011.13090 · eess.AS · November 30, 2020 · SLT · 1 citation

TL;DR

This paper introduces Multi-QuartzNet, an enhanced speech recognition model that employs multi-resolution convolutions, channel-wise attention, and multi-layer feature fusion to improve performance over the original QuartzNet.

Contribution

The paper presents a novel multi-resolution convolution module, a channel-wise attention mechanism, and a multi-layer feature fusion approach for end-to-end speech recognition.

Findings

01. Achieves a character error rate (CER) of 6.77% on the AISHELL-1 dataset.

02. Outperforms the original QuartzNet model.

03. Comes close to state-of-the-art results.
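
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference transcript and the recognizer output, divided by the reference length. The sketch below illustrates that standard definition only; the function names are hypothetical and it is not the paper's evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, one wrong character in a four-character reference gives a CER of 25%, so a CER of 6.77% corresponds to roughly one character error per 15 reference characters.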

Abstract

In this paper, we propose an end-to-end speech recognition network based on NVIDIA's earlier QuartzNet model. To improve model performance, we design three components: (1) a Multi-Resolution Convolution Module, which replaces the original 1D time-channel separable convolution with multi-stream convolutions, where each stream uses a unique dilated stride in its convolutional operations; (2) a Channel-Wise Attention Module, which calculates the attention weight of each convolutional stream by spatial channel-wise pooling; and (3) a Multi-Layer Feature Fusion Module, which reweights each convolutional block using global multi-layer feature maps. Our experiments demonstrate that the Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 data set, which outperforms the original QuartzNet and is close to the state-of-the-art result.
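
To make components (1) and (2) concrete, here is a minimal pure-Python sketch under stated assumptions. It runs one depthwise 1D convolution per dilation rate over the same input, then fuses the streams with per-channel weights obtained by average pooling over time followed by a softmax across streams. The shared kernels, the softmax, the absence of learned attention layers, and all function names are assumptions made for illustration; the abstract does not specify these details. The pointwise half of the time-channel separable convolution is omitted.

```python
import math


def dilated_depthwise_conv1d(x, kernels, dilation):
    """Depthwise 1D convolution with zero padding that preserves length.

    x: list of C channels, each a list of T floats.
    kernels: list of C kernels, each a list of K floats (K odd).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        T = len(channel)
        half = (len(kernel) // 2) * dilation  # reach of the kernel on each side
        row = []
        for t in range(T):
            acc = 0.0
            for k, w in enumerate(kernel):
                idx = t - half + k * dilation
                if 0 <= idx < T:  # positions outside the signal count as zero
                    acc += w * channel[idx]
            row.append(acc)
        out.append(row)
    return out


def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_resolution_block(x, kernels, dilations):
    """One stream per dilation, fused with per-channel attention weights."""
    streams = [dilated_depthwise_conv1d(x, kernels, d) for d in dilations]
    T = len(x[0])
    fused, weights = [], []
    for c in range(len(x)):
        # Assumption: average pooling over time yields one score per stream.
        pooled = [sum(s[c]) / T for s in streams]
        w = softmax(pooled)  # assumption: streams compete via softmax
        weights.append(w)
        fused.append([sum(w[i] * s[c][t] for i, s in enumerate(streams))
                      for t in range(T)])
    return fused, weights
```

Calling `multi_resolution_block(x, kernels, dilations=[1, 2, 3])` returns a fused feature map with the same shape as `x`, plus one weight vector per channel that sums to 1. A trained model would learn the kernels and would likely pass the pooled descriptors through learned layers before normalizing.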

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing