Deep CNNs along the Time Axis with Intermap Pooling for Robustness to Spectral Variations
Hwaran Lee, Geonmin Kim, Ho-Gyeong Kim, Sang-Hoon Oh, and Soo-Young Lee

TL;DR
This paper introduces a deep CNN architecture that convolves along the time axis and adds an intermap pooling (IMP) layer across feature maps, improving robustness to spectral variations in speech recognition.
Contribution
It proposes the intermap pooling (IMP) layer and argues for convolution along the time axis rather than the frequency axis, making CNNs for speech recognition less sensitive to spectral variations.
Findings
Achieved a word error rate (WER) of 12.7% on the SWB part of the Hub5'2000 test set.
Reached this result without speaker adaptation techniques.
Outperformed previous CNN-based models in spectral invariance.
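For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `wer` and the example sentences are hypothetical, and it is not the scoring tool used in the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(wer("a b c d", "a x c d"))
```

Under this definition, a WER of 12.7% corresponds to roughly 127 word errors per 1,000 reference words.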
Abstract
Convolutional neural networks (CNNs) with convolutional and pooling operations along the frequency axis have been proposed to attain invariance to frequency shifts of features. However, this is inappropriate with regard to the fact that acoustic features vary in frequency. In this paper, we contend that convolution along the time axis is more effective. We also propose the addition of an intermap pooling (IMP) layer to deep CNNs. In this layer, filters in each group extract common but spectrally variant features, then the layer pools the feature maps of each group. As a result, the proposed IMP CNN can achieve insensitivity to spectral variations characteristic of different speakers and utterances. The effectiveness of the IMP CNN architecture is demonstrated on several LVCSR tasks. Even without speaker adaptation techniques, the architecture achieved a WER of 12.7% on the SWB part of…
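The idea described in the abstract can be illustrated with a minimal pure-Python sketch. It makes several assumptions that the abstract does not state: each filter spans all frequency bins and slides only along time, the nonlinearity is a ReLU, and the pooling within a group is a max. The names `conv_time`, `intermap_pool`, and `imp_layer` are hypothetical and not taken from the paper.

```python
def conv_time(features, filt):
    """Valid cross-correlation along the time axis only.

    features: list of T frames, each a list of F frequency-bin values.
    filt: list of K rows of F weights; the filter covers every frequency
          bin, so it has nowhere to slide except along time.
    Returns one feature map of length T - K + 1.
    """
    T, K = len(features), len(filt)
    out = []
    for t in range(T - K + 1):
        s = 0.0
        for k in range(K):
            s += sum(x * w for x, w in zip(features[t + k], filt[k]))
        out.append(max(s, 0.0))  # ReLU (assumed nonlinearity)
    return out


def intermap_pool(feature_maps, group_size):
    """Max-pool across the feature maps of each group, per time step."""
    if len(feature_maps) % group_size != 0:
        raise ValueError("number of maps must be divisible by group_size")
    pooled = []
    for g in range(0, len(feature_maps), group_size):
        group = feature_maps[g:g + group_size]
        pooled.append([max(vals) for vals in zip(*group)])
    return pooled


def imp_layer(features, filters, group_size):
    """Time-axis convolution followed by intermap pooling."""
    maps = [conv_time(features, f) for f in filters]
    return intermap_pool(maps, group_size)


# Toy input: 3 frames x 2 frequency bins.
features = [[1, 0], [0, 1], [1, 1]]
# Two filters in one group that detect the same pattern in different bins.
f_low = [[1, 0], [0, 0]]
f_high = [[0, 1], [0, 0]]
print(imp_layer(features, [f_low, f_high], group_size=2))
```

In this toy case the two filters fire at different time steps because the energy sits in different frequency bins, but the pooled map responds at both steps. That is the kind of insensitivity to spectral shifts the IMP layer is meant to provide.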
