Frame-level SpecAugment for Deep Convolutional Neural Networks in Hybrid ASR Systems

Xinwei Li; Yuanyuan Zhang; Xiaodan Zhuang; Daben Liu

arXiv:2012.04094·cs.CL·December 9, 2020

Open Access

TL;DR

This paper introduces frame-level SpecAugment (f-SpecAugment), a data augmentation technique that applies time warping, frequency masking, and time masking to each convolution window independently during training. It improves deep CNN acoustic models in hybrid ASR systems, reducing WER by 0.5-4.5% relative across ASR tasks.

Contribution

The paper proposes a novel frame-level application of SpecAugment for deep CNN hybrid ASR models, demonstrating its effectiveness over utterance-level augmentation.

Findings

01. f-SpecAugment reduces WER by up to 4.5% relative.

02. It remains effective with large-scale training data (up to 25,000 hours).

03. f-SpecAugment's benefits are comparable to doubling training data size.
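
The findings quote relative, not absolute, WER reductions. As a minimal sketch of that arithmetic, the helper below computes a relative reduction as a percentage of the baseline WER. The function name and the example numbers are hypothetical illustrations, not figures from the paper.

```python
def relative_wer_reduction(baseline_wer, augmented_wer):
    """Return the WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - augmented_wer) / baseline_wer


# Hypothetical numbers, not results from the paper: a baseline WER of 10.0%
# falling to 9.55% is a 0.45-point absolute drop, or about 4.5% relative.
print(relative_wer_reduction(10.0, 9.55))
```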

Abstract

Inspired by SpecAugment, a data augmentation method for end-to-end ASR systems, we propose a frame-level SpecAugment method (f-SpecAugment) to improve the performance of deep convolutional neural networks (CNNs) for hybrid HMM-based ASR systems. Like utterance-level SpecAugment, f-SpecAugment performs three transformations: time warping, frequency masking, and time masking. Instead of applying the transformations at the utterance level, f-SpecAugment applies them to each convolution window independently during training. We demonstrate that f-SpecAugment is more effective than utterance-level SpecAugment for deep CNN-based hybrid models. We evaluate the proposed f-SpecAugment on 50-layer Self-Normalizing Deep CNN (SNDCNN) acoustic models trained with up to 25,000 hours of training data. We observe that f-SpecAugment reduces WER by 0.5-4.5% relative across different ASR tasks…
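
The per-window idea described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The function names, the context size, the mask widths, the zero mask value, and the edge padding are all hypothetical choices. Time warping is omitted for brevity, so only frequency and time masking are shown.

```python
import numpy as np


def mask_window(window, rng, max_freq_mask=5, max_time_mask=5):
    """Apply one frequency mask and one time mask to a (frames, bins) window.

    Mask widths are drawn uniformly from 0..max (assumed values) and the
    masked region is set to zero (an assumption about the mask value).
    """
    out = window.copy()
    n_frames, n_bins = out.shape

    # Frequency mask: zero a random band of consecutive frequency bins.
    f = rng.integers(0, min(max_freq_mask, n_bins) + 1)
    f0 = rng.integers(0, n_bins - f + 1)
    out[:, f0:f0 + f] = 0.0

    # Time mask: zero a random run of consecutive frames.
    t = rng.integers(0, min(max_time_mask, n_frames) + 1)
    t0 = rng.integers(0, n_frames - t + 1)
    out[t0:t0 + t, :] = 0.0
    return out


def f_specaugment(features, context=20, rng=None,
                  max_freq_mask=5, max_time_mask=5):
    """Cut one window per frame and mask each window independently.

    features: array of shape (T, F), e.g. log-mel filterbank frames.
    Returns an array of shape (T, 2 * context + 1, F).
    """
    rng = np.random.default_rng() if rng is None else rng
    width = 2 * context + 1
    # Repeat edge frames so the first and last frames get full windows.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = [
        mask_window(padded[t:t + width], rng, max_freq_mask, max_time_mask)
        for t in range(features.shape[0])
    ]
    return np.stack(windows)


if __name__ == "__main__":
    feats = np.random.default_rng(0).standard_normal((100, 40))
    augmented = f_specaugment(feats, context=20)
    print(augmented.shape)
```

Because each window draws its own masks, two overlapping windows see different corruptions of the same underlying frames, which is the contrast with drawing one set of masks per utterance.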

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution