Frame-level SpecAugment for Deep Convolutional Neural Networks in Hybrid ASR Systems
Xinwei Li, Yuanyuan Zhang, Xiaodan Zhuang, Daben Liu

TL;DR
This paper introduces frame-level SpecAugment (f-SpecAugment), a data augmentation technique applied independently to each convolution window during training, to improve deep CNN acoustic models in hybrid ASR systems. It reports relative WER reductions of 0.5-4.5% across ASR tasks.
Contribution
The paper proposes a frame-level application of SpecAugment for deep CNN hybrid ASR models and shows that it is more effective than utterance-level SpecAugment for these models.
Findings
f-SpecAugment reduces WER by up to 4.5% relative.
It remains effective with large-scale training data (up to 25,000 hours).
f-SpecAugment's benefits are comparable to doubling training data size.
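The "relative" in these figures is measured against the baseline WER, not in absolute WER points. A minimal sketch of that arithmetic follows; the function name and the numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: a baseline WER of 10.0% improved to 9.55%.
print(relative_wer_reduction(10.0, 9.55))
```

With these made-up numbers the reduction is about 4.5% relative, which corresponds to only 0.45 absolute WER points.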
Abstract
Inspired by SpecAugment, a data augmentation method for end-to-end ASR systems, we propose a frame-level SpecAugment method (f-SpecAugment) to improve the performance of deep convolutional neural networks (CNNs) for hybrid HMM-based ASR systems. Like utterance-level SpecAugment, f-SpecAugment performs three transformations: time warping, frequency masking, and time masking. Instead of applying the transformations at the utterance level, f-SpecAugment applies them to each convolution window independently during training. We demonstrate that f-SpecAugment is more effective than utterance-level SpecAugment for deep CNN-based hybrid models. We evaluate the proposed f-SpecAugment on 50-layer Self-Normalizing Deep CNN (SNDCNN) acoustic models trained with up to 25000 hours of training data. We observe that f-SpecAugment reduces WER by 0.5-4.5% relative across different ASR tasks…
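The per-window idea can be sketched as follows. This is a rough illustration and not the authors' implementation. It assumes that a "convolution window" is the block of context frames around each center frame, it covers only frequency and time masking (time warping is left out), and the function names, mask widths, context size, and zero fill value are all hypothetical choices. Plain Python lists are used to keep the sketch dependency-free.

```python
import random


def extract_windows(feats, context):
    """Cut an utterance (list of frames) into one context window per frame.

    Edge frames are repeated as padding. Every frame is copied, so masking
    one window never affects another window or the input.
    """
    num_frames = len(feats)
    windows = []
    for t in range(num_frames):
        window = [
            list(feats[min(max(t + k, 0), num_frames - 1)])
            for k in range(-context, context + 1)
        ]
        windows.append(window)
    return windows


def mask_window(window, max_f, max_t, rng, fill=0.0):
    """Apply one frequency mask and one time mask to a single window, in place."""
    n_frames, n_bins = len(window), len(window[0])
    # Frequency mask: a band of f consecutive bins starting at f0.
    f = rng.randint(0, min(max_f, n_bins))
    f0 = rng.randint(0, n_bins - f)
    for frame in window:
        for j in range(f0, f0 + f):
            frame[j] = fill
    # Time mask: tw consecutive frames starting at t0.
    tw = rng.randint(0, min(max_t, n_frames))
    t0 = rng.randint(0, n_frames - tw)
    for i in range(t0, t0 + tw):
        window[i] = [fill] * n_bins
    return window


def f_specaugment(feats, context=5, max_f=8, max_t=3, seed=None):
    """Draw fresh mask parameters for every window (frame-level masking)."""
    rng = random.Random(seed)
    return [mask_window(w, max_f, max_t, rng) for w in extract_windows(feats, context)]


# Toy utterance: 20 frames of 40 filterbank bins, all ones.
utterance = [[1.0] * 40 for _ in range(20)]
augmented = f_specaugment(utterance, seed=0)
print(len(augmented), len(augmented[0]), len(augmented[0][0]))
```

An utterance-level variant would instead draw the mask parameters once, mask the whole utterance, and only then cut it into windows, so neighboring windows would share the same masked regions.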
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Convolution
