MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition
Linghui Meng, Jin Xu, Xu Tan, Jindong Wang, Tao Qin, Bo Xu

TL;DR
MixSpeech is a mixup-based data augmentation technique for low-resource automatic speech recognition. It trains on weighted combinations of two speech features and weights the two recognition losses by the same coefficient, improving accuracy over both unaugmented baselines and SpecAugment.
Contribution
The paper presents MixSpeech, a novel data augmentation method that enhances low-resource ASR performance by applying mixup to speech features and recognition losses.
Findings
MixSpeech outperforms baseline models without augmentation.
MixSpeech surpasses SpecAugment in accuracy, with a relative PER improvement of 10.6% on TIMIT.
MixSpeech achieves a WER of 4.7% on the WSJ dataset.
Abstract
In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). MixSpeech trains an ASR model by taking a weighted combination of two different speech features (e.g., mel-spectrograms or MFCC) as the input, and recognizing both text sequences, where the two recognition losses use the same combination weight. We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer, and conduct experiments on several low-resource datasets including TIMIT, WSJ, and HKUST. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation, and outperforms a strong data augmentation method SpecAugment on these recognition tasks. Specifically, MixSpeech outperforms SpecAugment with a relative PER improvement of…
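The mixing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `mix_features` and `mix_losses`, the zero-padding of the shorter utterance, and the Beta(0.5, 0.5) sampling of the weight are all assumptions made for illustration. The abstract states only that two speech features are combined with a weight and that the two recognition losses use the same weight.

```python
import numpy as np


def mix_features(x_i, x_j, lam):
    """Weighted combination of two feature matrices of shape (frames, feature_dim)."""
    # Assumption: the shorter utterance is zero-padded to the longer one's length.
    n = max(len(x_i), len(x_j))

    def pad(x):
        return np.pad(x, ((0, n - len(x)), (0, 0)))

    return lam * pad(x_i) + (1.0 - lam) * pad(x_j)


def mix_losses(loss_i, loss_j, lam):
    """Combine the two recognition losses with the same weight used for the inputs."""
    return lam * loss_i + (1.0 - lam) * loss_j


rng = np.random.default_rng(0)
x_i = rng.standard_normal((120, 80))  # hypothetical 80-bin mel-spectrogram, 120 frames
x_j = rng.standard_normal((95, 80))   # hypothetical second utterance, 95 frames

# Assumption: the weight is drawn from a Beta distribution, as in standard mixup.
lam = rng.beta(0.5, 0.5)

x_mix = mix_features(x_i, x_j, lam)

# In training, loss_i and loss_j would be the model's recognition losses on x_mix
# against the transcripts of x_i and x_j respectively; placeholders are used here.
total_loss = mix_losses(2.0, 4.0, lam)
```

In a real training loop the model would receive `x_mix` once, and the two losses would be computed from that single forward pass against each of the two target text sequences.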
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing · Softmax · Multi-Head Attention · Adam · Mixup · Dense Connections
