MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition

Linghui Meng; Jin Xu; Xu Tan; Jindong Wang; Tao Qin; Bo Xu

arXiv:2102.12664·cs.CL·February 26, 2021·6 cites

TL;DR

MixSpeech is a mixup-based data augmentation method for low-resource automatic speech recognition. It trains on a weighted combination of two speech features and weights the two recognition losses with the same coefficient, improving accuracy over both unaugmented baselines and SpecAugment.

Contribution

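The paper presents MixSpeech, a data augmentation method that adapts mixup to ASR: the model input is a weighted combination of two speech features, and the training loss combines the recognition losses for both transcripts using the same weight. The method is applied to LAS and Transformer models on low-resource datasets.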

Findings

01. MixSpeech achieves better accuracy than the baseline models trained without data augmentation.

02. MixSpeech outperforms SpecAugment, with a relative PER improvement of 10.6% on TIMIT.

03. MixSpeech achieves a WER of 4.7% on the WSJ dataset.

Abstract

In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). MixSpeech trains an ASR model by taking a weighted combination of two different speech features (e.g., mel-spectrograms or MFCC) as the input, and recognizing both text sequences, where the two recognition losses use the same combination weight. We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer, and conduct experiments on several low-resource datasets including TIMIT, WSJ, and HKUST. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation, and outperforms a strong data augmentation method SpecAugment on these recognition tasks. Specifically, MixSpeech outperforms SpecAugment with a relative PER improvement of…
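
The mixing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `mix_features` and `mix_losses` are hypothetical, zero-padding the shorter utterance is an assumption made so the two feature matrices can be added, and drawing the weight from a Beta distribution follows standard mixup practice rather than anything stated in the text above. The loss values are placeholders standing in for what an ASR model would return.

```python
import numpy as np


def mix_features(x_i, x_j, lam):
    """Return lam * x_i + (1 - lam) * x_j for two (frames, feature_dim) arrays.

    Assumption: the shorter utterance is zero-padded along the time axis so
    both inputs have the same number of frames before they are combined.
    """
    n_frames = max(x_i.shape[0], x_j.shape[0])

    def pad(x):
        # Pad only at the end of the time axis; leave the feature axis alone.
        return np.pad(x, ((0, n_frames - x.shape[0]), (0, 0)))

    return lam * pad(x_i) + (1.0 - lam) * pad(x_j)


def mix_losses(loss_i, loss_j, lam):
    """Combine the two recognition losses with the same weight as the inputs.

    loss_i is the loss of the mixed input against the first transcript,
    loss_j the loss of the same mixed input against the second transcript.
    """
    return lam * loss_i + (1.0 - lam) * loss_j


rng = np.random.default_rng(0)

# Two stand-in "mel-spectrograms" of different lengths (80 feature bins each).
x_i = rng.standard_normal((120, 80))
x_j = rng.standard_normal((95, 80))

# Assumption: sample the weight from Beta(alpha, alpha), as in standard mixup.
lam = rng.beta(0.5, 0.5)

x_mix = mix_features(x_i, x_j, lam)

# Placeholder scalars; a real model would compute these from x_mix.
loss = mix_losses(2.3, 1.7, lam)
```

In a training loop, `x_mix` would be fed to the recognizer once, the loss would be computed against each of the two transcripts, and the two values would be combined with `mix_losses` before backpropagation.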

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing · Softmax · Multi-Head Attention · Adam · Mixup · Dense Connections