A comparison of streaming models and data augmentation methods for robust speech recognition
Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim

TL;DR
This study compares two streaming speech recognition models, MoChA and RNN-T, under several data augmentation techniques. It finds that MoChA generally achieves better recognition performance, while RNN-T has advantages in noise robustness, latency, and training stability.
Contribution
It provides a comprehensive comparison of MoChA and RNN-T models with multiple data augmentation methods for robust streaming speech recognition.
Findings
RNN-T models are more robust to noise and reverberation.
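MoChA models generally achieve better recognition performance than RNN-T models, but their training is more sensitive to factors such as the characteristics of the training set.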
RNN-T models have advantages in latency and training stability.
Abstract
In this paper, we present a comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T). We explore three recently proposed data augmentation techniques, namely, multi-conditioned training using an acoustic simulator, Vocal Tract Length Perturbation (VTLP) for speaker variability, and SpecAugment. Experimental results show that unidirectional models are in general more sensitive to noisy examples in the training set. It is observed that the final performance of the model depends on the proportion of training examples processed by data augmentation techniques. MoChA models generally perform better than RNN-T models. However, we observe that training of MoChA models seems to be more sensitive to various factors such as the characteristics of training sets and the…
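The abstract notes that final performance depends on the proportion of training examples processed by augmentation. The sketch below illustrates that idea with a simplified SpecAugment-style masking step (one frequency mask and one time mask, no time warping) applied to a configurable fraction of a batch. It is an illustration only, not the authors' pipeline: the function names `mask_spectrogram` and `augment_batch`, the mask widths, and the list-of-frames spectrogram layout are all assumptions made for this example.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Return a copy of `spec` with one frequency band and one time span zeroed.

    `spec` is a list of frames, each a list of frequency-bin values
    (a hypothetical layout chosen for this sketch).
    """
    num_frames = len(spec)
    num_bins = len(spec[0])
    out = [list(frame) for frame in spec]  # copy so the input stays untouched

    # Frequency mask: zero a band of consecutive bins in every frame.
    f_width = rng.randint(0, min(max_freq_width, num_bins))
    f_start = rng.randint(0, num_bins - f_width)
    for frame in out:
        for k in range(f_start, f_start + f_width):
            frame[k] = 0.0

    # Time mask: zero a run of consecutive frames.
    t_width = rng.randint(0, min(max_time_width, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [0.0] * num_bins

    return out


def augment_batch(batch, proportion, rng, max_freq_width=2, max_time_width=2):
    """Mask each example with probability `proportion`; pass the rest through."""
    result = []
    for spec in batch:
        if rng.random() < proportion:
            result.append(
                mask_spectrogram(spec, max_freq_width, max_time_width, rng)
            )
        else:
            result.append(spec)
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[1.0] * 8 for _ in range(10)]  # 10 frames x 8 bins, toy values
    batch = [clean] * 4
    augmented = augment_batch(batch, proportion=0.5, rng=rng)
    print(sum(a is not clean for a in augmented), "of", len(batch), "examples masked")
```

Sweeping `proportion` between 0.0 and 1.0 is one way to study the sensitivity the abstract describes; the values that work best for MoChA or RNN-T are not stated in this summary.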
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
