Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Yist Y. Lin; Tao Han; Haihua Xu; Van Tung Pham; Yerbolat Khassanov; Tze Yuang Chong; Yi He; Lu Lu; Zejun Ma

arXiv:2210.15876 · eess.AS · May 26, 2023

Open Access

TL;DR

This paper introduces a random utterance concatenation data augmentation technique to address train-test length mismatch in short-video speech recognition, significantly improving accuracy across multiple languages.

Contribution

The proposed RUC method is a novel on-the-fly augmentation that enhances long utterance recognition without harming short utterance performance in ASR.

Findings

1. Achieved 5.72% WER reduction on average across 15 languages.

2. Improved robustness to utterance length mismatch.

3. Enhanced recognition of longer spontaneous speech utterances.

Abstract

One limitation of end-to-end automatic speech recognition (ASR) frameworks is that performance degrades when train and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch issue for the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (~3 seconds on average), while our test utterances, generated from a voice activity detection front-end, are much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, the proposed RUC method significantly improves long utterance recognition without a performance drop on short ones. Overall, it achieves 5.72% word error rate reduction on…
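The abstract describes RUC only at a high level, so the following is a minimal sketch of what on-the-fly random utterance concatenation could look like, not the authors' implementation. The `Utterance` container, the `ruc_augment` function, and the `max_concat` and `prob` parameters are all hypothetical names and settings chosen for illustration. The sketch also assumes that waveforms are joined end to end and transcripts are joined with a single space.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    samples: List[float]  # waveform samples (could equally be feature frames)
    text: str             # reference transcript


def concat_utterances(utts: List[Utterance]) -> Utterance:
    """Join waveforms end to end and transcripts with a single space."""
    samples: List[float] = []
    for u in utts:
        samples.extend(u.samples)
    return Utterance(samples, " ".join(u.text for u in utts))


def ruc_augment(
    batch: List[Utterance],
    max_concat: int = 3,   # assumed cap on utterances per concatenated sample
    prob: float = 0.5,     # assumed probability of augmenting a given utterance
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Illustrative on-the-fly augmentation applied to one training batch.

    Each utterance is, with probability `prob`, extended with 1 to
    `max_concat - 1` utterances drawn at random from the same batch.
    Otherwise it is passed through unchanged, so short utterances
    remain part of the training data.
    """
    rng = rng or random.Random()
    out: List[Utterance] = []
    for utt in batch:
        if max_concat > 1 and rng.random() < prob:
            n_extra = rng.randint(1, max_concat - 1)
            extras = [rng.choice(batch) for _ in range(n_extra)]
            out.append(concat_utterances([utt] + extras))
        else:
            out.append(utt)
    return out
```

Because the concatenation happens per batch at training time, no lengthened copies of the corpus need to be stored, and each epoch sees different combinations. Keeping some utterances unmodified is one plausible way to preserve short-utterance performance, which the abstract reports as unaffected.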

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Test