SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition

Khanh Le; Tuan Vu Ho; Dung Tran; Duc Thanh Chau

arXiv:2502.14685·cs.SD·February 21, 2025

TL;DR

SegAug is a novel data augmentation method for RNN-Transducer speech recognition that reduces deletion errors and improves accuracy by diversifying training data with contextually varied audio-text pairs.

Contribution

The paper introduces SegAug, an alignment-based augmentation technique that enhances RNN-T robustness by encouraging focus on acoustic features and diversifying textual patterns.

Findings

01. Up to 12.5% relative WER reduction on LibriSpeech.

02. Deletion errors reduced by 45.4%.

03. Effective on both small-scale and large-scale datasets.
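
For readers unfamiliar with the metric, a relative WER reduction compares the drop in word error rate against the baseline's own WER. The snippet below is a small illustration only; the function name and the example WER values (8.0 and 7.0) are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates on the same scale (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    # Absolute improvement divided by the baseline, expressed in percent.
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical numbers, chosen only to illustrate the arithmetic:
# a baseline WER of 8.0 and an improved WER of 7.0.
example = relative_wer_reduction(8.0, 7.0)
print(f"{example:.1f}% relative reduction")
```

Note that a relative reduction says nothing about the absolute WER values themselves; the same percentage can come from very different baselines.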

Abstract

RNN-Transducer (RNN-T) is a widely adopted architecture in speech recognition, integrating acoustic and language modeling in an end-to-end framework. However, the RNN-T predictor tends to over-rely on consecutive word dependencies in training data, leading to high deletion error rates, particularly with less common or out-of-domain phrases. Existing solutions, such as regularization and data augmentation, often compromise other aspects of performance. We propose SegAug, an alignment-based augmentation technique that generates contextually varied audio-text pairs with low sentence-level semantics. This method encourages the model to focus more on acoustic features while diversifying the learned textual patterns of its internal language model, thereby reducing deletion errors and enhancing overall performance. Evaluations on the LibriSpeech and Tedlium-v3 datasets demonstrate a relative…
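
The abstract as shown here does not spell out which operations SegAug applies, so the following is only a rough sketch of one plausible alignment-based augmentation: cutting an utterance at word boundaries and reordering the pieces so that audio and transcript stay paired while sentence-level meaning is broken. The alignment format `(word, start_sample, end_sample)`, the function name `segment_shuffle`, and the `seg_words` parameter are all assumptions made for illustration; in practice the boundaries would come from a CTC forced aligner.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical alignment entry: (word, start_sample, end_sample).
Alignment = Tuple[str, int, int]


def segment_shuffle(
    audio: Sequence[float],
    alignment: List[Alignment],
    seg_words: int = 2,
    seed: Optional[int] = None,
) -> Tuple[List[float], str]:
    """Shuffle word-aligned segments of an utterance (illustrative sketch).

    Consecutive words are grouped into segments of `seg_words` words, the
    segments are randomly reordered, and the matching audio spans and words
    are concatenated in the new order.
    """
    rng = random.Random(seed)
    # Group consecutive aligned words into fixed-size segments.
    groups = [
        alignment[i : i + seg_words] for i in range(0, len(alignment), seg_words)
    ]
    rng.shuffle(groups)

    new_audio: List[float] = []
    new_words: List[str] = []
    for group in groups:
        start = group[0][1]   # start sample of the segment's first word
        end = group[-1][2]    # end sample of the segment's last word
        new_audio.extend(audio[start:end])
        new_words.extend(word for word, _, _ in group)
    return new_audio, " ".join(new_words)


# Toy usage: eight "samples" and four aligned words.
toy_audio = [float(i) for i in range(8)]
toy_alignment = [("a", 0, 2), ("b", 2, 4), ("c", 4, 6), ("d", 6, 8)]
aug_audio, aug_text = segment_shuffle(toy_audio, toy_alignment, seg_words=2, seed=0)
```

Because word order inside each segment is preserved, local acoustic-text correspondence stays intact while the longer-range word sequence changes, which is the general effect the abstract describes. Audio outside the aligned spans (for example leading or trailing silence) is dropped in this sketch.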

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques