Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR
Ruizhi Li, Gregory Sell, Hynek Hermansky

TL;DR
This paper enhances multi-stream end-to-end ASR robustness by introducing a two-stage augmentation scheme and adaptive CTC fusion, significantly reducing word error rates across unseen stream combinations under varied acoustic conditions.
Contribution
It proposes a novel two-stage augmentation and adaptive CTC fusion method to improve robustness of multi-stream ASR systems against environmental distortions and unseen conditions.
Findings
Achieved relative word error rate (WER) reductions of 29.7-59.3% on the DIRHA and AMI datasets.
Demonstrated effectiveness of augmentation and adaptive fusion in handling unseen stream combinations.
Improved robustness against background noise and reverberations.
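For readers unfamiliar with the metric, a relative WER reduction is the drop in WER expressed as a percentage of the baseline WER. The sketch below uses the standard definition; the function name and the numbers in the usage line are hypothetical illustrations, not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline.

    Both arguments are WERs in the same unit (e.g. percent).
    The name and signature are illustrative, not taken from the paper.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline WER of 20.0% improved to 10.0%
# is a 50.0% relative reduction (and a 10.0-point absolute reduction).
example = relative_wer_reduction(20.0, 10.0)
```

A relative figure says nothing about the absolute WER, so the same percentage can correspond to very different absolute gains on different test sets.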
Abstract
Performance of an Automatic Speech Recognition (ASR) system commonly degrades when the test acoustic condition differs from the training condition. Hence, it is essential to make ASR systems robust against various environmental distortions, such as background noise and reverberation. In a multi-stream paradigm, improving robustness requires handling a variety of unseen single-stream conditions and inter-stream dynamics. Previously, a practical two-stage training strategy was proposed within multi-stream end-to-end ASR, where Stage-2 formulates the multi-stream model with features from the Stage-1 Universal Feature Extractor (UFE). In this paper, as an extension, we introduce a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1 Augmentation aims to address single-stream input varieties with data augmentation techniques; Stage-2 Time Masking applies temporal…
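The abstract is cut off before it describes the Stage-2 masking in detail, so the following is only a generic time-masking sketch in the style commonly used for speech augmentation, not the authors' scheme. It assumes each stream is a list of frames (each frame a list of floats); the function names, the mask width, the number of masks, and the zero fill value are all assumptions made for illustration.

```python
import random


def time_mask(features, max_width, num_masks=1, mask_value=0.0, rng=None):
    """Return a copy of `features` with contiguous runs of frames masked.

    features  : list of frames, each frame a list of floats (assumed layout)
    max_width : largest number of consecutive frames one mask may cover
    num_masks : how many masks to draw (assumption; may overlap)
    mask_value: fill value for masked frames (assumption)
    """
    rng = rng or random.Random()
    num_frames = len(features)
    masked = [list(frame) for frame in features]  # leave the input untouched
    for _ in range(num_masks):
        # Width is drawn uniformly from 0..max_width, capped by sequence length.
        width = rng.randint(0, min(max_width, num_frames))
        if width == 0:
            continue
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            masked[t] = [mask_value] * len(masked[t])
    return masked


def mask_streams(streams, max_width, seed=None):
    """Apply an independent time mask to every stream (hypothetical helper)."""
    rng = random.Random(seed)
    return [time_mask(stream, max_width, rng=rng) for stream in streams]
```

Masking each stream independently means that, at a given time step, some streams carry information while others do not, which is one plausible way to expose a fusion model to varied stream combinations during training.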
