Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR

Ruizhi Li; Gregory Sell; Hynek Hermansky

arXiv:2102.03055 · cs.SD · February 8, 2021

TL;DR

This paper enhances multi-stream end-to-end ASR robustness by introducing a two-stage augmentation scheme and adaptive CTC fusion, significantly reducing word error rates across unseen stream combinations under varied acoustic conditions.

Contribution

It proposes a novel two-stage augmentation and adaptive CTC fusion method to improve robustness of multi-stream ASR systems against environmental distortions and unseen conditions.

Findings

01. Achieved 29.7-59.3% relative WER reduction on DIRHA and AMI datasets.

02. Demonstrated effectiveness of augmentation and adaptive fusion in handling unseen stream combinations.

03. Improved robustness against background noise and reverberations.
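
For readers unfamiliar with the metric in the first finding, a relative WER reduction expresses the improvement as a percentage of the baseline error rate rather than as an absolute difference. The sketch below shows the arithmetic; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical illustration only: a baseline WER of 40.0% lowered to 28.0%
# is a 12-point absolute reduction and a 30% relative reduction.
example = relative_wer_reduction(40.0, 28.0)
print(f"{example:.1f}% relative WER reduction")
```

Reporting the relative figure makes improvements comparable across test conditions whose baseline error rates differ widely.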

Abstract

Performance degradation of an Automatic Speech Recognition (ASR) system is commonly observed when the test acoustic condition differs from the training condition. Hence, it is essential to make ASR systems robust against various environmental distortions, such as background noise and reverberation. In a multi-stream paradigm, improving robustness requires handling both a variety of unseen single-stream conditions and inter-stream dynamics. Previously, a practical two-stage training strategy was proposed within multi-stream end-to-end ASR, where Stage-2 formulates the multi-stream model with features from the Stage-1 Universal Feature Extractor (UFE). In this paper, as an extension, we introduce a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1 Augmentation aims to address single-stream input varieties with data augmentation techniques; Stage-2 Time Masking applies temporal…
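
The abstract is cut off before it describes how the Stage-2 temporal masks are configured, so the following is only a generic sketch of time masking on a sequence of feature frames, in the style commonly used for speech augmentation. The function name, the zero fill value, and the `max_width` and `num_masks` parameters are assumptions made for illustration and should not be read as the authors' settings.

```python
import random


def time_mask(features, max_width, num_masks=1, rng=None):
    """Return a copy of `features` with random spans of frames set to zero.

    `features` is a list of frames, each frame a list of floats.
    Mask widths are drawn uniformly from 0..max_width (an assumed policy).
    """
    rng = rng or random.Random()
    masked = [list(frame) for frame in features]  # leave the input untouched
    num_frames = len(masked)
    if num_frames == 0:
        return masked
    for _ in range(num_masks):
        width = rng.randint(0, min(max_width, num_frames))
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            masked[t] = [0.0] * len(masked[t])
    return masked


# Usage: mask up to two spans of at most 3 frames in a 10-frame sequence.
frames = [[1.0, 2.0] for _ in range(10)]
augmented = time_mask(frames, max_width=3, num_masks=2, rng=random.Random(0))
```

In a multi-stream setting, one could apply such a function independently to each stream's feature sequence, but whether and how the paper does so is not recoverable from the truncated text.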
