Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training

Gege Qi; Yuefeng Chen; Xiaofeng Mao; Xiaojun Jia; Ranjie Duan; Rong Zhang; Hui Xue

arXiv:2307.12498·cs.SD·July 25, 2023

TL;DR

This paper introduces WavAugment Guided Phoneme Adversarial Training (wapat), a novel method that enhances the robustness of automatic speech recognition models against perturbations and domain shifts, achieving state-of-the-art results.

Contribution

The paper proposes a new adversarial training approach using phoneme space augmentation and guidance, improving ASR robustness and generalization over previous methods.

Findings

01. Wapat reduces WER by 6.28% on the ESB benchmark.

02. Wapat improves model robustness to small perturbations.

03. Wapat achieves new state-of-the-art performance on ESB.
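
For readers unfamiliar with the metric behind the findings, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the paper's evaluation code; the function name and the example sentences are made up here. The summary does not say whether the 6.28% figure is an absolute or a relative reduction, so the sketch makes no assumption about that.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution and one deletion over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```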

Abstract

Developing a practically robust automatic speech recognition (ASR) model is challenging, since the model should not only maintain its original performance on clean samples but also achieve consistent efficacy under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). wapat uses adversarial examples in phoneme space as augmentation to make the model invariant to minor fluctuations in phoneme representation while preserving performance on clean samples. In addition, wapat utilizes the phoneme representation of augmented samples to guide the generation of adversaries, which helps to find more stable and diverse gradient directions, resulting in improved generalization. Extensive experiments demonstrate the effectiveness of wapat on the End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat…
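
The abstract describes two ingredients: a bounded adversarial perturbation applied in phoneme space, and guidance from the phoneme representation of an augmented sample. The toy sketch below shows one plausible reading of that idea and is not the authors' implementation. Everything in it is an assumption made for illustration: a linear softmax classifier stands in for the ASR model, plain vectors stand in for phoneme representations, the guidance term is a squared distance to the augmented representation, the update is a signed-gradient (PGD-style) step, and the names `guided_phoneme_adversary`, `eps`, `step`, `n_steps`, and `lam` are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def loss_grad_wrt_input(W, x, y):
    """Gradient of cross-entropy(softmax(W @ x), y) with respect to x."""
    p = softmax(W @ x)
    p[y] -= 1.0
    return W.T @ p


def guided_phoneme_adversary(W, phon_clean, phon_aug, y,
                             eps=0.1, step=0.02, n_steps=5, lam=0.5):
    """Perturb a (toy) phoneme representation inside an L-infinity ball.

    Assumed objective: increase the task loss while staying close to the
    representation of an augmented sample (the "guidance").
    """
    delta = np.zeros_like(phon_clean)
    for _ in range(n_steps):
        adv = phon_clean + delta
        g_task = loss_grad_wrt_input(W, adv, y)  # ascent direction for the loss
        g_guide = phon_aug - adv                 # descent direction for 0.5 * ||adv - aug||^2
        direction = g_task + lam * g_guide
        # Signed step, then projection back onto the eps-ball around the clean input.
        delta = np.clip(delta + step * np.sign(direction), -eps, eps)
    return phon_clean + delta


# Toy data: 4 classes, 8-dimensional "phoneme" vectors (all values are synthetic).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
phon_clean = rng.normal(size=8)
phon_aug = phon_clean + 0.05 * rng.normal(size=8)  # stand-in for an augmented sample
phon_adv = guided_phoneme_adversary(W, phon_clean, phon_aug, y=1)
print("max perturbation:", np.max(np.abs(phon_adv - phon_clean)))
```

In a training loop along these lines, the perturbed representation would be fed to the model as an extra training example alongside the clean one, which is the sense in which the abstract calls adversarial examples an augmentation.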

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing