Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training
Gege Qi, Yuefeng Chen, Xiaofeng Mao, Xiaojun Jia, Ranjie Duan, Rong Zhang, Hui Xue

TL;DR
This paper introduces WavAugment Guided Phoneme Adversarial Training (wapat), an adversarial training method that makes automatic speech recognition models more robust to perturbations and domain shifts, with state-of-the-art results on the ESB benchmark.
Contribution
The paper proposes an adversarial training approach that uses adversarial examples in phoneme space as augmentation and guides their generation with the phoneme representations of augmented samples, improving ASR robustness and generalization over previous methods.
Findings
Wapat reduces WER by 6.28% on the ESB benchmark.
Wapat improves model robustness to small volume perturbations.
Achieves new state-of-the-art performance on ESB.
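The findings above are stated in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration, not the ESB scoring pipeline, and the summary does not say whether the 6.28% figure is an absolute or a relative reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sat on mat" against the reference "the cat sat on the mat" counts one deletion over six reference words.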
Abstract
Developing a practically robust automatic speech recognition (ASR) model is challenging, since the model should not only maintain its original performance on clean samples but also achieve consistent efficacy under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). wapat uses adversarial examples in phoneme space as augmentation to make the model invariant to minor fluctuations in phoneme representation and to preserve performance on clean samples. In addition, wapat utilizes the phoneme representation of augmented samples to guide the generation of adversaries, which helps to find more stable and diverse gradient directions, resulting in improved generalization. Extensive experiments demonstrate the effectiveness of wapat on the End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat…
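To make the idea of a bounded adversarial perturbation in phoneme space more concrete, here is a toy sketch. It is not the authors' algorithm: the abstract does not give the objective, so the code assumes a PGD-style signed-gradient search whose guidance signal is the squared distance between the perturbed clean representation and the representation of an augmented sample. All names (`guided_perturbation`, `eps`, `step`, `iters`) are hypothetical, and plain Python lists stand in for real phoneme embeddings.

```python
def guided_perturbation(phoneme, augmented, eps=0.1, step=0.05, iters=3):
    """PGD-style search for a bounded perturbation of a phoneme vector.

    Assumption (not from the paper): the guidance objective is the squared
    distance between the perturbed clean representation and the
    augmented-sample representation, and the search ascends it.
    """
    delta = [0.0] * len(phoneme)
    for _ in range(iters):
        # Gradient of sum((p + d - a)^2) with respect to d is 2 * (p + d - a).
        grad = [2.0 * (p + d - a)
                for p, d, a in zip(phoneme, delta, augmented)]
        # Signed ascent step, then projection back into the L-infinity
        # ball of radius eps so the fluctuation stays minor.
        delta = [
            max(-eps, min(eps, d + step * ((g > 0) - (g < 0))))
            for d, g in zip(delta, grad)
        ]
    return [p + d for p, d in zip(phoneme, delta)]


# Toy usage: a 3-dimensional "phoneme representation" and its augmented view.
clean = [0.5, -0.2, 0.0]
augmented_view = [0.4, 0.1, 0.0]
adversarial = guided_perturbation(clean, augmented_view)
```

In an actual training loop, the adversarial representation would be fed to the model alongside the clean one, and the gradient would come from the model's loss rather than from this closed-form distance; the projection step is what keeps the perturbation within the chosen budget.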
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
