Enhancing and Adversarial: Improve ASR with Speaker Labels

Wei Zhou; Haotian Wu; Jingjing Xu; Mohammad Zeineldeen; Christoph Lüscher; Ralf Schlüter; Hermann Ney

arXiv:2211.06369·eess.AS·October 19, 2023

TL;DR

This paper explores the use of speaker labels in multi-task learning to improve conformer-based automatic speech recognition. It proposes a novel adaptive gradient reversal layer for stable adversarial training and analyzes where in the network speaker enhancing and adversarial training are best applied.

Contribution

The paper introduces an adaptive gradient reversal layer for adversarial training and systematically studies how to apply speaker labels toward domain-aware and domain-agnostic ASR models.

Findings

01

Achieved 7% relative improvement on Switchboard Hub5'00.

02

Optimal placement of speaker-enhancing and adversarial training layers identified.

03

Combining speaker enhancing and adversarial training achieves the same performance as i-vectors plus adversarial training.

Abstract

ASR can be improved by multi-task learning (MTL) with domain enhancing or domain adversarial training, which are two opposite objectives with the aim to increase/decrease domain variance towards domain-aware/agnostic ASR, respectively. In this work, we study how to best apply these two opposite objectives with speaker labels to improve conformer-based ASR. We also propose a novel adaptive gradient reversal layer for stable and effective adversarial training without tuning effort. Detailed analysis and experimental verification are conducted to show the optimal positions in the ASR neural network (NN) to apply speaker enhancing and adversarial training. We also explore their combination for further improvement, achieving the same performance as i-vectors plus adversarial training. Our best speaker-based MTL achieves 7% relative improvement on the Switchboard Hub5'00 set. We also…
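
To make the adversarial objective in the abstract more concrete, the following is a minimal sketch of a standard, fixed-strength gradient reversal layer. It is not the adaptive layer the paper proposes, whose rule is not described on this page. The class name `GradientReversal`, the parameter `lam`, and the helper `combined_encoder_grad` are hypothetical names chosen for illustration, and gradients are represented as plain Python lists rather than tensors.

```python
from dataclasses import dataclass


@dataclass
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    # Hypothetical fixed reversal strength. The paper's adaptive rule is NOT shown here.
    lam: float = 1.0

    def forward(self, features):
        # Features pass through unchanged on their way to the speaker classifier.
        return list(features)

    def backward(self, grad_from_classifier):
        # The shared encoder receives the negated, scaled gradient, so an update
        # that would lower the speaker loss instead pushes the encoder to raise it.
        return [-self.lam * g for g in grad_from_classifier]


def combined_encoder_grad(asr_grad, speaker_grad, grl):
    """Gradient reaching the shared encoder: ASR gradient plus reversed speaker gradient."""
    reversed_grad = grl.backward(speaker_grad)
    return [a + r for a, r in zip(asr_grad, reversed_grad)]


# Toy values, made up purely for illustration.
grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
out = grl.forward(features)
enc_grad = combined_encoder_grad([1.0, 0.0, -2.0], [0.4, -0.2, 1.0], grl)
```

In this sketch the speaker classifier itself would still be trained with the unreversed gradient, so it keeps trying to identify the speaker while the encoder is pushed toward speaker-agnostic features. Under the same assumptions, speaker enhancing training would correspond to passing the speaker gradient through with its sign unchanged.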

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing