Speaker-Invariant Training via Adversarial Learning
Zhong Meng, Jinyu Li, Zhuo Chen, Yong Zhao, Vadim Mazalov, Yifan Gong, Biing-Hwang (Fred) Juang

TL;DR
This paper introduces a speaker-invariant training method using adversarial multi-task learning to improve speech recognition accuracy by reducing speaker variability without explicit speaker normalization.
Contribution
The novel adversarial training scheme (SIT) learns speaker-invariant features for DNN acoustic models, enhancing ASR performance without relying on speaker-specific transformations.
Findings
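Achieved a 4.99% relative word error rate (WER) reduction over the conventional speaker-independent (SI) acoustic model on the CHiME-3 dataset.
With additional unsupervised speaker adaptation, the speaker-adapted SIT model achieved a 4.86% relative WER gain over the speaker-adapted SI model.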
Demonstrated effectiveness of adversarial multi-task learning in speaker invariance.
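For readers unfamiliar with how a relative WER figure is computed, the short sketch below shows the usual arithmetic. The function name and the example WER values are hypothetical and chosen only for illustration; they are not the absolute WERs reported in the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: how much of the baseline error was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, NOT taken from the paper:
# a baseline at 20.0% WER and an improved system at 19.0% WER.
example = relative_wer_reduction(20.0, 19.0)
print(f"relative WER reduction: {example:.2f}%")
```

A relative reduction is always smaller in absolute terms than it sounds: in this made-up case a one-point absolute drop corresponds to a five percent relative gain.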
Abstract
We propose a novel adversarial multi-task learning scheme, aiming at actively curtailing the inter-talker feature variability while maximizing its senone discriminability so as to enhance the performance of a deep neural network (DNN) based ASR system. We call the scheme speaker-invariant training (SIT). In SIT, a DNN acoustic model and a speaker classifier network are jointly optimized to minimize the senone (tied triphone state) classification loss, and simultaneously mini-maximize the speaker classification loss. A speaker-invariant and senone-discriminative deep feature is learned through this adversarial multi-task learning. With SIT, a canonical DNN acoustic model with significantly reduced variance in its output probabilities is learned with no explicit speaker-independent (SI) transformations or speaker-specific representations used in training or testing. Evaluated on the…
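The adversarial multi-task update described in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the scalar "networks", the binary senone and speaker labels, the parameter names (`w_f`, `w_y`, `w_s`), and the trade-off weight `lam` are all assumptions made for illustration. The only point is the sign pattern: the senone and speaker classifiers each descend their own loss, while the shared feature extractor descends the senone loss and ascends the speaker loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, target):
    # Binary cross-entropy for one example; eps guards against log(0).
    eps = 1e-12
    return -(target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps))


def sit_step(params, x, senone, speaker, lr=0.1, lam=0.5):
    """One toy adversarial multi-task step on a single (x, senone, speaker) example.

    params: dict with scalar weights (all hypothetical stand-ins for networks)
      w_f: feature extractor, w_y: senone classifier, w_s: speaker classifier.
    lam: assumed weight on the adversarial speaker term.
    """
    w_f, w_y, w_s = params["w_f"], params["w_y"], params["w_s"]

    # Forward pass: shared deep feature, then two classification heads.
    f = w_f * x
    p_senone = sigmoid(w_y * f)
    p_speaker = sigmoid(w_s * f)

    # For sigmoid + cross-entropy, d(loss)/d(logit) = prediction - target.
    d_senone = p_senone - senone
    d_speaker = p_speaker - speaker

    # Each head minimizes its own loss.
    g_w_y = d_senone * f
    g_w_s = d_speaker * f

    # The feature extractor minimizes the senone loss but maximizes the
    # speaker loss, so the speaker gradient enters with a reversed sign.
    g_f_senone = d_senone * w_y * x
    g_f_speaker = d_speaker * w_s * x
    g_w_f = g_f_senone - lam * g_f_speaker

    new_params = {
        "w_f": w_f - lr * g_w_f,
        "w_y": w_y - lr * g_w_y,
        "w_s": w_s - lr * g_w_s,
    }
    losses = {"senone": bce(p_senone, senone), "speaker": bce(p_speaker, speaker)}
    return new_params, losses


if __name__ == "__main__":
    start = {"w_f": 1.0, "w_y": 0.5, "w_s": -0.3}
    updated, losses = sit_step(start, x=2.0, senone=1, speaker=0)
    print(updated, losses)
```

Setting `lam=0` in this sketch reduces the feature-extractor update to ordinary senone training, which makes it easy to see that the adversarial term changes only `w_f` and leaves the two classifier updates untouched.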
