Speaker Adaptation for Attention-Based End-to-End Speech Recognition
Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

TL;DR
This paper introduces three regularization-based speaker adaptation methods for attention-based end-to-end speech recognition that significantly reduce word error rate with limited adaptation data.
Contribution
It proposes three regularization techniques for speaker adaptation of AED models (KLD regularization, adversarial learning, and multi-task learning) that improve recognition accuracy with minimal adaptation data.
Findings
Achieved up to a 12.2% WER reduction on a Microsoft dictation task.
Effective adaptation with limited data, both supervised and unsupervised.
All three methods outperform baseline speaker-independent models.
Abstract
We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of a speaker-dependent (SD) AED is forced to be close to that of the speaker-independent (SI) model by adding a KLD regularization to the adaptation criterion. To compensate for the asymmetric deficiency in KLD regularization, an adversarial speaker adaptation (ASA) method is proposed to regularize the deep-feature distribution of the SD AED through the adversarial learning of an auxiliary discriminator and the SD AED. The third approach is the multi-task learning, in which an SD AED is trained to jointly perform the primary task of predicting a large number of…
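As a rough illustration of the KLD regularization idea described in the abstract, the sketch below uses the standard formulation in which the one-hot training target is interpolated with the SI model's output distribution. It is not the authors' implementation: the function names, the `rho` weight, and the `(T, V)` array shapes (T decoding steps, V output tokens) are assumptions made for this example.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kld_regularized_loss(sd_logits, si_logits, targets, rho=0.5):
    """Cross-entropy of the SD model against interpolated targets.

    sd_logits, si_logits: (T, V) arrays of per-step token logits (assumed shapes).
    targets: (T,) integer token labels for the adaptation utterance.
    rho: hypothetical regularization weight in [0, 1]; rho=0 is plain
         cross-entropy, rho=1 only pulls the SD output toward the SI output.
    """
    sd_logp = log_softmax(sd_logits)
    si_p = np.exp(log_softmax(si_logits))          # SI posteriors, held fixed
    one_hot = np.eye(sd_logits.shape[-1])[targets]
    soft_targets = (1.0 - rho) * one_hot + rho * si_p
    return float(-(soft_targets * sd_logp).sum(axis=-1).mean())
```

Up to a term that does not depend on the SD parameters, minimizing this loss is the same as minimizing the cross-entropy plus `rho` times the KL divergence from the SI distribution to the SD distribution, which is how the regularizer keeps the adapted model close to the SI model when adaptation data is scarce.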
