Embeddings for DNN speaker adaptive training
Joanna Rownicka, Peter Bell, Steve Renals

TL;DR
This paper explores embedding-based speaker adaptation for DNNs in speech recognition, comparing embedding types (i-vectors, x-vectors and deep CNN embeddings) and adaptation strategies. The best models achieve a 4-9% relative WER reduction over baselines.
Contribution
It introduces a simplified adaptation approach using a single linear layer on embeddings and evaluates various embeddings for effective speaker adaptation in DNN-based speech recognition.
Findings
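With a good training strategy, a multi-layer adaptation network applied to all hidden layers is no more effective than a single linear layer acting on the embeddings to transform the input features.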
Embedding quality for speaker recognition does not directly correlate with ASR performance.
Best models achieved 4-9% relative WER reduction over baselines.
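For readers unfamiliar with the metric, relative WER reduction is the drop in word error rate expressed as a percentage of the baseline WER. The sketch below shows the arithmetic; the function name and the example WER values are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline at 20.0% WER and an adapted model at 19.0% WER.
reduction = relative_wer_reduction(20.0, 19.0)
print(f"{reduction:.1f}% relative WER reduction")
```

Under this definition, a 1-point absolute drop from a 20% baseline counts as a 5% relative reduction, which is the scale on which the 4-9% figure is reported.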
Abstract
In this work, we investigate the use of embeddings for speaker-adaptive training of DNNs (DNN-SAT) focusing on a small amount of adaptation data per speaker. DNN-SAT can be viewed as learning a mapping from each embedding to transformation parameters that are applied to the shared parameters of the DNN. We investigate different approaches to applying these transformations, and find that with a good training strategy, a multi-layer adaptation network applied to all hidden layers is no more effective than a single linear layer acting on the embeddings to transform the input features. In the second part of our work, we evaluate different embeddings (i-vectors, x-vectors and deep CNN embeddings) in an additional speaker recognition task in order to gain insight into what should characterize an embedding for DNN-SAT. We find the performance for speaker recognition of a given representation…
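The idea of a single linear layer acting on the embedding to transform the input features can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the transformation is an additive, per-speaker shift of every feature frame, and the names `linear`, `adapt_features`, `weight` and `bias` as well as the toy dimensions are hypothetical. The abstract does not specify the exact form of the transformation.

```python
def linear(embedding, weight, bias):
    """Compute weight @ embedding + bias using plain Python lists."""
    return [
        sum(w * e for w, e in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]


def adapt_features(frames, embedding, weight, bias):
    """Shift every feature frame by a linear function of the speaker embedding.

    Assumption: the adaptation is an additive shift. The same shift is
    applied to all frames of one speaker, since the embedding is fixed
    per speaker.
    """
    shift = linear(embedding, weight, bias)
    return [[x + s for x, s in zip(frame, shift)] for frame in frames]


# Toy example: 2 frames of 2-dimensional features, 3-dimensional embedding.
frames = [[1.0, 2.0], [3.0, 4.0]]
embedding = [1.0, 0.0, -1.0]
weight = [[0.5, 0.0, 0.0], [0.0, 1.0, 1.0]]  # shape: (feat_dim, emb_dim)
bias = [0.0, 0.5]

adapted = adapt_features(frames, embedding, weight, bias)
print(adapted)
```

In speaker-adaptive training, `weight` and `bias` would be learned jointly with the shared DNN parameters, so that the embedding is mapped to transformation parameters as the abstract describes; here they are fixed by hand to keep the sketch self-contained.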
Taxonomy
Methods: Linear Layer
