Improving speaker turn embedding by crossmodal transfer learning from face embedding
Nam Le, Jean-Marc Odobez

TL;DR
This paper introduces three transfer learning methods that leverage face embeddings to improve speaker turn embeddings, especially for short utterances, by exploiting shared latent properties between face and voice data.
Contribution
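The paper proposes three transfer learning approaches (target embedding transfer, relative distance transfer, and clustering structure transfer) that carry the structure of a face embedding space over to speaker turn embeddings, with the aim of improving speaker verification and clustering.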
Findings
Significant improvement in speaker verification accuracy.
Enhanced clustering of speaker turns with short utterances.
Insights into the shared properties of face and voice embeddings.
Abstract
Learning speaker turn embeddings has shown considerable improvement in situations where conventional speaker modeling approaches fail. However, this improvement is relatively limited when compared to the gain observed in face embedding learning, which has proven very successful for face verification and clustering tasks. Assuming that faces and voices from the same identities share some latent properties (like age, gender, ethnicity), we propose three transfer learning approaches to leverage the knowledge from the face domain (learned from thousands of images and identities) for tasks in the speaker domain. These approaches, namely target embedding transfer, relative distance transfer, and clustering structure transfer, utilize the structure of the source face embedding space at different granularities to regularize the target speaker turn embedding space, acting as additional terms in the optimization. Our…
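To make the idea of regularizing the speaker embedding space with a face embedding space more concrete, here is a minimal NumPy sketch of what a target-embedding-style regularizer could look like. It is an illustration under stated assumptions, not the authors' formulation: it assumes the base objective is a triplet loss on speaker turn embeddings, that face and speaker embeddings share the same dimensionality, and that the transfer term is a squared Euclidean distance between a speaker embedding and the face embedding of the same identity. The function names, the `weight` parameter, and the margin value are all hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances (assumed base loss)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

def target_transfer_penalty(speaker_emb, face_emb):
    """Mean squared distance between each speaker embedding and the
    face embedding of the same identity (hypothetical transfer term)."""
    return float(np.mean(np.sum((speaker_emb - face_emb) ** 2, axis=-1)))

def regularized_loss(anchor, positive, negative, face_of_anchor,
                     weight=0.1, margin=0.2):
    """Base triplet loss plus a weighted face-to-speaker transfer penalty.
    `weight` is an assumed hyperparameter, not a value from the paper."""
    base = triplet_loss(anchor, positive, negative, margin)
    penalty = target_transfer_penalty(anchor, face_of_anchor)
    return base + weight * penalty

# Toy batch of one triplet in a 2-D embedding space.
anchor = np.array([[1.0, 0.0]])
positive = np.array([[1.0, 0.0]])
negative = np.array([[0.0, 1.0]])
face_same_place = np.array([[1.0, 0.0]])   # face embedding coincides with the anchor
face_elsewhere = np.array([[0.0, 1.0]])    # face embedding far from the anchor

loss_aligned = regularized_loss(anchor, positive, negative, face_same_place)
loss_misaligned = regularized_loss(anchor, positive, negative, face_elsewhere)
```

In this toy case the triplet is already satisfied, so the total loss is driven only by the transfer term: it vanishes when the speaker embedding sits on the face embedding and grows as the two move apart. The relative distance and clustering structure variants named in the abstract would constrain the speaker space more loosely and are not sketched here.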
