Improving speaker turn embedding by crossmodal transfer learning from face embedding

Nam Le; Jean-Marc Odobez

arXiv:1707.02749·cs.CV·July 11, 2017

TL;DR

This paper introduces three transfer learning methods that leverage face embeddings to improve speaker turn embeddings, especially for short utterances, by exploiting shared latent properties between face and voice data.

Contribution

The paper proposes novel transfer learning approaches from face to speaker embeddings, enhancing speaker verification and clustering performance.

Findings

01 Significant improvement in speaker verification accuracy.

02 Enhanced clustering of speaker turns with short utterances.

03 Insights into the shared properties of face and voice embeddings.

Abstract

Learning speaker turn embeddings has shown considerable improvement in situations where conventional speaker modeling approaches fail. However, this improvement is relatively limited compared to the gain observed in face embedding learning, which has proven very successful for face verification and clustering tasks. Assuming that faces and voices from the same identities share some latent properties (such as age, gender, ethnicity), we propose three transfer learning approaches to leverage the knowledge from the face domain (learned from thousands of images and identities) for tasks in the speaker domain. These approaches, namely target embedding transfer, relative distance transfer, and clustering structure transfer, use the structure of the source face embedding space at different granularities to regularize the target speaker turn embedding space through additional terms in the optimization objective. Our…
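
To make the idea of regularizing the speaker embedding space with the face embedding space concrete, here is a minimal sketch of what a target-embedding-style penalty might look like. It is an illustration under assumptions, not the authors' formulation: the function names, the L2 normalization, the squared Euclidean distance, and the `weight` parameter are all hypothetical choices, and the speaker-domain loss is passed in as a plain number because the abstract does not specify it. Relative distance transfer and clustering structure transfer are not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, not from the paper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def target_transfer_penalty(speaker_emb, face_emb):
    """Hypothetical penalty: pull a speaker turn embedding toward the
    face embedding of the same identity (both normalized first)."""
    return squared_distance(l2_normalize(speaker_emb), l2_normalize(face_emb))


def regularized_loss(task_loss, speaker_emb, face_emb, weight=0.5):
    """Speaker-domain loss plus the weighted transfer penalty.
    `task_loss` and `weight` are illustrative placeholders."""
    return task_loss + weight * target_transfer_penalty(speaker_emb, face_emb)


if __name__ == "__main__":
    speaker = [1.0, 0.0]   # toy speaker turn embedding
    face = [0.0, 1.0]      # toy face embedding of the same identity
    print(regularized_loss(1.0, speaker, face, weight=0.5))
```

In this sketch the penalty is zero when the two normalized embeddings coincide and grows as they diverge, so a larger `weight` ties the speaker space more tightly to the face space.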
