Text adaptation for speaker verification with speaker-text factorized embeddings
Yexin Yang, Shuai Wang, Xun Gong, Yanmin Qian, Kai Yu

TL;DR
This paper introduces a novel text adaptation framework using speaker-text factorized embeddings to improve text-dependent speaker verification performance under text mismatch conditions.
Contribution
The paper proposes a speaker-text factorization network that separates speaker and text embeddings, enabling effective text adaptation with minimal data.
Findings
Text adaptation significantly improves SV performance under text mismatch.
The proposed method effectively extracts text embeddings for adaptation.
Experiments on RSR2015 validate the approach's effectiveness.
Abstract
Text mismatch between pre-collected data, either training data or enrollment data, and the actual test data can significantly hurt text-dependent speaker verification (SV) system performance. Although this problem can be solved by carefully collecting data with the target speech content, such data collection can be costly and inflexible. In this paper, we propose a novel text adaptation framework to address the text mismatch issue. A speaker-text factorization network factorizes the input speech into speaker embeddings and text embeddings and then integrates them into a single representation at a later stage. Given a small amount of speaker-independent adaptation utterances, text embeddings of the target speech content can be extracted and used to adapt the text-independent speaker embeddings to text-customized speaker embeddings. Experiments on RSR2015 show that text…
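The adaptation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the embedding sizes, the random matrix `W` standing in for a trained integration layer, the choice of concatenation followed by a linear map, and the averaging of adaptation text embeddings are all assumptions made for illustration, and every function name below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes, chosen only for illustration.
SPK_DIM, TXT_DIM, OUT_DIM = 8, 4, 6

# Assumption: a fixed random linear map stands in for a trained integration layer.
W = rng.standard_normal((SPK_DIM + TXT_DIM, OUT_DIM))


def integrate(spk_emb, txt_emb):
    """Combine a speaker embedding and a text embedding into one unit-length vector."""
    joint = np.concatenate([spk_emb, txt_emb]) @ W
    return joint / np.linalg.norm(joint)


def text_prototype(adaptation_txt_embs):
    """Average the text embeddings of speaker-independent adaptation utterances."""
    return np.mean(adaptation_txt_embs, axis=0)


def adapt(spk_emb, adaptation_txt_embs):
    """Turn a text-independent speaker embedding into a text-customized one."""
    return integrate(spk_emb, text_prototype(adaptation_txt_embs))


def cosine_score(a, b):
    """Cosine similarity between two embeddings, used here as the verification score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Usage with random stand-ins for network outputs.
enroll_spk = rng.standard_normal(SPK_DIM)          # speaker embedding from enrollment
target_txt = rng.standard_normal((5, TXT_DIM))     # text embeddings of 5 adaptation utterances
customized = adapt(enroll_spk, target_txt)
print(customized.shape, cosine_score(customized, customized))
```

In this sketch the speaker embedding is left untouched and only the text side is swapped for the target content, which mirrors the idea that adaptation data need not come from the enrolled speaker.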
