Deep Representation Decomposition for Rate-Invariant Speaker Verification
Fuchuan Tong, Siqi Zheng, Haodong Zhou, Xingjia Xie, Qingyang Hong, Lin Li

TL;DR
This paper introduces a deep learning method that decomposes a speaker embedding into an identity-related component and a rate-related component, using adversarial training to make the identity component invariant to speaking rate and improve speaker verification under speaking-rate mismatch.
Contribution
The paper proposes a deep representation decomposition approach with adversarial learning that learns speaking rate-invariant speaker embeddings, reducing the intra-class discrepancy that speaking rate causes in speaker verification.
Findings
Improved verification accuracy on VoxCeleb1 and HI-MIA datasets.
Effective reduction of speaking rate influence on speaker embeddings.
Demonstrated robustness of identity features against speaking rate variations.
Abstract
While deep speaker embeddings have achieved promising performance for speaker verification, this advantage diminishes under speaking-style variability. Speaking rate mismatch is often observed in practical speaker verification systems and can degrade system performance. To reduce the intra-class discrepancy caused by speaking rate, we propose a deep representation decomposition approach with adversarial learning to learn speaking rate-invariant speaker embeddings. Specifically, adopting an attention block, we decompose the original embedding into an identity-related component and a rate-related component through multi-task training. Additionally, to reduce the latent relationship between the two decomposed components, we further propose a cosine mapping block to train the parameters adversarially to minimize the cosine similarity between the two decomposed…
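The decomposition idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the element-wise soft mask used here as a stand-in for the attention block, the function names `decompose` and `cosine_similarity`, and the toy values are all assumptions made for illustration. The sketch only shows how an embedding might be split into two components and how the cosine similarity between them, the quantity the abstract says is minimized adversarially, could be measured.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decompose(embedding, gate_logits):
    """Split an embedding into two parts with an element-wise soft mask.

    Assumption: a sigmoid mask stands in for the paper's attention block,
    whose exact form is not given in the abstract.
    """
    mask = [sigmoid(g) for g in gate_logits]
    identity_part = [m * e for m, e in zip(mask, embedding)]
    rate_part = [(1.0 - m) * e for m, e in zip(mask, embedding)]
    return identity_part, rate_part

def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

# Toy values (hypothetical): a 4-dimensional embedding and gate logits.
embedding = [0.5, -1.2, 0.8, 2.0]
gate_logits = [4.0, -4.0, 4.0, -4.0]

identity_part, rate_part = decompose(embedding, gate_logits)

# A training objective could penalize this value so that the two
# components carry as little shared information as possible.
penalty = abs(cosine_similarity(identity_part, rate_part))
```

With this mask the two parts always sum back to the original embedding, so no information is discarded by the split itself. When the gates are undecided (all logits zero) the two parts are identical and the similarity is 1; as the gates saturate toward 0 or 1, the parts occupy nearly disjoint dimensions and the similarity approaches 0.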
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
