Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis
Marc-André Carbonneau, Benjamin van Niekerk, Hugo Seuté, Jean-Philippe Letendre, Herman Kamper, Julian Zaïdi

TL;DR
This paper examines the limitations of current speaker verification embeddings in capturing dynamic voice features and introduces U3D, a new metric for assessing rhythm to improve speaker similarity evaluation in speech synthesis.
Contribution
The paper shows that widely used ASV embeddings focus on static voice features and proposes U3D, a metric that evaluates speakers' dynamic rhythm patterns, to complement them when assessing speaker identity.
Findings
ASV embeddings mainly capture static features like timbre and pitch range.
Current embeddings neglect dynamic elements such as rhythm.
U3D effectively evaluates dynamic rhythm patterns for speaker similarity.
Abstract
Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.
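To make the embedding-based assessment mentioned in the abstract concrete, here is a minimal sketch of one common way to score speaker similarity: cosine similarity between two fixed-length embeddings. It assumes the embeddings were already extracted by some ASV model (not shown); the abstract does not specify the scoring function, and the function name and toy vectors below are hypothetical. This sketch does not implement U3D.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real ASV embeddings (assumption:
# real embeddings typically have far more dimensions).
reference = [0.9, 0.1, 0.3, 0.2]      # embedding of the target speaker
synthesized = [0.8, 0.2, 0.4, 0.1]    # embedding of the generated speech
score = cosine_similarity(reference, synthesized)
print(f"speaker similarity: {score:.3f}")
```

Because each utterance is reduced to a single vector, a score like this cannot by itself distinguish two utterances that differ only in timing, which is consistent with the paper's observation that such embeddings neglect rhythm.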
