Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization

Natalia Tomashenko; Emmanuel Vincent; Marc Tommasi

arXiv:2412.17164 · eess.AS · April 25, 2025

TL;DR

This paper examines how speech timing, in particular phoneme durations, affects speaker verification and voice anonymization, and highlights the need to modify phonetic durations to strengthen privacy and prevent speaker identification.

Contribution

It introduces metrics that perform speaker verification using only phoneme durations and shows that these durations leak speaker identity in both original and anonymized speech.

Findings

01

Phoneme durations leak speaker identity information.

02

A speaker's speech rate and phonetic durations must be taken into account for privacy protection.

03

Modifying phonetic durations can improve voice anonymization.

Abstract

In this paper, we investigate the impact of speech temporal dynamics on automatic speaker verification and speaker voice anonymization tasks. We propose several metrics to perform automatic speaker verification based only on phoneme durations. Experimental results demonstrate that phoneme durations leak some speaker information and can reveal speaker identity from both original and anonymized speech. Thus, this work emphasizes the importance of taking into account the speaker's speech rate and, more importantly, the speaker's phonetic duration characteristics, as well as the need to modify them in order to develop anonymization systems with strong privacy protection capacity.
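The abstract does not spell out the proposed metrics, so the following is only a rough illustration of what a duration-only verification setup could look like, not the authors' method. It assumes each utterance is available as a list of (phoneme, duration in seconds) pairs, for example from a forced aligner. It builds a per-phoneme mean log-duration profile, scores a trial as the negative mean absolute difference between two profiles over shared phonemes, and evaluates scores with an approximate equal error rate (EER). The input format, the function names `duration_profile`, `duration_score` and `equal_error_rate`, and the log-duration distance are all hypothetical choices made for this sketch.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical input format (an assumption, not the paper's): one utterance is a
# list of (phoneme, duration_in_seconds) pairs, e.g. taken from a forced aligner.
Alignment = list[tuple[str, float]]


def duration_profile(alignments: list[Alignment]) -> dict[str, float]:
    """Mean log-duration of each phoneme over a set of utterances."""
    per_phone = defaultdict(list)
    for alignment in alignments:
        for phone, dur in alignment:
            if dur > 0:  # skip zero-length segments, log is undefined there
                per_phone[phone].append(math.log(dur))
    return {phone: mean(logs) for phone, logs in per_phone.items()}


def duration_score(enroll: dict[str, float], test: dict[str, float]) -> float:
    """Hypothetical trial score: negative mean absolute difference of mean
    log-durations over phonemes present on both sides (higher = more similar)."""
    shared = enroll.keys() & test.keys()
    if not shared:
        return -math.inf  # no common phonemes: treat as maximally dissimilar
    return -mean(abs(enroll[p] - test[p]) for p in shared)


def equal_error_rate(target: list[float], nontarget: list[float]) -> float:
    """Approximate EER by sweeping the decision threshold over observed scores."""
    best_gap, eer = math.inf, 1.0
    for t in sorted(set(target) | set(nontarget)):
        frr = sum(s < t for s in target) / len(target)        # missed targets
        far = sum(s >= t for s in nontarget) / len(nontarget)  # false accepts
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Comparing log-durations makes the distance depend on duration ratios rather than absolute differences, so long and short phonemes contribute on a similar scale. Because the abstract distinguishes overall speech rate from phonetic duration characteristics, one possible variant (also an assumption) is to divide each duration by its utterance's mean phoneme duration before building the profile, which removes global tempo and leaves only the relative per-phoneme pattern.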

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing