Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems

Natalia Tomashenko; Emmanuel Vincent; Marc Tommasi

arXiv:2507.15214 · cs.SD · July 22, 2025

TL;DR

This paper introduces a new method for extracting context-dependent duration features from speech to improve voice anonymization attack systems, revealing vulnerabilities in speaker verification and anonymization techniques.

Contribution

The paper proposes representing speaker characteristics with context-dependent duration embeddings and develops attack models that outperform simpler representations of speech temporal dynamics reported in the literature.

Findings

01

The proposed attack models significantly improve speaker verification performance on both original and anonymized data.

02

Vulnerabilities are identified in current voice anonymization systems.

03

Duration features can be exploited to compromise speaker privacy.

Abstract

The temporal dynamics of speech, encompassing variations in rhythm, intonation, and speaking rate, contain important and unique information about speaker identity. This paper proposes a new method for representing speaker characteristics by extracting context-dependent duration embeddings from speech temporal dynamics. We develop novel attack models using these representations and analyze the potential vulnerabilities in speaker verification and voice anonymization systems. The experimental results show that the developed attack models provide a significant improvement in speaker verification performance for both original and anonymized data in comparison with simpler representations of speech temporal dynamics reported in the literature.
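
To make the idea of a context-dependent duration feature concrete, the sketch below averages phone durations per (left, phone, right) context and compares two speakers' profiles by cosine similarity. It is an illustrative stand-in only: this summary does not describe how the paper's duration embeddings or attack models are built, and the input format (a list of phone/duration pairs, for example from a forced aligner) and the function names `context_duration_profile` and `cosine_similarity` are assumptions made here.

```python
from collections import defaultdict
from math import sqrt


def context_duration_profile(alignment):
    """Return the mean duration (seconds) of each (left, phone, right) context.

    `alignment` is a list of (phone, duration_seconds) pairs for one speaker.
    This input format is hypothetical, not taken from the paper.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Pad with boundary symbols so the first and last phones have a context.
    phones = ["<s>"] + [phone for phone, _ in alignment] + ["</s>"]
    for i, (_, duration) in enumerate(alignment):
        context = (phones[i], phones[i + 1], phones[i + 2])
        sums[context] += duration
        counts[context] += 1
    return {context: sums[context] / counts[context] for context in sums}


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity restricted to contexts present in both profiles."""
    shared = set(profile_a) & set(profile_b)
    if not shared:
        return 0.0
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = sqrt(sum(profile_a[c] ** 2 for c in shared))
    norm_b = sqrt(sum(profile_b[c] ** 2 for c in shared))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A verification-style attack would score a trial by comparing an enrollment profile with a test profile. Averaged durations like these are closer to the simpler baselines the abstract mentions than to the proposed embeddings.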
