Why disentanglement-based speaker anonymization systems fail at preserving emotions?

Ünal Ege Gaznepoglu; Nils Peters

arXiv:2501.13000 · eess.AS · January 23, 2025

TL;DR

This paper investigates why current disentanglement-based speaker anonymization systems fail to preserve emotional content, identifying the lack of emotion information in intermediate representations as a key factor.

Contribution

The study provides a comprehensive evaluation of a state-of-the-art system, revealing the main causes of emotion loss and highlighting the impact of speaker embeddings and synthesis artifacts.

Findings

1. The lack of emotion information in the intermediate representation is the main cause of emotion loss.

2. Speaker embeddings learned in a generative context significantly affect emotion preservation.

3. Synthesis artifacts bias emotion recognition towards anger.

Abstract

Disentanglement-based speaker anonymization involves decomposing speech into a semantically meaningful representation, altering the speaker embedding, and resynthesizing a waveform using a neural vocoder. State-of-the-art systems of this kind are known to remove emotion information. Possible reasons include mode collapse in GAN-based vocoders, unintended modeling and modification of emotions through speaker embeddings, or excessive sanitization of the intermediate representation. In this paper, we conduct a comprehensive evaluation of a state-of-the-art speaker anonymization system to understand the underlying causes. We conclude that the main reason is the lack of emotion-related information in the intermediate representation. The speaker embeddings also have a high impact if they are learned in a generative context. The vocoder's out-of-distribution performance has a smaller impact.…
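
The three stages named in the abstract (decompose, alter the speaker embedding, resynthesize) can be illustrated with a toy sketch. This is not the evaluated system: every function below (`analyze`, `anonymize_embedding`, `synthesize`, `anonymize`) is a hypothetical stand-in operating on plain lists of floats, and the "features" are arbitrary placeholders chosen only to show the data flow. The coarse rounding in `analyze` is an assumed stand-in for an intermediate representation that discards detail: whatever is dropped there cannot be restored by the synthesis stage, which is the kind of bottleneck the abstract identifies as the main cause of emotion loss.

```python
from dataclasses import dataclass
from typing import List
import random


@dataclass
class Disentangled:
    content: List[float]   # placeholder for the intermediate (linguistic) representation
    speaker: List[float]   # placeholder for a speaker embedding


def analyze(waveform: List[float], emb_dim: int = 4) -> Disentangled:
    """Toy decomposition (hypothetical, not a real feature extractor)."""
    # Coarse rounding mimics a representation that drops fine detail.
    content = [round(x, 1) for x in waveform]
    # Average of strided samples as an arbitrary fixed-size "embedding".
    speaker = []
    for i in range(emb_dim):
        chunk = waveform[i::emb_dim]
        speaker.append(sum(chunk) / max(1, len(chunk)))
    return Disentangled(content, speaker)


def anonymize_embedding(speaker: List[float], seed: int = 0) -> List[float]:
    """Replace the embedding with a pseudo-random one of the same size."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in speaker]


def synthesize(rep: Disentangled) -> List[float]:
    """Toy 'vocoder': content plus a small speaker-dependent offset."""
    offset = sum(rep.speaker) / len(rep.speaker)
    return [c + 0.01 * offset for c in rep.content]


def anonymize(waveform: List[float], seed: int = 0) -> List[float]:
    rep = analyze(waveform)
    new_speaker = anonymize_embedding(rep.speaker, seed)
    return synthesize(Disentangled(rep.content, new_speaker))
```

Calling `anonymize(samples, seed=1)` on a list of floats returns a list of the same length. The output depends only on the rounded content and the replacement embedding, so any information removed in `analyze` is absent from the result regardless of how the last stage is built.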

Taxonomy

Topics: Speech Recognition and Synthesis · Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection