Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation

Xining Song; Zhihua Wei; Rui Wang; Haixiao Hu; Yanxiang Chen; Meng Han

arXiv:2512.06304 · eess.AS · December 9, 2025

TL;DR

This paper provides a comprehensive overview of how input manipulations affect voice conversion models, highlighting their vulnerabilities to degraded speech and exploring attack and defense strategies for robustness.

Contribution

It classifies existing attack and defense methods based on input manipulation and evaluates their impact on VC performance across multiple perceptual dimensions.

Findings

01

Degraded input speech significantly reduces VC output quality.

02

Current models are vulnerable to various input manipulation attacks.

03

Future research should focus on enhancing robustness and defense mechanisms.

Abstract

Identity, accent, style, and emotion are essential components of human speech. Voice conversion (VC) techniques take the speech signals of two input speakers, optionally along with auxiliary information in other modalities such as prompts and emotion tags, and change the para-linguistic features of one speaker to those of the other while preserving the linguistic content. Recently, VC models have advanced rapidly in both generation quality and personalization capability. These developments have attracted considerable attention for diverse applications, including privacy preservation, voice-print reproduction for the deceased, and dysarthric speech recovery. However, because they are trained on clean data, these models learn only non-robust features. As a result, they perform poorly on degraded input speech in real-world scenarios, including additive noise, reverberation, adversarial…
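
As a rough illustration of one degradation the abstract names, noise added to the input speech, the sketch below mixes white Gaussian noise into a signal at a chosen signal-to-noise ratio (SNR). It is not taken from the paper: the function names, the sine tone standing in for a speech waveform, and the 10 dB setting are all assumptions made for illustration.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def add_noise_at_snr(clean, snr_db, seed=0):
    """Return `clean` plus white Gaussian noise scaled to a target SNR in dB.

    Hypothetical helper for illustration; not an API from the paper.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # Scale the noise so that 10 * log10(P_clean / P_noise) equals snr_db.
    target_noise_power = power(clean) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, degraded):
    """SNR in dB of `degraded` relative to the `clean` reference."""
    residual = [d - c for c, d in zip(clean, degraded)]
    return 10.0 * math.log10(power(clean) / power(residual))


# Toy stand-in for a speech waveform: one second of a 220 Hz tone at 16 kHz.
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
         for t in range(sample_rate)]
degraded = add_noise_at_snr(clean, snr_db=10.0)
print(round(measured_snr_db(clean, degraded), 2))
```

Feeding `degraded` rather than `clean` to a VC model at several SNR values is one simple way to probe the sensitivity to input quality that the abstract describes; reverberation and adversarial perturbations would need different manipulations.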

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Emotion and Mood Recognition