Emotion-Aware Prefix: Towards Explicit Emotion Control in Voice Conversion Models

Haoyuan Yang; Mu Yang; Jiamin Xie; Szu-Jui Chen; John H.L. Hansen

arXiv:2603.09120·eess.AS·March 11, 2026

Open Access

TL;DR

This paper introduces Emotion-Aware Prefix, a method that significantly enhances explicit emotion control in voice conversion models, achieving higher emotion accuracy while preserving speech quality and speaker identity.

Contribution

It proposes Emotion-Aware Prefix, which adds explicit emotion control to a two-stage voice conversion backbone by jointly controlling sequence modulation and acoustic realization, improving emotion conversion accuracy and generalizability in zero-shot voice conversion.

Findings

01. Emotion Conversion Accuracy increased from 42.40% to 85.50%.

02. Maintains linguistic integrity and speech quality.

03. Preserves speaker identity despite emotion modulation.

Abstract

Recent zero-shot voice conversion models have shown potential for emotion control, yet their performance is suboptimal or inconsistent because of limited expressive capacity. We propose Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion backbone. The method roughly doubles the baseline Emotion Conversion Accuracy (ECA), from 42.40% to 85.50%, while maintaining linguistic integrity and speech quality and without compromising speaker identity. Our ablation study suggests that joint control of both sequence modulation and acoustic realization is essential to synthesize distinct emotions. Furthermore, comparative analysis verifies the generalizability of the proposed method and provides insight into the role of acoustic decoupling in maintaining speaker identity.
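The abstract reports ECA without defining it. As a minimal sketch, assuming ECA is the percentage of converted utterances whose emotion, as labeled by some emotion classifier, matches the target emotion, the metric and the reported gain can be worked through as follows. The function name, the label lists, and the definition itself are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def emotion_conversion_accuracy(predicted: Sequence[str], target: Sequence[str]) -> float:
    """Percentage of utterances whose predicted emotion equals the target emotion.

    Assumed definition for illustration; the paper may compute ECA differently.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        raise ValueError("need at least one utterance")
    hits = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * hits / len(target)


# Hypothetical labels: target emotions vs. classifier output on converted speech.
target = ["happy", "sad", "angry", "happy"]
predicted = ["happy", "sad", "neutral", "happy"]
eca = emotion_conversion_accuracy(predicted, target)

# Figures taken from the abstract.
baseline_eca, proposed_eca = 42.40, 85.50
ratio = proposed_eca / baseline_eca          # relative improvement factor
absolute_gain = proposed_eca - baseline_eca  # gain in percentage points

print(f"toy ECA: {eca:.2f}%")
print(f"reported gain: {absolute_gain:.2f} points, {ratio:.2f}x the baseline")
```

With the reported figures, the gain is 43.10 percentage points, or about 2.02 times the baseline, which is what the abstract summarizes as doubling.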

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Voice and Speech Disorders