Emotion-Aware Prefix: Towards Explicit Emotion Control in Voice Conversion Models
Haoyuan Yang, Mu Yang, Jiamin Xie, Szu-Jui Chen, John H.L. Hansen

TL;DR
This paper introduces Emotion-Aware Prefix, a method that significantly enhances explicit emotion control in voice conversion models, achieving higher emotion accuracy while preserving speech quality and speaker identity.
Contribution
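It proposes Emotion-Aware Prefix, which adds explicit emotion control to a two-stage voice conversion backbone through joint control of sequence modulation and acoustic realization, improving emotion conversion accuracy and generalizability in zero-shot voice conversion.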
Findings
Emotion Conversion Accuracy (ECA) roughly doubled over the baseline, from 42.40% to 85.50%.
Maintains linguistic integrity and speech quality.
Preserves speaker identity despite emotion modulation.
Abstract
Recent advances in zero-shot voice conversion have shown potential for emotion control, yet performance remains suboptimal or inconsistent because of the models' limited expressive capacity. We propose Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion backbone. We significantly improve emotion conversion performance, doubling the baseline Emotion Conversion Accuracy (ECA) from 42.40% to 85.50% while maintaining linguistic integrity and speech quality, without compromising speaker identity. Our ablation study suggests that joint control of both sequence modulation and acoustic realization is essential to synthesize distinct emotions. Furthermore, comparative analysis verifies the generalizability of the proposed method and provides insights into the role of acoustic decoupling in maintaining speaker identity.
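To make the headline number concrete, the sketch below shows one plausible way an accuracy-style metric such as ECA could be computed, plus the arithmetic behind the "doubling" claim. The abstract does not define ECA, so the definition here (the share of converted utterances whose predicted emotion matches the target emotion) is an assumption, and the function name and the label lists are hypothetical. Only the two percentages come from the paper.

```python
def emotion_conversion_accuracy(predicted, target):
    """Percentage of converted utterances whose predicted emotion equals the target.

    Assumed definition for illustration; the paper's exact protocol may differ.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("need at least one utterance")
    hits = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * hits / len(predicted)


# Hypothetical labels, e.g. from an emotion classifier run on converted speech.
predicted = ["happy", "sad", "angry", "happy"]
target = ["happy", "sad", "happy", "happy"]
eca = emotion_conversion_accuracy(predicted, target)  # 3 of 4 match -> 75.0

# Figures reported in the abstract.
baseline_eca = 42.40
proposed_eca = 85.50
absolute_gain = proposed_eca - baseline_eca  # gain in percentage points
relative_gain = proposed_eca / baseline_eca  # ratio to the baseline
```

With the reported figures, the absolute gain is about 43.1 percentage points and the ratio is about 2.02, which is consistent with the abstract's description of doubling the baseline.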
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Voice and Speech Disorders
