Speaker-agnostic Emotion Vector for Cross-speaker Emotion Intensity Control
Masato Murata, Koichi Miyazaki, Tomoki Koriyama

TL;DR
This paper introduces a speaker-agnostic emotion vector for cross-speaker emotion intensity control in speech synthesis, ensuring speaker consistency and high speech quality even for unseen speakers.
Contribution
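The paper proposes a speaker-agnostic emotion vector that captures emotional expressions shared across multiple speakers. It targets the loss of speaker consistency that emotion arithmetic, which relies on a single-speaker emotion vector, shows in the cross-speaker setting.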
Findings
Successful cross-speaker emotion intensity control with maintained speaker consistency
Effective in unseen speaker scenarios
High speech quality and controllability achieved
Abstract
Cross-speaker emotion intensity control aims to generate emotional speech of a target speaker with desired emotion intensities using only their neutral speech. A recently proposed method, emotion arithmetic, achieves emotion intensity control using a single-speaker emotion vector. Although this prior method has shown promising results in the same-speaker setting, it lost speaker consistency in the cross-speaker setting due to mismatches between the emotion vector of the source and target speakers. To overcome this limitation, we propose a speaker-agnostic emotion vector designed to capture shared emotional expressions across multiple speakers. This speaker-agnostic emotion vector is applicable to arbitrary speakers. Experimental results demonstrate that the proposed method succeeds in cross-speaker emotion intensity control while maintaining speaker consistency, speech quality, and…
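The abstract does not say how the emotion vector is computed or how the speaker-agnostic version is built, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that an emotion vector is the parameter difference between an emotional and a neutral model of one speaker, that the shared vector is a plain average of several single-speaker vectors, and that intensity is a scalar multiplier applied when the vector is added to a target speaker's neutral parameters. All function names and the toy parameter values are hypothetical.

```python
# Toy illustration only: "models" are dicts mapping a parameter name to a list of floats.
# Assumption: emotion vector = emotional parameters - neutral parameters.
# Assumption: the shared vector is an element-wise mean over speakers.

def emotion_vector(neutral, emotional):
    """Element-wise difference between an emotional and a neutral parameter set."""
    return {k: [e - n for n, e in zip(neutral[k], emotional[k])] for k in neutral}

def average_vectors(vectors):
    """Element-wise mean of several single-speaker emotion vectors."""
    count = len(vectors)
    return {
        k: [sum(vals) / count for vals in zip(*(v[k] for v in vectors))]
        for k in vectors[0]
    }

def apply_emotion(target_neutral, vector, intensity):
    """Add the emotion vector, scaled by intensity, to a target speaker's parameters."""
    return {
        k: [p + intensity * d for p, d in zip(target_neutral[k], vector[k])]
        for k in target_neutral
    }

# Two hypothetical source speakers, each with neutral and "happy" parameters.
neutral_a, happy_a = {"w": [0.0, 1.0]}, {"w": [1.0, 1.0]}
neutral_b, happy_b = {"w": [2.0, 2.0]}, {"w": [2.0, 4.0]}

vec_a = emotion_vector(neutral_a, happy_a)
vec_b = emotion_vector(neutral_b, happy_b)
shared = average_vectors([vec_a, vec_b])

# A target speaker for whom only neutral parameters exist.
target_neutral = {"w": [10.0, 10.0]}
half_happy = apply_emotion(target_neutral, shared, 0.5)
full_happy = apply_emotion(target_neutral, shared, 1.0)
```

In this sketch, an intensity of 0 leaves the target speaker's parameters unchanged, and larger values move them further along the shared emotion direction. Averaging is one simple way to suppress speaker-specific components of the single-speaker vectors; the paper may use a different construction.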
