Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

Xiaoxue Gao; Chen Zhang; Yiming Chen; Huayun Zhang; Nancy F. Chen

arXiv:2409.10157·eess.AS·September 17, 2024

Open Access

TL;DR

Emo-DPO is a controllable emotional speech synthesis method that combines direct preference optimization with an emotion-aware LLM-TTS architecture to better capture the nuances between emotions and improve synthesis quality.

Contribution

The paper presents Emo-DPO, an approach that combines direct preference optimization with an emotion-aware LLM-TTS architecture for emotional speech synthesis.

Findings

01 Outperforms existing baselines in emotional speech synthesis

02 Effectively captures subtle emotional nuances

03 Leverages LLM capabilities for improved control

Abstract

Current emotional text-to-speech (TTS) models predominantly conduct supervised training to learn the conversion from text and desired emotion to its emotional speech, focusing on a single emotion per text-speech pair. These models only learn the correct emotional outputs without fully comprehending other emotion characteristics, which limits their capabilities of capturing the nuances between different emotions. We propose a controllable Emo-DPO approach, which employs direct preference optimization to differentiate subtle emotional nuances between emotions through optimizing towards preferred emotions over less preferred emotional ones. Instead of relying on traditional neural architectures used in existing emotional TTS models, we propose utilizing the emotion-aware LLM-TTS neural architecture to leverage LLMs' in-context learning and instruction-following capabilities. Comprehensive…
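The abstract describes optimizing towards preferred emotions over less preferred ones but does not state the training objective. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair, assuming the preferred sample is speech in the target emotion and the rejected sample is speech in a different emotion. The function name, argument names, and the default `beta` are hypothetical and not taken from the paper; the paper's actual objective may differ.

```python
import math


def dpo_pair_loss(policy_logp_preferred, policy_logp_rejected,
                  ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under a frozen reference model. All names and the beta default are
    illustrative assumptions, not values from the Emo-DPO paper.
    """
    # How much more the policy favours each sample than the reference does.
    preferred_gain = policy_logp_preferred - ref_logp_preferred
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (preferred_gain - rejected_gain)

    # loss = -log(sigmoid(margin)) = softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy equals the reference, the loss is log(2).
baseline = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)

# Raising the preferred sample's likelihood relative to the rejected one
# lowers the loss.
improved = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy assigns relatively more probability to the preferred emotional output than the reference model does, which is the mechanism by which a preference objective can separate closely related emotions.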

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition