Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization
Xiaoxue Gao, Chen Zhang, Yiming Chen, Huayun Zhang, Nancy F. Chen

TL;DR
Emo-DPO is a controllable emotional speech synthesis method that combines direct preference optimization with an emotion-aware LLM-TTS architecture to better capture emotional nuances and improve synthesis quality.
Contribution
The paper presents Emo-DPO, an approach that combines direct preference optimization with an emotion-aware LLM-TTS architecture for emotional speech synthesis.
Findings
Outperforms existing baselines in emotional speech synthesis
Effectively captures subtle emotional nuances
Leverages LLM capabilities for improved control
Abstract
Current emotional text-to-speech (TTS) models are predominantly trained with supervision to map text and a desired emotion to emotional speech, using a single emotion per text-speech pair. Such models learn only the correct emotional output without modeling the characteristics of other emotions, which limits their ability to capture the nuances between different emotions. We propose a controllable Emo-DPO approach, which employs direct preference optimization to differentiate subtle nuances between emotions by optimizing towards preferred emotions over less preferred ones. Instead of relying on the traditional neural architectures used in existing emotional TTS models, we propose an emotion-aware LLM-TTS neural architecture that leverages LLMs' in-context learning and instruction-following capabilities. Comprehensive…
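The preference idea in the abstract can be illustrated with the standard direct preference optimization (DPO) loss. The sketch below is not taken from the paper; it assumes the generic DPO objective, where the "preferred" sample is speech in the target emotion and the "rejected" sample is speech in a different emotion for the same text. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical choices made for illustration.

```python
import math


def dpo_loss(policy_logp_pref, policy_logp_rej,
             ref_logp_pref, ref_logp_rej, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative sketch).

    Each argument is a sequence log-probability (a float):
      policy_*  : log-prob under the model being trained
      ref_*     : log-prob under a frozen reference model
      *_pref    : the preferred output (assumed: target-emotion speech tokens)
      *_rej     : the rejected output (assumed: other-emotion speech tokens)
    beta scales how strongly the policy is pushed away from the reference;
    0.1 is an assumed default, not a value from the paper.
    """
    # How much more the policy favors each output than the reference does.
    pref_gain = policy_logp_pref - ref_logp_pref
    rej_gain = policy_logp_rej - ref_logp_rej
    z = beta * (pref_gain - rej_gain)

    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed stably.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # The policy favors the preferred emotion more than the reference does,
    # so the loss falls below the no-preference value log(2).
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

The loss falls as the policy assigns relatively more probability to the preferred emotion than the reference model does, and rises when it favors the rejected one. This is one way a model could be trained to separate emotions rather than only imitate a single correct target.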
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition
