On the Emotion Understanding of Synthesized Speech
Yuan Ge, Haishu Zhao, Aokai Hao, Junxiang Zhang, Bei Li, Xiaoqian Liu, Chenglong Wang, Jianjin Wang, Bingsen Zhou, Bingyu Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao

TL;DR
This paper critically evaluates whether current speech emotion recognition models can accurately understand emotion in synthesized speech, revealing significant limitations due to representation mismatch and reliance on textual semantics.
Contribution
It systematically assesses SER models on synthesized speech, highlighting their inability to generalize and exposing the challenges in capturing paralinguistic cues.
Findings
SER models do not generalize well to synthesized speech
Representation mismatch caused by speech token prediction affects emotion recognition
Generative SLMs rely more on textual semantics than paralinguistic cues
Abstract
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our…
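The kind of comparison the abstract describes can be illustrated with a minimal sketch: score the same SER model on human and on synthesized utterances, then look at the difference in accuracy. The function names (`accuracy`, `generalization_gap`) and the toy labels are assumptions made for illustration only; they are not the paper's code, metric definition, or results.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], intended: Sequence[str]) -> float:
    """Fraction of utterances whose predicted emotion matches the intended one."""
    if len(predicted) != len(intended):
        raise ValueError("predicted and intended must have the same length")
    if len(intended) == 0:
        raise ValueError("at least one utterance is required")
    correct = sum(p == t for p, t in zip(predicted, intended))
    return correct / len(intended)


def generalization_gap(
    human_pred: Sequence[str],
    human_true: Sequence[str],
    synth_pred: Sequence[str],
    synth_true: Sequence[str],
) -> float:
    """Accuracy on human speech minus accuracy on synthesized speech.

    A large positive value suggests the SER model does not transfer
    to synthesized speech (hypothetical metric, for illustration).
    """
    return accuracy(human_pred, human_true) - accuracy(synth_pred, synth_true)


# Toy, made-up labels (NOT results from the paper).
human_true = ["happy", "sad", "angry", "neutral"]
human_pred = ["happy", "sad", "angry", "happy"]
synth_true = ["happy", "sad", "angry", "neutral"]
synth_pred = ["neutral", "sad", "neutral", "neutral"]

gap = generalization_gap(human_pred, human_true, synth_pred, synth_true)
print(f"human acc: {accuracy(human_pred, human_true):.2f}")
print(f"synth acc: {accuracy(synth_pred, synth_true):.2f}")
print(f"gap: {gap:.2f}")
```

In practice the predictions would come from an actual SER model run on paired human and synthesized recordings of the same text and target emotion, so that the gap reflects the audio rather than differences in content.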
Taxonomy
Topics: Emotion and Mood Recognition · Mental Health via Writing · Speech Recognition and Synthesis
