On the Emotion Understanding of Synthesized Speech

Yuan Ge; Haishu Zhao; Aokai Hao; Junxiang Zhang; Bei Li; Xiaoqian Liu; Chenglong Wang; Jianjin Wang; Bingsen Zhou; Bingyu Liu; Jingbo Zhu; Zhengtao Yu; Tong Xiao

arXiv:2603.16483 · cs.CL · March 18, 2026

TL;DR

This paper critically evaluates whether current speech emotion recognition models can accurately understand emotion in synthesized speech, revealing significant limitations due to representation mismatch and reliance on textual semantics.

Contribution

The paper systematically assesses SER models on synthesized speech, showing that they fail to generalize and exposing the difficulty of capturing paralinguistic cues.

Findings

01. SER models do not generalize well to synthesized speech

02. Representation mismatch caused by speech token prediction affects emotion recognition

03. Generative SLMs rely more on textual semantics than paralinguistic cues

Abstract

Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our…

Taxonomy

Topics: Emotion and Mood Recognition · Mental Health via Writing · Speech Recognition and Synthesis