The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Wen Hsu, Yun-Man Hsu, Chun Wei Chen, Shrikanth Narayanan, Hung-yi Lee

TL;DR
This paper critically examines emotion embedding similarity metrics for speech generation and shows that they often misalign with human perception: linguistic and speaker variation interferes with the emotional signal, and the metrics can be satisfied by acoustic mimicry.
Contribution
It demonstrates that current emotion similarity metrics are unreliable for zero-shot evaluation and highlights their limitations in capturing genuine emotional expressiveness.
Findings
Emotion embeddings are affected by linguistic and speaker interference.
High classification accuracy of an emotion encoder does not imply that similarity in its embedding space measures emotion effectively.
Metrics tend to reward acoustic mimicry rather than genuine emotional transfer.
Abstract
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals it rewards…
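The metric described in the abstract, cosine similarity between a reference embedding and a generated-sample embedding, can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the three-dimensional vectors are hand-made placeholders (one assumed "emotion" axis and two assumed "speaker" axes), not outputs of emotion2vec or any real encoder. They are chosen only to show how speaker-related components can dominate the score when they carry more of the vector's magnitude than the emotional component.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption, not real encoder output).
# Layout: [emotion axis, speaker-A axis, speaker-B axis]
ref_happy_spk_a = [1.0, 3.0, 0.0]   # reference: happy, speaker A
gen_happy_spk_b = [1.0, 0.0, 3.0]   # same emotion, different speaker
gen_sad_spk_a = [-1.0, 3.0, 0.0]    # different emotion, same speaker

sim_same_emotion = cosine_similarity(ref_happy_spk_a, gen_happy_spk_b)
sim_same_speaker = cosine_similarity(ref_happy_spk_a, gen_sad_spk_a)

# In this toy setup the sample with the wrong emotion but the matching
# speaker receives the higher similarity score.
print(f"same emotion, other speaker: {sim_same_emotion:.2f}")
print(f"other emotion, same speaker: {sim_same_speaker:.2f}")
```

In this constructed case the emotion-mismatched sample outscores the emotion-matched one, which is the kind of interference the abstract attributes to the metric. The magnitudes are arbitrary; whether and how strongly real encoders behave this way is what the paper's controlled tests examine.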
