Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment
Zhiyu Lin, Jingwen Yang, Jiale Zhao, Meng Liu, Sunzhu Li, Benyou Wang

TL;DR
DeEAR is a framework that objectively measures speech expressiveness by aligning automated scores with human preferences across emotion, prosody, and spontaneity, supporting the evaluation and development of speech models.
Contribution
The paper introduces DeEAR, an efficient, psychology-grounded method for quantifying speech expressiveness in line with human perception, enabling better benchmarking and data curation.
Findings
Achieves a Spearman's rank correlation coefficient (SRCC) of 0.86 with human perception using fewer than 500 annotated samples.
Improves S2S speech expressiveness scores from 2.0 to 23.4 on a 100-point scale.
Creates the ExpressiveSpeech dataset with 14K expressive utterances.
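For readers unfamiliar with the SRCC figure reported above, the following is a minimal sketch of how a Spearman rank correlation between human ratings and automated scores can be computed. It is not the authors' evaluation code: the function names `average_ranks` and `spearman_rcc` and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rcc(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: human expressiveness ratings and automated scores
# for the same five utterances (not taken from the paper).
human_ratings = [1, 2, 3, 4, 5]
model_scores = [20.0, 10.0, 40.0, 30.0, 50.0]
print(round(spearman_rcc(human_ratings, model_scores), 2))
```

Because SRCC depends only on rank order, it rewards an automated metric for ordering utterances the way listeners do, regardless of the scale on which the scores are expressed.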
Abstract
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but…
Taxonomy
Topics: Emotion and Mood Recognition · Phonetics and Phonology Research · Speech Recognition and Synthesis
