Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment

Zhiyu Lin; Jingwen Yang; Jiale Zhao; Meng Liu; Sunzhu Li; Benyou Wang

arXiv:2510.20513·cs.SD·October 24, 2025

TL;DR

DeEAR is a framework that converts human preference for speech expressiveness into an objective score across three dimensions (emotion, prosody, and spontaneity), supporting the evaluation and development of speech models.

Contribution

The paper introduces DeEAR, an efficient method grounded in phonetics and psychology for quantifying speech expressiveness in alignment with human perception, enabling fair benchmarking and targeted data curation.

Findings

01. Achieves a Spearman's rank correlation coefficient (SRCC) of 0.86 with human perception using fewer than 500 annotated samples.

02. Improves speech-to-speech (S2S) expressiveness scores from 2.0 to 23.4 on a 100-point scale.

03. Creates the ExpressiveSpeech dataset with 14K expressive utterances.

Abstract

Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but…
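The abstract reports alignment with human perception as an SRCC value. For readers unfamiliar with the metric, the following is a minimal sketch of how a Spearman rank correlation between automated scores and human ratings could be computed. It is not the paper's evaluation code: the function names (`average_ranks`, `pearson`, `srcc`) and the toy numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(scores, ratings):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(scores) != len(ratings) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(scores), average_ranks(ratings))


# Hypothetical toy data: automated scores (0-100) and human ratings (1-5).
model_scores = [12.0, 55.5, 31.0, 78.2, 40.1]
human_ratings = [1.5, 3.5, 3.0, 4.5, 2.5]
print(round(srcc(model_scores, human_ratings), 3))
```

Because the metric depends only on rank order, the automated score and the human rating do not need to share a scale. The sketch assumes neither input is constant, since a constant sequence has zero variance and the correlation is then undefined.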

Code & Models

Datasets

FreedomIntelligence/ExpressiveSpeech
dataset · 113 dl

Taxonomy

Topics: Emotion and Mood Recognition · Phonetics and Phonology Research · Speech Recognition and Synthesis