Investigation for Relative Voice Impression Estimation
Kenichi Fujita, Yusuke Ijima

TL;DR
This paper introduces a framework for estimating perceptual differences between utterances from the same speaker, highlighting the effectiveness of self-supervised speech representations over classical features in capturing subtle impression shifts.
Contribution
It is the first systematic study of relative voice impression estimation, comparing modeling approaches and demonstrating the superiority of self-supervised speech models for this task.
Findings
Self-supervised models outperform classical acoustic features in relative voice impression estimation (RIE).
Multimodal large language models (MLLMs) are unreliable for fine-grained impression estimation.
Self-supervised speech representations effectively capture complex perceptual shifts.
Abstract
Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark-Bright"). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models…
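To make the estimation target concrete, here is a minimal sketch of a pairwise regressor that maps two utterance embeddings to a low-dimensional shift vector. It is not the authors' model: the abstract does not describe the architecture, so the difference feature, the ridge regressor, the function names, and the dimensions are all assumptions, and the embeddings are synthetic stand-ins for pooled self-supervised speech features.

```python
import numpy as np

def pair_features(emb_first, emb_second):
    """Assumed pair representation: second embedding minus first."""
    return np.asarray(emb_second) - np.asarray(emb_first)

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression without intercept:
    W = (X^T X + alpha * I)^{-1} X^T Y."""
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(dim), X.T @ Y)

def predict_shift(W, emb_first, emb_second):
    """Predicted shift of the second utterance relative to the first,
    one value per antonymic axis."""
    return pair_features(emb_first, emb_second) @ W

# Synthetic stand-ins (hypothetical sizes): 200 utterance pairs,
# 16-dimensional embeddings, 3 impression axes.
rng = np.random.default_rng(0)
n_pairs, emb_dim, n_axes = 200, 16, 3
emb_first = rng.normal(size=(n_pairs, emb_dim))
emb_second = rng.normal(size=(n_pairs, emb_dim))

# Synthetic relative-impression targets generated from a known linear map
# plus small noise, so the fit can be checked against the ground truth.
true_W = rng.normal(size=(emb_dim, n_axes))
noise = 0.01 * rng.normal(size=(n_pairs, n_axes))
targets = (emb_second - emb_first) @ true_W + noise

W = fit_ridge(pair_features(emb_first, emb_second), targets, alpha=1e-3)

# Swapping the order of the pair flips the sign of the predicted shift.
forward = predict_shift(W, emb_first[0], emb_second[0])
backward = predict_shift(W, emb_second[0], emb_first[0])
```

Using a difference feature with no intercept makes the prediction antisymmetric by construction: if the second utterance is predicted to be brighter than the first, the reversed pair is predicted to be darker by the same amount. Whether the paper enforces this property is not stated in the abstract.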
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Phonetics and Phonology Research
