Investigation for Relative Voice Impression Estimation

Kenichi Fujita; Yusuke Ijima

arXiv:2602.14172 · cs.SD · February 19, 2026

Open Access

TL;DR

This paper introduces a framework for estimating perceptual differences between utterances from the same speaker, highlighting the effectiveness of self-supervised speech representations over classical features in capturing subtle impression shifts.

Contribution

The paper is the first systematic study of relative voice impression estimation. It compares modeling approaches and shows that self-supervised speech models perform best on this task.

Findings

01. Self-supervised models outperform classical acoustic features in relative voice impression estimation (RIE).

02. Multimodal large language models (MLLMs) are unreliable for fine-grained impression estimation.

03. Self-supervised speech representations effectively capture complex perceptual shifts.

Abstract

Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., "Dark–Bright"). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models…
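
The task setup described in the abstract (two utterances in, one low-dimensional shift vector out) can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify an architecture, so the sketch assumes utterance-level embeddings are already available (random vectors stand in for self-supervised features here), represents each pair by the difference of its two embeddings, and fits a plain ridge regression. The names `pair_features`, `fit_ridge`, `predict_shift`, the embedding size, and the number of impression axes are all hypothetical.

```python
import numpy as np


def pair_features(emb_first, emb_second):
    """Pair representation (an assumption): second embedding minus first."""
    return np.asarray(emb_second) - np.asarray(emb_first)


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression without intercept.

    Solves (X^T X + alpha * I) W = X^T Y for W.
    """
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(dim), X.T @ Y)


def predict_shift(W, emb_first, emb_second):
    """Predicted shift of the second utterance relative to the first."""
    return pair_features(emb_first, emb_second) @ W


# Synthetic stand-in data; real inputs would be pooled speech embeddings
# and listener-derived shift vectors (one value per antonymic axis).
rng = np.random.default_rng(0)
n_pairs, emb_dim, n_axes = 200, 16, 3
emb_first = rng.normal(size=(n_pairs, emb_dim))
emb_second = rng.normal(size=(n_pairs, emb_dim))
true_W = rng.normal(size=(emb_dim, n_axes))
noise = 0.1 * rng.normal(size=(n_pairs, n_axes))
targets = (emb_second - emb_first) @ true_W + noise

W = fit_ridge(pair_features(emb_first, emb_second), targets)

forward = predict_shift(W, emb_first[0], emb_second[0])
backward = predict_shift(W, emb_second[0], emb_first[0])
```

Because the pair is encoded as a difference and the regressor has no intercept, swapping the two utterances flips the sign of the predicted shift (`backward` equals `-forward`), which matches the idea of a signed position along an antonymic axis. Whether the paper's models enforce this property is not stated in the abstract.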

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Phonetics and Phonology Research