Same Words, Different Judgments: How Preferences Vary Across Modalities
Aaron Broukhim, Nadir Weibel, Eshin Jolly

TL;DR
This study compares human and synthetic preference annotations across text and speech in AI evaluation, finding marked modality-specific differences and a need for modality-specific evaluation protocols.
Contribution
It provides the first controlled cross-modal comparison of preference annotations, showing that text and audio raters differ in how they report preferences and agree with each other only at near-chance levels.
Findings
Audio raters show narrower decision thresholds and reduced length bias.
Synthetic ratings can predict inter-rater agreement effectively.
Modality-specific evaluation protocols are necessary for audio data.
Abstract
Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences. However, evaluation protocols for such data were designed for text and have not been validated for speech. We present the first ICC-based, controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. We show that achieving agreement within either modality (ICC(2,k) of .80) requires 9 raters. At the same time, modalities show marked differences in how people report preferences: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. We demonstrate that synthetic ratings can be used to effectively predict inter-rater agreement, thus serving…
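To make the agreement statistic concrete, the sketch below computes the two-way random-effects intraclass correlation for a complete items-by-raters table and then uses the Spearman-Brown formula to project how many raters would be needed for their averaged rating to reach a target reliability. This is a generic textbook calculation, not the authors' analysis code. The function names `icc2` and `raters_needed`, the 0.80 default target, and the small ratings table are all illustrative assumptions; the table is toy data, not the study's.

```python
import math

def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete items-by-raters table.

    Two-way random-effects, absolute-agreement form. Assumes every
    rater scored every item (no missing cells).
    """
    n = len(ratings)       # number of rated items (rows)
    k = len(ratings[0])    # number of raters (columns)
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-item mean square
    msc = ss_cols / (k - 1)                # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))     # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average

def raters_needed(single, target=0.80):
    """Spearman-Brown projection: smallest rater count whose averaged
    rating reaches `target`, given single-rater reliability `single`."""
    m = target * (1 - single) / (single * (1 - target))
    # Small tolerance so exact integers are not rounded up by float error.
    return math.ceil(m - 1e-9)

# Toy table: 6 items (rows) scored by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc2(ratings)
needed = raters_needed(single, target=0.80)
print(f"ICC(2,1)={single:.2f}  ICC(2,k)={average:.2f}  raters for .80: {needed}")
```

The projection step treats raters as interchangeable draws from one population, which is the usual assumption behind ICC(2,k). It describes agreement within a single modality and does not by itself capture cross-modality agreement.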
