Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models

Wenda Zhang; Hongyu Jin; Siyi Wang; Zhiqiang Wei; Ting Dang

arXiv:2601.14620·eess.AS·January 22, 2026

Open Access

TL;DR

This paper investigates using large audio-language models to generate synthetic annotations that improve emotion recognition in speech, addressing annotation scarcity and ambiguity issues.

Contribution

It introduces a framework for creating synthetic perceptual proxies with ALMs to augment human annotations in speech emotion recognition.

Findings

01. Synthetic annotations improve emotion distribution accuracy in low-ambiguity cases.

02. Augmentation benefits diminish for highly ambiguous emotions with high human disagreement.

03. Proposed DiME-Aug strategy addresses class imbalance and enables unbiased evaluation.
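
The findings separate low-ambiguity cases from cases with high human disagreement, but this summary does not say how the paper measures ambiguity. One common way to quantify disagreement, shown here purely as an illustrative assumption, is the normalized entropy of an utterance's label distribution. The function name `normalized_entropy` is hypothetical and is not taken from the paper.

```python
import math


def normalized_entropy(dist):
    """Return the Shannon entropy of `dist` scaled to the range [0, 1].

    0 means all annotators agree on one emotion; 1 means votes are
    spread uniformly over all classes. This is an assumed ambiguity
    score, not necessarily the one the paper uses.
    """
    k = len(dist)
    if k < 2:
        return 0.0
    # Skip zero-probability classes: p * log(p) tends to 0 as p tends to 0.
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(k)


# Full agreement versus maximal disagreement over four classes.
unanimous = normalized_entropy([1.0, 0.0, 0.0, 0.0])
split = normalized_entropy([0.25, 0.25, 0.25, 0.25])
```

Under this sketch, utterances with a score near 0 would count as low-ambiguity and those near 1 as highly ambiguous; any threshold between the two would be a further assumption.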

Abstract

Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate the annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework leveraging ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve ground-truth distribution reliability. We validate these proxies through statistical analysis of their alignment with human distributions and evaluate their impact by fine-tuning ALMs with the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased…
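
To make the idea in the abstract concrete, the following minimal sketch turns a handful of annotator votes into an emotion distribution, pools in synthetic votes, and scores how closely the synthetic distribution tracks the human one. Everything in it is an assumption for illustration: the four-class label set, the example votes, the simple vote pooling, and the choice of Jensen-Shannon divergence as the alignment statistic. The paper's actual augmentation procedure and statistical analysis are not described in this summary.

```python
from collections import Counter
import math

# Hypothetical label set; the paper's classes are not listed here.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def label_distribution(votes, classes=EMOTIONS):
    """Convert a list of categorical votes into a probability distribution."""
    counts = Counter(votes)
    total = sum(counts[c] for c in classes)
    if total == 0:
        raise ValueError("no votes for any known class")
    return [counts[c] / total for c in classes]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, at most 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Illustrative votes for one utterance (made up for this example).
human_votes = ["happy", "happy", "neutral"]
synthetic_votes = ["happy", "neutral", "happy", "happy", "sad"]

human_dist = label_distribution(human_votes)
synthetic_dist = label_distribution(synthetic_votes)
# Naive augmentation: pool human and synthetic votes with equal weight.
augmented_dist = label_distribution(human_votes + synthetic_votes)
alignment = js_divergence(human_dist, synthetic_dist)
```

With only three human votes, each vote moves the estimated distribution by a third, which is the sparsity problem the abstract describes. Pooling in more votes smooths the estimate, at the risk of importing whatever bias the synthetic annotator has.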

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Sentiment Analysis and Opinion Mining