Decoding Ambiguous Emotions with Test-Time Scaling in Audio-Language Models

Hong Jia; Weibin Li; Jingyao Wu; Xiaofeng Yu; Yan Gao; Jintao Cheng; Xiaoyu Tang; Feng Xia; Ting Dang

arXiv:2602.03873 · cs.SD · February 5, 2026

Open Access

TL;DR

This paper introduces a benchmark for ambiguous emotion recognition in speech that evaluates audio-language models (ALMs) under test-time scaling (TTS), examining how model capacity and inference-time scaling relate to the handling of affective ambiguity.

Contribution

This work is the first to evaluate audio-language models (ALMs) with test-time scaling (TTS) strategies on ambiguous emotion recognition, providing a systematic comparison and an analysis of how the two interact.

Findings

01

ALMs show potential for nuanced affective reasoning without explicit emotion supervision.

02

Test-time scaling improves model adaptability to ambiguous emotions.

03

The benchmark highlights challenges and future directions for emotion-aware speech AI.

Abstract

Emotion recognition from human speech is a critical enabler for socially aware conversational AI. However, while most prior work frames emotion recognition as a categorical classification problem, real-world affective states are often ambiguous, overlapping, and context-dependent, posing significant challenges for both annotation and automatic modeling. Recent large-scale audio language models (ALMs) offer new opportunities for nuanced affective reasoning without explicit emotion supervision, but their capacity to handle ambiguous emotions remains underexplored. At the same time, advances in inference-time techniques such as test-time scaling (TTS) have shown promise for improving generalization and adaptability in hard NLP tasks, but their relevance to affective computing is still largely unknown. In this work, we introduce the first benchmark for ambiguous emotion recognition in…
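To make the idea of an ambiguous emotion label concrete, here is a minimal sketch of one common test-time scaling pattern: sample several model responses for the same utterance and aggregate them into a soft label distribution rather than a single class. The abstract excerpt above does not say which TTS strategies the paper evaluates, so repeated sampling with vote aggregation is an assumption here. The label set, the `sample_label` stand-in (which replaces a real ALM query with a fixed random draw), and the sample count of 16 are all hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical label set; the paper's actual emotion categories are not given here.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def sample_label(rng):
    """Stand-in for one stochastic ALM response (assumption: a real system
    would query the model with sampling enabled instead of this fixed draw)."""
    return rng.choices(EMOTIONS, weights=[0.1, 0.1, 0.4, 0.4])[0]


def vote_distribution(labels):
    """Turn sampled labels into a soft distribution over EMOTIONS."""
    counts = Counter(labels)
    total = len(labels)
    return {emotion: counts.get(emotion, 0) / total for emotion in EMOTIONS}


def entropy(dist):
    """Shannon entropy in nats; higher values indicate a more ambiguous prediction."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


rng = random.Random(0)
samples = [sample_label(rng) for _ in range(16)]  # 16 samples is an arbitrary choice
dist = vote_distribution(samples)
print(dist)
print(f"entropy: {entropy(dist):.3f} nats")
```

Keeping the full distribution, instead of taking only the majority vote, preserves the case where two emotions are roughly equally plausible, which is the situation the abstract describes as ambiguous and overlapping.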

Taxonomy

Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis