AcoustEmo: Open-Vocabulary Emotion Reasoning via Utterance-Aware Acoustic Q-Former
Liyun Zhang, Xuanmeng Sha, Shuqiong Wu, Fengkai Liu

TL;DR
AcoustEmo introduces a time-sensitive multimodal model with an utterance-aware acoustic Q-Former that captures fine-grained, local acoustic features for improved open-vocabulary emotion recognition in dialogues.
Contribution
The paper presents a novel Utterance-Aware Acoustic Q-Former that dynamically extracts segment-level audio tokens, enabling detailed temporal acoustic modeling within a multimodal large language model (MLLM) framework.
Findings
Outperforms baselines on the EMER task
Enhances complex emotion reasoning
Maintains robust contextual accuracy
Abstract
Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation shifts within individual utterances. To address this, we propose AcoustEmo, a time-sensitive MLLM featuring a novel Utterance-Aware Acoustic Q-Former. Our approach utilizes a timestamp-synchronized sliding window to dynamically extract segment-level audio tokens instead of coarse global representations. This enables the model to explicitly trace the temporal evolution of subtle acoustic cues and capture deep contextual dependencies in dialogues. Experiments on the Explainable Multimodal Emotion Recognition (EMER) task show that AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines…
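The timestamp-synchronized sliding window described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the summary gives no window length, hop size, or frame rate, so `win_s`, `hop_s`, and `frame_rate_hz` below are hypothetical parameters, and the mean pooling in `pool_segments` is only a placeholder for whatever the learned Q-Former queries would compute. The sketch shows one idea only: windows are clipped to an utterance's start and end timestamps, and each window yields one segment-level token rather than a single global representation.

```python
from typing import List, Sequence, Tuple


def utterance_windows(
    start_s: float, end_s: float, win_s: float = 1.0, hop_s: float = 0.5
) -> List[Tuple[float, float]]:
    """Slide a fixed-length window across one utterance.

    Windows are clipped to the utterance's own timestamps, so no window
    crosses into a neighbouring utterance. win_s and hop_s are assumed
    values, not taken from the paper.
    """
    if end_s <= start_s:
        return []
    windows = []
    t = start_s
    while t < end_s:
        windows.append((t, min(t + win_s, end_s)))
        if t + win_s >= end_s:  # the last window already reaches the end
            break
        t += hop_s
    return windows


def pool_segments(
    frames: Sequence[Sequence[float]],
    frame_rate_hz: float,
    windows: Sequence[Tuple[float, float]],
) -> List[List[float]]:
    """Produce one token per window by mean-pooling frame features.

    Mean pooling is a stand-in for the learned query attention of a
    Q-Former; it only demonstrates the segment-level granularity.
    """
    tokens = []
    for w_start, w_end in windows:
        lo = int(w_start * frame_rate_hz)
        hi = max(lo + 1, int(w_end * frame_rate_hz))
        chunk = frames[lo:hi]
        if not chunk:
            continue  # window lies beyond the available frames
        dim = len(chunk[0])
        tokens.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return tokens


if __name__ == "__main__":
    # Toy input: 2 seconds of 1-dimensional features at 10 frames per second.
    frames = [[float(i)] for i in range(20)]
    windows = utterance_windows(0.0, 2.0, win_s=1.0, hop_s=0.5)
    print(windows)
    print(pool_segments(frames, 10.0, windows))
```

With overlapping windows, consecutive tokens share frames, which is what lets a downstream model follow gradual changes such as an intonation shift within a single utterance.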
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Social Robot Interaction and HRI
