AcoustEmo: Open-Vocabulary Emotion Reasoning via Utterance-Aware Acoustic Q-Former

Liyun Zhang; Xuanmeng Sha; Shuqiong Wu; Fengkai Liu

arXiv:2603.20894 · cs.MM · March 24, 2026

TL;DR

AcoustEmo introduces a time-sensitive multimodal model with an utterance-aware acoustic Q-Former that captures fine-grained, local acoustic features for improved open-vocabulary emotion recognition in dialogues.

Contribution

The paper presents a novel Utterance-Aware Acoustic Q-Former that dynamically extracts segment-level audio tokens, enabling detailed temporal acoustic modeling within a multimodal large language framework.

Findings

01. Outperforms baselines on EMER task

02. Enhances complex emotion reasoning

03. Maintains robust contextual accuracy

Abstract

Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation shifts within individual utterances. To address this, we propose AcoustEmo, a time-sensitive MLLM featuring a novel Utterance-Aware Acoustic Q-Former. Our approach utilizes a timestamp-synchronized sliding window to dynamically extract segment-level audio tokens instead of coarse global representations. This enables the model to explicitly trace the temporal evolution of subtle acoustic clues and capture deep contextual dependencies in dialogues. Experiments on the Explainable Multimodal Emotion Recognition (EMER) task show that AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines…
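
The timestamp-synchronized sliding window described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `utterance_windows` and `segment_tokens`, the 1.0 s window and 0.5 s hop, and the frame-feature layout are hypothetical and not taken from the paper. Mean pooling stands in for the learned Q-Former, which would instead attend over each window's frames with trainable queries.

```python
from typing import List, Sequence, Tuple


def utterance_windows(start_s: float, end_s: float,
                      window_s: float = 1.0,
                      hop_s: float = 0.5) -> List[Tuple[float, float]]:
    """Cover one utterance [start_s, end_s] with overlapping windows.

    Window length and hop are illustrative defaults, not values from the paper.
    The last window is clipped so it never runs past the utterance end.
    """
    if end_s <= start_s or window_s <= 0 or hop_s <= 0:
        raise ValueError("need end_s > start_s and positive window/hop")
    windows = []
    i = 0
    while True:
        w_start = start_s + i * hop_s
        w_end = min(w_start + window_s, end_s)
        windows.append((w_start, w_end))
        if w_end >= end_s:
            break
        i += 1
    return windows


def segment_tokens(frames: Sequence[Sequence[float]],
                   frame_rate_hz: float,
                   windows: Sequence[Tuple[float, float]]) -> List[List[float]]:
    """Produce one token per window by mean-pooling the frames inside it.

    Mean pooling is a placeholder for the Q-Former, used only to keep the
    sketch self-contained.
    """
    tokens = []
    for w_start, w_end in windows:
        lo = int(w_start * frame_rate_hz)
        hi = max(lo + 1, int(w_end * frame_rate_hz))
        chunk = frames[lo:hi]
        if not chunk:
            continue  # window falls outside the available audio frames
        dim = len(chunk[0])
        tokens.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return tokens


# Toy usage: 4 s of 1-dimensional frame features at 2 frames per second,
# with a single utterance spanning 2.0 s to 4.0 s.
frames = [[float(i)] for i in range(8)]
windows = utterance_windows(2.0, 4.0)
tokens = segment_tokens(frames, frame_rate_hz=2.0, windows=windows)
```

Because the windows are anchored to the utterance's own start and end timestamps, the resulting tokens form an ordered sequence within that utterance rather than a single global audio vector, which is the contrast the abstract draws with global audio encoders.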

Taxonomy

Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Social Robot Interaction and HRI