AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

Ashutosh Chaubey; Jiacheng Pang; Maksim Siniukov; Mohammad Soleymani

arXiv:2602.07054·cs.LG·February 10, 2026

Open Access

TL;DR

This paper introduces AVERE, a method to enhance audiovisual emotion reasoning in multimodal large language models by addressing spurious associations and hallucinations through preference optimization and a new benchmark.

Contribution

It presents a novel preference optimization technique, AVEm-DPO, and a benchmark EmoReAlM for evaluating and improving multimodal models' emotion understanding capabilities.

Findings

1. Significant performance improvements of 6–19% in zero-shot settings.
2. Effective mitigation of modality-specific cue hallucinations.
3. Enhanced alignment of model responses with audiovisual inputs.

Abstract

Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models (MLLMs) have shown strong performance on this task, two key challenges remain: spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations, and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes…
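For readers unfamiliar with preference optimization, the response-preference idea in the abstract can be illustrated with the standard Direct Preference Optimization (DPO) loss. This is a minimal sketch of generic DPO only, not the AVEm-DPO objective: the audiovisual input-pair preferences and the regularization term mentioned above are not specified in this excerpt and are deliberately left out. The function names, argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the excerpt does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model, given the same
    audiovisual input and prompt. In the emotion-reasoning setting, the
    rejected response would be one with a spurious cue-emotion
    association or a hallucinated cue (an assumption for illustration).
    """
    # Implicit rewards: how much the policy has shifted toward each
    # response relative to the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss shrinks as the chosen response is preferred by a wider margin.
    return -log_sigmoid(chosen_reward - rejected_reward)
```

When the policy equals the reference model the loss is log 2; it falls below that as the policy assigns relatively more probability to the chosen response and rises above it in the opposite case.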

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- EmoReAlM benchmark: A comprehensive, human-verified benchmark for audiovisual emotion understanding that tests (a) cue-emotion associations, (b) modality agreement, and (c) robust stress tests designed to reveal spurious associations and hallucinations. The benchmark includes balanced tasks, adversarial cases, and metrics for spurious associations, modality agreement, and hallucination.
- AVEm-DPO optimization framework: A novel preference optimization method tailored to audiovisual emotion reas

Weaknesses

- Reliance on proprietary or large LLM tooling for some steps: The paper mentions using GPT-5 to polish text and using LLMs for annotation/evaluation. This reliance can raise reproducibility concerns if those tools or their prompts are not fully disclosed; it may also bias dataset construction and evaluation unless careful controls are provided.
- Potential dataset and evaluation biases: Although the benchmark is human-verified, the document suggests many generated QA items and uses subtitled no

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- A comprehensive suite of 4,000 human-verified multiple-choice questions (MCQs) across 2,649 unique videos, designed to evaluate three critical aspects of emotion reasoning.
- A multimodal direct preference optimization (DPO) method to align MLLMs with both audiovisual inputs and emotion-centric queries.
- Demonstrates that AVEm-DPO outperforms baselines by 6–19% in zero-shot settings across existing benchmarks and EmoReAlM, with qualitative and user studies confirming reduced hallucinations an

Weaknesses

- EmoReAlM is derived exclusively from the DFEW dataset, which may limit generalizability to videos with different cultural contexts, demographics, or emotion types.
- AVEm-DPO’s training data is generated automatically via Gemini 2.5 (without human verification). While the authors report performance gains, unvalidated preference pairs may introduce hidden biases.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper is well-organized and easy to follow, with clear and informative tables and figures that effectively support the presentation.
2. To reduce hallucinations, the authors propose using Direct Preference Optimization (DPO). The method incorporates fine-grained, modality-level preferences based on the input text and reasoning about whether a response is hallucinatory or relevant to emotion prediction. Additionally, a text-prior debiasing strategy is introduced to mitigate hallucination e

Weaknesses

1. What is the motivation to use LLMs for visual and audio emotion prediction? It is challenging for LLMs to accurately infer emotions based solely on captions, even for advanced models such as GPT-4o. Moreover, even when an LLM’s prediction matches the ground truth, it does not necessarily imply that the emotional trigger or the reasoning process behind the prediction is correct.
2. Is there any analysis on the individual roles of the visual and audio modalities? For example, which modality pr

Code & Models

Models

🤗 chaubeyG/AVERE-7B · model · 61 dl

Datasets

chaubeyG/EmoReAlM · dataset · 72 dl

Taxonomy

Topics: Emotion and Mood Recognition · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining