AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization
Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, Mohammad Soleymani

TL;DR
This paper introduces AVERE, a method to enhance audiovisual emotion reasoning in multimodal large language models by addressing spurious associations and hallucinations through preference optimization and a new benchmark.
Contribution
It presents a novel preference optimization technique, AVEm-DPO, and a benchmark EmoReAlM for evaluating and improving multimodal models' emotion understanding capabilities.
Findings
AVEm-DPO outperforms baselines by 6-19% in zero-shot settings, on existing benchmarks and on EmoReAlM.
Effective mitigation of modality-specific cue hallucinations.
Enhanced alignment of model responses with audiovisual inputs.
Abstract
Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain - spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes…
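For orientation, the generic DPO loss for a single preference pair can be sketched as below. This is the standard text-only formulation, not the AVEm-DPO objective: the abstract indicates that AVEm-DPO additionally builds preferences over audiovisual input pairs and adds a regularization term, neither of which is modeled here. The function names, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; strength of the implicit KL constraint
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)
```

The loss equals log 2 when the policy matches the reference, and it falls as the policy assigns relatively more probability to the preferred response (for example, one grounded in the actual audiovisual cues) than to the rejected one (for example, one containing a hallucinated cue).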
Peer Reviews
Decision: ICLR 2026 Poster
- EmoReAlM benchmark: A comprehensive, human-verified benchmark for audiovisual emotion understanding that tests (a) cue-emotion associations, (b) modality agreement, and (c) robust stress tests designed to reveal spurious associations and hallucinations. The benchmark includes balanced tasks, adversarial cases, and metrics for spurious associations, modality agreement, and hallucination.
- AVEm-DPO optimization framework: A novel preference optimization method tailored to audiovisual emotion reas
- Reliance on proprietary or large LLM tooling for some steps: The paper mentions using GPT-5 to polish text and using LLMs for annotation/evaluation. This reliance can raise reproducibility concerns if those tools or their prompts are not fully disclosed; it may also bias dataset construction and evaluation unless careful controls are provided.
- Potential dataset and evaluation biases: Although the benchmark is human-verified, the document suggests many generated QA items and uses subtitled no
- A comprehensive suite of 4,000 human-verified multiple-choice questions (MCQs) across 2,649 unique videos, designed to evaluate three critical aspects of emotion reasoning.
- A multimodal direct preference optimization (DPO) method to align MLLMs with both audiovisual inputs and emotion-centric queries.
- Demonstrates that AVEm-DPO outperforms baselines by 6–19% in zero-shot settings across existing benchmarks and EmoReAlM, with qualitative and user studies confirming reduced hallucinations an
- EmoReAlM is derived exclusively from the DFEW dataset, which may limit generalizability to videos with different cultural contexts, demographics, or emotion types.
- AVEm-DPO’s training data is generated automatically via Gemini 2.5 (without human verification). While the authors report performance gains, unvalidated preference pairs may introduce hidden biases
1. The paper is well-organized and easy to follow, with clear and informative tables and figures that effectively support the presentation.
2. To reduce hallucinations, the authors propose using Direct Preference Optimization (DPO). The method incorporates fine-grained, modality-level preferences based on the input text and reasoning about whether a response is hallucinatory or relevant to emotion prediction. Additionally, a text-prior debiasing strategy is introduced to mitigate hallucination e
1. What is the motivation to use LLMs for visual and audio emotion prediction? It is challenging for LLMs to accurately infer emotions based solely on captions, even for advanced models such as GPT-4o. Moreover, even when an LLM’s prediction matches the ground truth, it does not necessarily imply that the emotional trigger or the reasoning process behind the prediction is correct.
2. Is there any analysis on the individual roles of the visual and audio modalities? For example, which modality pr
Taxonomy
Topics: Emotion and Mood Recognition · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining
