When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Qilang Ye; Wei Zeng; Meng Liu; Jie Zhang; Yupeng Hu; Zitong Yu; Yu Zhou

arXiv:2511.10059·cs.CV·November 14, 2025

TL;DR

This paper introduces AV-ConfuseBench, a benchmark that tests whether Multimodal Large Language Models (MLLMs) can recognize audio-visual confusion, i.e., objects that are visually present but audio-absent, and proposes RL-CoMM, a reinforcement learning-based approach, to improve their audio-visual reasoning accuracy.

Contribution

The paper presents a new benchmark for audio-visual confusion and a reinforcement learning-based method to enhance MLLMs' reasoning capabilities in ambiguous scenarios.

Findings

01. MLLMs struggle with audio-visual confusion due to visual dominance.

02. RL-CoMM improves reasoning accuracy by 10-30% over baseline models.

03. The approach effectively reduces reasoning uncertainty in audio-visual tasks.

Abstract

Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking MLLMs, "Is there a/an muted-object sound?" Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we…
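
The probing setup described in the abstract (mute an object, then ask whether its sound is present) can be illustrated with a minimal sketch. This is not the authors' benchmark code: the function names, the question template wording, and the yes/no scoring rule are all assumptions made for illustration, based only on the example question quoted above.

```python
import re


def build_question(muted_object: str) -> str:
    """Build a yes/no probe for an object that is visible but muted.

    Hypothetical template modeled on the abstract's example question.
    """
    article = "an" if muted_object[:1].lower() in "aeiou" else "a"
    return f"Is there {article} {muted_object} sound?"


def is_correct(answer: str) -> bool:
    """Assumed scoring rule: the object was muted, so the expected answer is 'no'.

    Only the first word of the answer is checked.
    """
    words = re.findall(r"[a-z]+", answer.lower())
    return bool(words) and words[0] == "no"


def accuracy(answers: list[str]) -> float:
    """Fraction of model answers that correctly deny the muted sound."""
    if not answers:
        return 0.0
    return sum(is_correct(a) for a in answers) / len(answers)
```

A visually dominated model would tend to answer "yes" to such probes because the object is on screen, which lowers this accuracy score.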

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis