Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun, Ziwei Liu

TL;DR
This paper investigates the representation-action gap in omnimodal large language models: the models encode mismatches between a textual premise and their sensory input internally, but often fail to act on them, especially in audio grounding.
Contribution
Introduces IMAVB, a new benchmark for testing conflict detection in omnimodal models, and proposes a diagnostic method to improve their grounding behavior.
Findings
Models encode premise-perception mismatches internally.
Models often fail to reject false premises in outputs.
A prompt-resistant asymmetry between modalities is observed, especially in audio grounding.
Abstract
When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim…
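The 2x2 design described in the abstract can be illustrated with a small tally of premise-rejection rates per cell. This is only a sketch under stated assumptions: the `Trial` record, the `rejected` label, and the toy trials are hypothetical and are not taken from IMAVB or from the paper's evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical record for one benchmark item; field names are assumptions.
    modality: str   # target modality: "vision" or "audio"
    premise: str    # premise condition: "standard" or "misleading"
    rejected: bool  # whether the model's answer rejected the textual premise


def rejection_rates(trials):
    """Return the fraction of rejected premises for each (modality, premise) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [rejections, total]
    for t in trials:
        cell = (t.modality, t.premise)
        counts[cell][0] += int(t.rejected)
        counts[cell][1] += 1
    return {cell: r / n for cell, (r, n) in counts.items()}


# Made-up toy data, for illustration only (not results from the paper).
trials = [
    Trial("vision", "standard", False),
    Trial("vision", "misleading", True),
    Trial("vision", "misleading", False),
    Trial("audio", "standard", False),
    Trial("audio", "misleading", False),
    Trial("audio", "misleading", False),
]
rates = rejection_rates(trials)
for cell in sorted(rates):
    print(cell, rates[cell])
```

Separating the cells this way keeps conflict detection (the misleading column) apart from ordinary comprehension (the standard column), and lets the vision and audio rows be compared directly.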
