Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Trung Nguyen Quang; Yiming Gao; Fanyi Pu; Kaichen Zhang; Shuo Sun; Ziwei Liu

arXiv:2605.13737 · cs.AI · May 14, 2026

TL;DR

This paper investigates a Representation-Action Gap in omnimodal large language models: the models encode mismatches between a textual premise and their sensory input internally, but often fail to act on them in their outputs, especially in audio grounding.

Contribution

Introduces IMAVB, a 500-clip benchmark for testing whether omnimodal models detect conflicts between a textual premise and what they see or hear, and proposes a diagnostic method to improve their grounding behavior.

Findings

01. Models encode premise-perception mismatches internally.

02. Models often fail to reject false premises in their outputs.

03. A prompt-resistant modality asymmetry is observed, especially in audio grounding.

Abstract

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim…
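
The 2x2 design described in the abstract can be illustrated with a small scoring sketch. This is not the authors' evaluation code: the `Trial` record, its field names, the `rejection_rates` helper, and the toy trials are all hypothetical, and how a response is judged to have rejected a premise is left outside the sketch. It only shows how a rejection rate could be tallied separately for each cell of target modality by premise condition.

```python
from collections import defaultdict
from dataclasses import dataclass

# The two factors of the 2x2 design named in the abstract.
MODALITIES = ("vision", "audio")
CONDITIONS = ("standard", "misleading")


@dataclass(frozen=True)
class Trial:
    """One hypothetical benchmark item with a judged model response."""
    modality: str    # sense the question targets: "vision" or "audio"
    condition: str   # premise condition: "standard" or "misleading"
    rejected: bool   # True if the response rejected the textual premise


def rejection_rates(trials):
    """Return {(modality, condition): fraction of responses that rejected}."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [rejections, total]
    for t in trials:
        if t.modality not in MODALITIES or t.condition not in CONDITIONS:
            raise ValueError(f"unknown cell: {t.modality}/{t.condition}")
        cell = counts[(t.modality, t.condition)]
        cell[0] += int(t.rejected)
        cell[1] += 1
    return {key: rej / total for key, (rej, total) in counts.items()}


# Toy, made-up trials purely for illustration (not results from the paper).
toy_trials = [
    Trial("vision", "misleading", True),
    Trial("vision", "misleading", False),
    Trial("audio", "misleading", False),
    Trial("audio", "misleading", False),
    Trial("vision", "standard", False),
    Trial("audio", "standard", False),
]
rates = rejection_rates(toy_trials)
```

Under this framing, a high rejection rate is desirable in the misleading cells and undesirable in the standard cells, so comparing the two conditions within each modality separates conflict detection from a general tendency to refuse.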
