The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
Karan Goyal

TL;DR
This paper critically examines the trustworthiness of current Vision-Language Models, proposing a new information-theoretic framework to measure and improve genuine multimodal reasoning beyond dataset biases.
Contribution
It introduces the Modality Translation Protocol and three novel metrics to quantify the visual knowledge bottleneck, challenging existing evaluation methods and guiding future model design.
Findings
State-of-the-art models often exploit language priors, bypassing visual data.
The proposed metrics quantify the cost and limitations of visual reasoning in current models.
Scaling the language model may increase the penalty imposed by the visual knowledge bottleneck.
Abstract
The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery, but this framing rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness: they exploit strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the…
