Source-Modality Monitoring in Vision-Language Models
Etha Tianze Hua, Tian Yun, Ellie Pavlick

TL;DR
This paper explores how vision-language models track and communicate the origin of information from different input modalities, revealing the roles of syntactic and semantic signals in modality binding.
Contribution
It introduces the concept of source-modality monitoring in multimodal models and evaluates how models exploit syntactic versus semantic cues for modality binding.
Findings
Semantic signals tend to outweigh syntactic signals when the modalities are highly distinct distributionally.
Both syntactic and semantic cues are important for modality binding.
Findings have implications for model robustness and multimodal system design.
Abstract
We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like "image" in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases where modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal…
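As a rough illustration only, and not the authors' evaluation code, a target-modality information retrieval probe could be scored along the following lines. The sketch assumes a setup in which the image and the accompanying text give conflicting answers and the prompt names one modality as the target. The names `ConflictProbe`, `classify_response`, and `binding_accuracy` are hypothetical, and the plain substring matching is a simplification chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConflictProbe:
    """One hypothetical probe in which the two modalities disagree."""
    image_answer: str  # what the image depicts, e.g. "cat"
    text_answer: str   # what the accompanying text states, e.g. "dog"
    target: str        # modality named in the prompt: "image" or "text"


def classify_response(probe: ConflictProbe, response: str) -> str:
    """Label a model response as 'target', 'other', or 'unclear'.

    Uses case-insensitive substring matching, which is an assumption of
    this sketch and would be too crude for a real evaluation.
    """
    if probe.target not in ("image", "text"):
        raise ValueError(f"unknown target modality: {probe.target!r}")
    if probe.target == "image":
        target_ans, other_ans = probe.image_answer, probe.text_answer
    else:
        target_ans, other_ans = probe.text_answer, probe.image_answer
    resp = response.lower()
    has_target = target_ans.lower() in resp
    has_other = other_ans.lower() in resp
    if has_target and not has_other:
        return "target"
    if has_other and not has_target:
        return "other"
    return "unclear"  # mentions both answers or neither


def binding_accuracy(pairs) -> float:
    """Fraction of (probe, response) pairs answered from the target modality."""
    labels = [classify_response(p, r) for p, r in pairs]
    return labels.count("target") / len(labels) if labels else 0.0
```

Under these assumptions, a response that reports the non-target modality's answer counts as a binding failure, while a response that mentions both answers or neither is left unscored rather than guessed at.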
