Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov

TL;DR
This paper compares the circuits that vision-language models (VLMs) use for analogous visual and textual tasks, showing where they differ and where they overlap, and proposes a simple intervention that narrows the performance gap between the two modalities.
Contribution
It identifies modality-specific circuits in VLMs, analyzes their functionalities, and introduces a training-free method to improve visual data representations, closing part of the modality gap.
Findings
Circuits are largely disjoint between modalities but perform similar functions.
Visual representations align with textual ones only in later layers.
Patching visual token representations from later layers back into earlier layers reduces the modality gap by a third.
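The patching idea in the last finding can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the layer function `causal_mix_layer`, the helpers `run_layers` and `back_patch`, and the choice of layers and positions are hypothetical and are not the authors' implementation or any library's API. The sketch only shows the mechanics: run the model once, cache the hidden states, then re-run while overwriting the visual positions at an earlier layer with their cached states from a later layer.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Hidden = List[Vector]  # one vector per token position
Layer = Callable[[Hidden], Hidden]


def causal_mix_layer(scale: float) -> Layer:
    """Toy stand-in for a transformer layer (hypothetical).

    Each position adds `scale` times the mean of itself and all earlier
    positions, so later positions depend on earlier ones.
    """
    def layer(h: Hidden) -> Hidden:
        out: Hidden = []
        for i, vec in enumerate(h):
            prefix = h[: i + 1]
            mean = [sum(col) / len(prefix) for col in zip(*prefix)]
            out.append([v + scale * m for v, m in zip(vec, mean)])
        return out
    return layer


def run_layers(
    layers: Sequence[Layer],
    inputs: Hidden,
    patch_layer: Optional[int] = None,
    patch_positions: Sequence[int] = (),
    patch_values: Optional[Hidden] = None,
) -> List[Hidden]:
    """Run all layers and return the output of every layer.

    If `patch_layer` is given, the output of that layer is overwritten at
    `patch_positions` with the matching vectors from `patch_values`
    before it is passed on to the next layer.
    """
    cache: List[Hidden] = []
    h = [list(v) for v in inputs]
    for k, layer in enumerate(layers):
        h = layer(h)
        if patch_layer is not None and k == patch_layer:
            assert patch_values is not None
            for pos in patch_positions:
                h[pos] = list(patch_values[pos])
        cache.append([list(v) for v in h])
    return cache


def back_patch(
    layers: Sequence[Layer],
    inputs: Hidden,
    visual_positions: Sequence[int],
    src_layer: int,
    dst_layer: int,
) -> Tuple[List[Hidden], List[Hidden]]:
    """Copy visual-position states from a later layer into an earlier one.

    Returns (clean_cache, patched_cache).
    """
    if not 0 <= dst_layer < src_layer < len(layers):
        raise ValueError("need 0 <= dst_layer < src_layer < number of layers")
    clean = run_layers(layers, inputs)
    patched = run_layers(
        layers, inputs, dst_layer, visual_positions, clean[src_layer]
    )
    return clean, patched


# Toy example: positions 0 and 1 play the role of image tokens,
# position 2 the role of a text token that reads from them.
toy_layers = [causal_mix_layer(0.5) for _ in range(4)]
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
clean_cache, patched_cache = back_patch(
    toy_layers, toy_inputs, visual_positions=[0, 1], src_layer=2, dst_layer=0
)
```

In a real VLM the same pattern would typically be expressed with forward hooks on the model's layers, and which source and destination layers help would have to be chosen empirically; the toy above makes no claim about those choices.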
Abstract
Vision-language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the circuits (the task-specific computational sub-graphs) in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
