The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models

Alessandro Pietro Serra; Francesco Ortu; Emanuele Panizon; Lucrezia Valeriani; Lorenzo Basile; Alessio Ansuini; Diego Doimo; Alberto Cazzaniga

arXiv:2412.06646·cs.CV·February 2, 2026

TL;DR

This paper compares how native and non-native vision-language models process visual information and finds that native models route image-to-text communication through a narrow gate, a single token, which enables precise control over image semantics and downstream tasks.

Contribution

It introduces the concept of a narrow gate in native multimodal models and demonstrates its importance for effective image-understanding and controllable image-text communication.

Findings

01 Native VLMs use a single token as a narrow gate for visual information.

02 Ablating this token impairs image-understanding performance.

03 Token-level interventions enable fine-grained control over image semantics.
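
This page does not describe how the ablation is carried out. One common way to test whether a single token carries information to later tokens is attention knockout: masking the attention edges from text positions to that token and comparing the result with the unmodified run. The following is a minimal NumPy sketch on a toy single-head causal attention layer. The token layout, the gate position `GATE`, and the function name `attention_with_knockout` are illustrative assumptions, not the authors' code.

```python
import numpy as np

def attention_with_knockout(q, k, v, blocked=()):
    """Single-head causal attention with optional knocked-out edges.

    blocked: iterable of (query_idx, key_idx) pairs whose attention is cut.
    Each query must keep at least one visible key (e.g. itself).
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: a query may not attend to later positions.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    for qi, ki in blocked:
        mask[qi, ki] = True  # cut this query -> key edge
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over keys.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy sequence (assumed layout): positions 0-2 are image tokens,
# position 3 is a hypothetical gate token, positions 4-5 are text tokens.
rng = np.random.default_rng(0)
n_tokens, d = 6, 8
q, k, v = (rng.normal(size=(n_tokens, d)) for _ in range(3))
GATE, TEXT = 3, (4, 5)

out_base, w_base = attention_with_knockout(q, k, v)
out_ko, w_ko = attention_with_knockout(
    q, k, v, blocked=[(t, GATE) for t in TEXT]
)
```

In a real model the same mask would be applied in every layer and head, and the effect would be measured on a downstream image-understanding metric rather than on raw attention weights.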

Abstract

Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We compare native multimodal VLMs, models trained from scratch on multimodal data to generate both text and images, and non-native multimodal VLMs, models adapted from pre-trained large language models or capable of generating only text, highlighting key differences in information flow. We find that in native multimodal VLMs, image and text embeddings are more separated within the residual stream. Moreover, VLMs differ in how visual information reaches text: non-native multimodal VLMs exhibit a distributed communication pattern, where information is exchanged through…

Videos

The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models · SlidesLive

Taxonomy

Topics: Geographic Information Systems Studies · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques