Bridging Hidden States in Vision-Language Models
Benjamin Fein-Ashley, Jacob Fein-Ashley

TL;DR
This paper introduces a lightweight, bidirectional attention-based fusion module that aligns hidden states in vision-language models, improving performance on various benchmarks while maintaining efficiency.
Contribution
The paper proposes a fusion approach that directly aligns modality-specific hidden states using a few bidirectional cross-attention layers placed near the top of both encoders, enhancing multimodal understanding.
Findings
Outperforms comparable VLMs on retrieval, VQA, and visual reasoning benchmarks.
Maintains the efficiency of contrastive bi-encoder models.
Provides a simple, effective method for modality alignment.
Abstract
Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders…
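To make the layer described in the abstract concrete, the following is a minimal NumPy sketch of one bidirectional, cross-only fusion layer. It is an illustration under stated assumptions, not the authors' implementation: the class name `BridgeLayer`, the single-head attention, the layer normalization used as a "stabilizer", the zero-initialized tanh gates, and all dimensions are hypothetical choices that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    """Parameter-free layer norm (assumed stabilizer, not from the paper)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def cross_attend(queries, keys_values):
    """Single-head attention where one modality queries the other only."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


class BridgeLayer:
    """Hypothetical cross-only, bidirectional fusion layer."""

    def __init__(self, d_vision, d_text, d_shared, rng):
        s = 0.02
        # Projections into the shared space and back to each encoder width.
        self.v_in = rng.normal(0.0, s, (d_vision, d_shared))
        self.t_in = rng.normal(0.0, s, (d_text, d_shared))
        self.v_out = rng.normal(0.0, s, (d_shared, d_vision))
        self.t_out = rng.normal(0.0, s, (d_shared, d_text))
        # Assumed gating: tanh(0) = 0, so the layer starts as an identity map.
        self.gate_v = 0.0
        self.gate_t = 0.0

    def __call__(self, h_v, h_t):
        z_v = layer_norm(h_v) @ self.v_in  # vision tokens in shared space
        z_t = layer_norm(h_t) @ self.t_in  # text tokens in shared space
        upd_v = cross_attend(z_v, z_t) @ self.v_out  # vision attends to text
        upd_t = cross_attend(z_t, z_v) @ self.t_out  # text attends to vision
        # Gated residual updates sent back to each encoder stream.
        new_v = h_v + np.tanh(self.gate_v) * upd_v
        new_t = h_t + np.tanh(self.gate_t) * upd_t
        return new_v, new_t


rng = np.random.default_rng(0)
layer = BridgeLayer(d_vision=64, d_text=48, d_shared=32, rng=rng)
h_v = rng.normal(size=(16, 64))  # 16 image patch states
h_t = rng.normal(size=(8, 48))   # 8 text token states
out_v, out_t = layer(h_v, h_t)
```

Because each stream only receives a residual update, the output sequences keep the shapes of the encoder hidden states, so such a layer could in principle be stacked a few times near the top of both encoders without changing the rest of either model. With the assumed zero-initialized gates the layer initially leaves both streams unchanged.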
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
