Vision-Language Models Create Cross-Modal Task Representations

Grace Luo; Trevor Darrell; Amir Bar

arXiv:2410.22330 · cs.CV · May 8, 2025

Open Access

TL;DR

This paper finds that vision-language models align conceptually equivalent inputs into a shared task vector that is invariant to modality (text, image) and format (examples, instruction). The shared vector enables cross-modal transfer and may simplify the models' internal processing.

Contribution

It introduces the idea of a shared task vector in VLMs and shows that such vectors are invariant to modality and format, transfer from a base language model to its fine-tuned vision-language counterpart, and, in the cross-modal case, outperform prompting with the full task information.

Findings

01

Shared task vectors are modality-invariant and align conceptually equivalent inputs.

02

In the cross-modal case, a single compressed task vector outperforms prompting the model with the full task information.

03

Task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart, and can be derived from instructions alone.

Abstract

Autoregressive vision-language models (VLMs) can handle many tasks within a single model, yet the representations that enable this capability remain opaque. We find that VLMs align conceptually equivalent inputs into a shared task vector, which is invariant to modality (text, image) and format (examples, instruction), and may simplify VLM processing. We measure this alignment via cross-modal transfer -- the ability of a task vector derived in one modality to trigger the correct generation in another -- on a range of tasks and model architectures. Although the task vector is highly compressed, we find that this single vector outperforms prompting the model with the full task information, unique to this cross-modal case. Furthermore, we show that task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart, and that they can be derived solely…
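The cross-modal transfer described in the abstract can be pictured with a small toy. The sketch below is not the authors' procedure and does not touch a real VLM: the 2-D "embeddings", the image file names, and the helper functions `embed_text`, `embed_image`, `derive_task_vector`, `decode`, and `run` are all hypothetical, and adding a vector in a hand-built space stands in for intervening on real model activations. It only illustrates the shape of the idea: a task vector derived from text examples steers the output for an image query.

```python
# Toy illustration only: hand-built 2-D "embeddings", not a real VLM.
# Axis 0 identifies the country; axis 1 separates country (0) from capital (1).
CONCEPTS = {
    "France": (0.0, 0.0), "Paris": (0.0, 1.0),
    "Japan": (1.0, 0.0), "Tokyo": (1.0, 1.0),
    "Peru": (2.0, 0.0), "Lima": (2.0, 1.0),
}

# Hypothetical image inputs, each mapped to the concept it depicts.
IMAGES = {
    "flag_france.png": "France",
    "flag_japan.png": "Japan",
    "flag_peru.png": "Peru",
}


def embed_text(token):
    """Look up the toy embedding of a text token."""
    return CONCEPTS[token]


def embed_image(name):
    """Embed an image near its concept; the small shift mimics imperfect alignment."""
    x, y = CONCEPTS[IMAGES[name]]
    return (x + 0.1, y - 0.1)


def derive_task_vector(examples):
    """Average the output-minus-input offset over text-only (input, output) pairs."""
    n = len(examples)
    dx = sum(embed_text(out)[0] - embed_text(inp)[0] for inp, out in examples) / n
    dy = sum(embed_text(out)[1] - embed_text(inp)[1] for inp, out in examples) / n
    return (dx, dy)


def decode(hidden):
    """Return the concept whose embedding is closest to the hidden state."""
    def sq_dist(token):
        cx, cy = CONCEPTS[token]
        return (cx - hidden[0]) ** 2 + (cy - hidden[1]) ** 2
    return min(CONCEPTS, key=sq_dist)


def run(query_embedding, patch=None):
    """Toy forward pass; if `patch` is given, it is added to the hidden state."""
    hidden = query_embedding
    if patch is not None:
        hidden = (hidden[0] + patch[0], hidden[1] + patch[1])
    return decode(hidden)


# Derive the task vector from text examples only (country -> capital).
text_examples = [("France", "Paris"), ("Japan", "Tokyo")]
task_vector = derive_task_vector(text_examples)

# Apply it to an image query that never appeared in the examples.
unpatched = run(embed_image("flag_peru.png"))
patched = run(embed_image("flag_peru.png"), patch=task_vector)
print(unpatched, patched)
```

In this toy, the image query alone decodes to the country it depicts, while the same query combined with the text-derived task vector decodes to the capital. The transfer works here only because text and image inputs were placed in one shared space by construction; the paper's claim is that VLMs arrive at such alignment on their own.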

Taxonomy

Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Language, Metaphor, and Cognition

Methods: Balanced Selection · ALIGN