Linear Spaces of Meanings: Compositional Structures in Vision-Language Models

Matthew Trager, Pramuditha Perera, Luca Zancato, Alessandro Achille, Parminder Bhatia, Stefano Soatto

arXiv:2302.14383 · cs.LG · January 12, 2024 · 1 citation

Open Access

TL;DR

This paper explores the geometric and probabilistic structures of embeddings in vision-language models, demonstrating how simple linear operations can improve interpretability and task performance.

Contribution

The paper introduces a geometric framework for understanding compositionality in VLM embeddings and shows empirically that linear operations on those embeddings are effective for several vision-language tasks.

Findings

1. Linear algebraic operations enable interpretable manipulation of embeddings.

2. Compositional structures improve performance in classification and retrieval.

3. Embedding vectors can be approximated as combinations of "ideal words".

Abstract

We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and we evaluate their usefulness for solving different vision-language…
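To make the idea of "ideal words" concrete, here is a minimal sketch of what an additive decomposition of embeddings could look like for concepts built from two factors (for example, an attribute and an object). It is an illustration under stated assumptions, not the authors' implementation: the embeddings are synthetic NumPy vectors rather than CLIP outputs, the function names `ideal_word_decomposition` and `reconstruct` are hypothetical, and the averaging recipe (a global mean plus per-factor mean offsets) is an assumed construction that the abstract itself does not spell out.

```python
import numpy as np


def ideal_word_decomposition(emb):
    """Split a grid of embeddings into a shared mean plus per-factor offsets.

    emb has shape (n1, n2, d); emb[i, j] is the embedding of the composite
    concept formed by value i of the first factor and value j of the second.
    The averaging recipe here is an assumption made for illustration.
    """
    u0 = emb.mean(axis=(0, 1))          # shared component, shape (d,)
    u_first = emb.mean(axis=1) - u0     # one vector per first-factor value, (n1, d)
    u_second = emb.mean(axis=0) - u0    # one vector per second-factor value, (n2, d)
    return u0, u_first, u_second


def reconstruct(u0, u_first, u_second):
    """Rebuild the (n1, n2, d) grid as u0 + u_first[i] + u_second[j]."""
    return u0 + u_first[:, None, :] + u_second[None, :, :]


# Synthetic stand-in for encoder outputs (NOT real CLIP embeddings).
rng = np.random.default_rng(0)
n1, n2, d = 3, 4, 8
a = rng.normal(size=(n1, d))
b = rng.normal(size=(n2, d))
c = rng.normal(size=d)

# A perfectly additive grid, and a copy perturbed by small noise.
exact = c + a[:, None, :] + b[None, :, :]
noisy = exact + 0.01 * rng.normal(size=exact.shape)

u0, u1, u2 = ideal_word_decomposition(exact)
err_exact = np.abs(reconstruct(u0, u1, u2) - exact).max()

u0n, u1n, u2n = ideal_word_decomposition(noisy)
err_noisy = np.abs(reconstruct(u0n, u1n, u2n) - noisy).max()

print(f"max reconstruction error, additive grid: {err_exact:.2e}")
print(f"max reconstruction error, noisy grid:    {err_noisy:.2e}")
```

When the grid is exactly additive, the n1 + n2 offset vectors plus the shared mean reproduce all n1 × n2 embeddings up to floating-point error; with noise, the reconstruction is only approximate. This is the sense in which a small set of vectors can stand in for a much larger set of composite concepts.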

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling