Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
Matthew Trager, Pramuditha Perera, Luca Zancato, Alessandro Achille, Parminder Bhatia, Stefano Soatto

TL;DR
This paper explores the geometric and probabilistic structures of embeddings in vision-language models, demonstrating how simple linear operations can improve interpretability and task performance.
Contribution
It introduces a geometric framework for understanding compositionality in VLM embeddings and empirically shows the effectiveness of linear operations for various vision-language tasks.
Findings
Linear algebraic operations enable interpretable manipulation of embeddings
Compositional structures improve performance in classification and retrieval
Embedding vectors can be approximated as combinations of 'ideal words'
Abstract
We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and we evaluate their usefulness for solving different vision-language…
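The idea of approximating embeddings as combinations of a small set of "ideal words" can be illustrated with a minimal sketch. It assumes a toy setting with two factors (attributes and objects) and uses synthetic random vectors in place of real CLIP embeddings. The uniform averaging used to estimate the vectors, and all names (`emb`, `u0`, `u_attr`, `u_obj`), are assumptions of this sketch and are not necessarily the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 attributes x 4 objects, embedding dimension 16.
n_attr, n_obj, dim = 3, 4, 16

# Synthetic stand-in for encoder outputs of composite concepts
# ("attribute object" pairs): an additive part plus small noise.
base = rng.normal(size=dim)
attr_dirs = rng.normal(size=(n_attr, dim))
obj_dirs = rng.normal(size=(n_obj, dim))
noise = 0.05 * rng.normal(size=(n_attr, n_obj, dim))
emb = base + attr_dirs[:, None, :] + obj_dirs[None, :, :] + noise

# Estimate one vector per factor value by averaging over the other factor
# and subtracting the global mean (uniform weights are an assumption here).
u0 = emb.mean(axis=(0, 1))       # shared offset, shape (dim,)
u_attr = emb.mean(axis=1) - u0   # one vector per attribute, shape (n_attr, dim)
u_obj = emb.mean(axis=0) - u0    # one vector per object, shape (n_obj, dim)

# Additive reconstruction: emb[i, j] ~ u0 + u_attr[i] + u_obj[j].
approx = u0 + u_attr[:, None, :] + u_obj[None, :, :]

# Relative error of the linear approximation over all 12 composite concepts.
rel_err = np.linalg.norm(emb - approx) / np.linalg.norm(emb)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In this sketch, 12 composite embeddings are described by 1 + 3 + 4 vectors. With real encoder outputs, the size of the reconstruction error would indicate how closely the embeddings follow such an additive structure.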
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
