Parts of Speech-Grounded Subspaces in Vision-Language Models
James Oldfield, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Ioannis Patras

TL;DR
This paper introduces a method to disentangle visual attributes in vision-language models like CLIP by associating parts of speech with specific visual variations, improving interpretability and control.
Contribution
The paper proposes a component analysis model that learns subspaces aligned with parts of speech, enabling disentangled and interpretable visual representations in CLIP.
Findings
Successfully separates visual attributes related to parts of speech
Enables removal of specific visual themes from generated images
Improves zero-shot classification performance
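As an illustration of the second finding, one common way to remove a theme from a representation is to project it onto the orthogonal complement of the subspace that encodes that theme. The summary here does not state the exact mechanism the authors use, so the sketch below is an assumption: `remove_subspace`, the random basis `W`, and the toy embeddings `z` are all hypothetical stand-ins, not the paper's code or real CLIP outputs.

```python
import numpy as np

def remove_subspace(z, W):
    """Project rows of z onto the orthogonal complement of span(W).

    z: (n, d) array of embeddings (hypothetical stand-ins for CLIP vectors).
    W: (d, k) matrix with orthonormal columns spanning the theme to remove.
    """
    # Subtract each embedding's component that lies inside the subspace.
    return z - (z @ W) @ W.T

rng = np.random.default_rng(0)

# Assumed toy setup: a random 2-dimensional subspace of an 8-dimensional space.
# QR gives an orthonormal basis for it.
W, _ = np.linalg.qr(rng.normal(size=(8, 2)))
z = rng.normal(size=(5, 8))

z_clean = remove_subspace(z, W)

# Largest remaining coordinate along the removed directions (numerically zero).
residual = np.abs(z_clean @ W).max()
```

Because the operation is a projection, applying it a second time leaves `z_clean` unchanged, which is a quick sanity check that `W` is orthonormal.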
Abstract
Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest.…
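The abstract describes the component analysis model only in words, so the following is a minimal sketch of one way such an objective could be set up, not the authors' formulation. It assumes the subspace is an orthonormal basis W that maximises tr(W^T (C_target - lam * C_rest) W), where C_target and C_rest are second-moment matrices of embeddings for the chosen part of speech and for the remaining ones; under that assumption the solution is the top-k eigenvectors of the difference matrix. The function name `pos_subspace`, the weight `lam`, and the synthetic "noun" and "adjective" embeddings are all hypothetical.

```python
import numpy as np

def pos_subspace(target, rest, k, lam=1.0):
    """Return a (d, k) orthonormal basis favouring `target` variability.

    target: (n, d) embeddings for one part of speech (e.g. nouns).
    rest:   (m, d) embeddings for the other parts of speech.
    lam:    assumed trade-off weight penalising variability of `rest`.
    """
    c_target = target.T @ target / len(target)
    c_rest = rest.T @ rest / len(rest)
    # Symmetric matrix, so eigh applies; eigenvalues come back ascending.
    _, eigvecs = np.linalg.eigh(c_target - lam * c_rest)
    # Keep the k eigenvectors with the largest eigenvalues.
    return eigvecs[:, ::-1][:, :k]

rng = np.random.default_rng(0)
d = 8

# Synthetic data: "nouns" vary mostly along axes 0-1, "adjectives" along 2-3.
noun_scale = np.array([3, 3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
adj_scale = np.array([0.1, 0.1, 3, 3, 0.1, 0.1, 0.1, 0.1])
nouns = rng.normal(size=(200, d)) * noun_scale
adjs = rng.normal(size=(200, d)) * adj_scale

W = pos_subspace(nouns, adjs, k=2)

# Fraction of each group's energy captured by the learned subspace.
noun_energy = np.sum((nouns @ W) ** 2) / np.sum(nouns ** 2)
adj_energy = np.sum((adjs @ W) ** 2) / np.sum(adjs ** 2)
```

On this toy data the learned subspace should retain most of the "noun" energy and very little of the "adjective" energy, which is the behaviour the abstract describes: high variability for one part of speech and low variability for the rest.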
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
