Parts of Speech-Grounded Subspaces in Vision-Language Models

James Oldfield; Christos Tzelepis; Yannis Panagakis; Mihalis A. Nicolaou; Ioannis Patras

arXiv:2305.14053 · cs.CV · November 14, 2023 · 2 cites

TL;DR

This paper introduces a method to disentangle visual attributes in vision-language models like CLIP by associating parts of speech with specific visual variations, improving interpretability and control.

Contribution

The paper proposes a component analysis model that learns subspaces aligned with parts of speech, enabling disentangled and interpretable visual representations in CLIP.

Findings

01. Successfully separates visual attributes related to parts of speech

02. Enables removal of specific visual themes from generated images

03. Improves zero-shot classification performance

Abstract

Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest.…
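The component analysis idea in the abstract can be illustrated with a small NumPy sketch. This is one possible reading, not the authors' implementation: it assumes the subspace is taken from the leading eigenvectors of a difference of second-moment matrices. The function names `pos_subspace` and `remove_subspace`, the weight `lam`, and the toy data are all hypothetical, and random vectors stand in for CLIP embeddings.

```python
import numpy as np


def pos_subspace(target, others, k, lam=1.0):
    """Return a (d, k) orthonormal basis favouring `target` over `others`.

    target: (n, d) array of embeddings for one part of speech.
    others: list of (m_j, d) arrays for the remaining parts of speech.
    lam:    hypothetical weight on the variability to be suppressed.
    """
    # Uncentred second-moment matrices, each normalised by its sample count.
    c_target = target.T @ target / len(target)
    c_rest = sum(x.T @ x / len(x) for x in others)
    # Symmetric matrix whose top eigenvectors trade off the two terms.
    evals, evecs = np.linalg.eigh(c_target - lam * c_rest)
    # Keep the eigenvectors belonging to the k largest eigenvalues.
    return evecs[:, np.argsort(evals)[::-1][:k]]


def remove_subspace(z, w):
    """Project rows of z onto the orthogonal complement of span(w)."""
    return z - (z @ w) @ w.T


# Toy data standing in for CLIP text embeddings (an assumption, not real data):
# "nouns" vary mostly along axes 0-1, "adjectives" mostly along axes 4-5.
rng = np.random.default_rng(0)
d = 6
nouns = 0.01 * rng.normal(size=(200, d))
nouns[:, :2] += 3.0 * rng.normal(size=(200, 2))
adjectives = 0.01 * rng.normal(size=(200, d))
adjectives[:, 4:] += 3.0 * rng.normal(size=(200, 2))

W = pos_subspace(nouns, [adjectives], k=2)
noun_energy = np.sum((nouns @ W) ** 2)       # energy the subspace keeps
adj_energy = np.sum((adjectives @ W) ** 2)   # energy it should suppress
cleaned = remove_subspace(nouns, W)          # noun component projected out
```

In this toy setting the learned basis keeps most of the "noun" energy and very little of the "adjective" energy, and projecting onto its orthogonal complement is one simple way to picture removing a visual theme from a representation.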

Code & Models

Videos

Parts of Speech-Grounded Subspaces in Vision-Language Models (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training