Describing Sets of Images with Textual-PCA
Oded Hupert, Idan Schwartz, Lior Wolf

TL;DR
This paper introduces a method called Textual-PCA that semantically describes image sets by generating phrases that capture both the common attributes and variations within the set, using pretrained vision-language models.
Contribution
It proposes a novel approach replacing PCA projection vectors with generated phrases to describe image sets semantically, capturing both central themes and variations.
Findings
Effectively captures the essence of image sets.
Generates meaningful descriptions of individual images.
Relies on pretrained vision-language models for both similarity computation and phrase generation.
Abstract
We seek to semantically describe a set of images, capturing both the attributes of single images and the variations within the set. Our procedure is analogous to Principal Component Analysis, in which the role of projection vectors is replaced with generated phrases. First, a centroid phrase that has the largest average semantic similarity to the images in the set is generated, where both the computation of the similarity and the generation are based on pretrained vision-language models. Then, the phrase that generates the highest variation among the similarity scores is generated, using the same models. The next phrase maximizes the variance subject to being orthogonal, in the latent space, to the highest-variance phrase, and the process continues. Our experiments show that our method is able to convincingly capture the essence of image sets and describe the individual elements in a…
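The selection order described in the abstract (centroid phrase first, then successive maximum-variance phrases under an orthogonality constraint) can be illustrated with a toy sketch. This is not the authors' implementation: the paper generates phrases with pretrained vision-language models, whereas the sketch below assumes a fixed pool of candidate phrases with precomputed embeddings in a shared latent space. The function name `textual_pca_sketch`, the candidate pool, the cosine similarity, and the projection-based handling of orthogonality are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def remove_components(v, basis):
    """Project v onto the orthogonal complement of an orthonormal basis."""
    for b in basis:
        c = dot(v, b)
        v = [a - c * bb for a, bb in zip(v, b)]
    return v

def textual_pca_sketch(image_embs, phrase_embs, n_components=2, eps=1e-8):
    """Hypothetical helper: pick a centroid phrase and variance-maximizing
    phrases from a fixed candidate pool (an assumption; the paper generates
    phrases instead of selecting them)."""
    images = [normalize(e) for e in image_embs]
    phrases = {p: normalize(e) for p, e in phrase_embs.items()}

    # Centroid phrase: largest average cosine similarity to the image set.
    centroid = max(
        phrases,
        key=lambda p: sum(dot(phrases[p], x) for x in images) / len(images),
    )

    basis, components = [], []
    for _ in range(n_components):
        best, best_var, best_dir = None, -1.0, None
        for p, e in phrases.items():
            if p in components:
                continue
            # Assumed orthogonality handling: strip the directions of the
            # phrases chosen so far, then score what remains.
            r = remove_components(e, basis)
            if math.sqrt(dot(r, r)) < eps:
                continue
            d = normalize(r)
            var = variance([dot(d, x) for x in images])
            if var > best_var:
                best, best_var, best_dir = p, var, d
        if best is None:
            break
        components.append(best)
        basis.append(best_dir)
    return centroid, components

# Toy 3-D "embeddings" (made up for illustration, not real model outputs).
toy_images = [
    [1.0, 0.9, 0.2],
    [1.0, -0.9, 0.2],
    [1.0, 0.9, -0.2],
    [1.0, -0.9, -0.2],
]
toy_phrases = {
    "a dog": [1.0, 0.0, 0.0],
    "dark fur": [0.0, 1.0, 0.0],
    "outdoors": [0.0, 0.0, 1.0],
    "a cat": [-1.0, 0.0, 0.0],
}
print(textual_pca_sketch(toy_images, toy_phrases))
```

On this toy data the centroid phrase is "a dog", because every image shares that direction, and the two variance-ranked phrases are "dark fur" followed by "outdoors", because the images differ most along the second axis and less along the third. With real embeddings the candidate phrases would rarely be exactly orthogonal, which is why the sketch projects out previously chosen directions before scoring.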
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
