Visual Classification via Description from Large Language Models
Sachit Menon, Carl Vondrick

TL;DR
This paper introduces a classification framework using vision-language models that relies on descriptive features rather than broad categories, enhancing interpretability, adaptability, and bias mitigation in image recognition tasks.
Contribution
The paper proposes a novel classification method based on descriptive features obtained via large language models, improving interpretability and adaptability of vision-language models.
Findings
Improved accuracy on ImageNet across distribution shifts.
Enhanced ability to recognize unseen concepts.
Effective bias mitigation through descriptor editing.
Abstract
Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the…
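The scoring idea in the abstract can be illustrated with a minimal sketch. It assumes the image and each descriptor have already been embedded by a VLM such as CLIP; the vectors below are toy values, not real model outputs. Averaging the per-descriptor similarities into a category score is an assumption made for illustration, and the function name `classify_by_description` is hypothetical.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def classify_by_description(image_emb, descriptor_embs):
    """Score each category by the mean similarity of its descriptors to the image.

    image_emb: 1-D array, the (assumed precomputed) image embedding.
    descriptor_embs: dict mapping category name -> 2-D array with one
        (assumed precomputed) text embedding per descriptor.
    Returns the best category, all category scores, and the per-descriptor
    similarities, which show which descriptors drove the decision.
    """
    image_emb = normalize(np.asarray(image_emb, dtype=float))
    scores, evidence = {}, {}
    for category, embs in descriptor_embs.items():
        sims = normalize(np.asarray(embs, dtype=float)) @ image_emb
        evidence[category] = sims
        scores[category] = float(sims.mean())  # assumption: mean aggregation
    best = max(scores, key=scores.get)
    return best, scores, evidence

# Toy embeddings standing in for real VLM outputs (illustrative only).
image = np.array([1.0, 0.2, 0.0])
descriptors = {
    "tiger": np.array([[1.0, 0.0, 0.0], [0.9, 0.3, 0.0]]),  # e.g. "stripes", "claws"
    "zebra": np.array([[0.0, 1.0, 0.0], [0.0, 0.8, 0.6]]),
}
prediction, scores, evidence = classify_by_description(image, descriptors)
```

Because the per-descriptor similarities are returned alongside the prediction, a user can inspect which descriptors supported the decision, and removing or replacing an entry in the descriptor dictionary changes the criteria used without retraining anything.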
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
