Visual Classification via Description from Large Language Models

Sachit Menon; Carl Vondrick

arXiv:2210.07183 · cs.CV · December 2, 2022 · 57 cites

TL;DR

This paper introduces a classification framework for vision-language models that checks for descriptive features rather than broad category names, improving interpretability and adaptability and enabling bias mitigation in image recognition tasks.

Contribution

The paper proposes a novel classification method based on descriptive features obtained via large language models, improving interpretability and adaptability of vision-language models.

Findings

01. Improved accuracy on ImageNet across distribution shifts.

02. Enhanced ability to recognize unseen concepts.

03. Effective bias mitigation through descriptor editing.

Abstract

Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure: computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the…
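The idea in the abstract can be illustrated with a minimal sketch. It assumes that each class score is the mean cosine similarity between the image embedding and the embeddings of that class's descriptors; the aggregation rule, the function names, and the toy 3-dimensional vectors are all illustrative assumptions rather than the authors' implementation. In practice the embeddings would come from a VLM such as CLIP and the descriptors from a large language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_by_description(image_emb, descriptor_embs):
    """Score each class by how well the image matches its descriptors.

    descriptor_embs maps class name -> {descriptor text: embedding}.
    Returns the predicted class, the per-class scores, and the
    per-descriptor similarities that explain each score.
    """
    evidence = {
        cls: {desc: cosine(image_emb, emb) for desc, emb in descs.items()}
        for cls, descs in descriptor_embs.items()
    }
    # Assumption: a class score is the mean of its descriptor similarities.
    scores = {cls: sum(sims.values()) / len(sims) for cls, sims in evidence.items()}
    best = max(scores, key=scores.get)
    return best, scores, evidence


# Hypothetical toy embeddings standing in for VLM outputs.
image_emb = [0.9, 0.4, 0.1]
descriptor_embs = {
    "tiger": {"stripes": [1.0, 0.2, 0.0], "claws": [0.7, 0.7, 0.1]},
    "zebra": {"stripes": [1.0, 0.2, 0.0], "hooves": [0.0, 0.1, 1.0]},
}

best, scores, evidence = classify_by_description(image_emb, descriptor_embs)
print(best)
for desc, sim in evidence[best].items():
    print(f"  {desc}: {sim:.3f}")
```

The per-descriptor similarities serve as the intermediate explanation for a decision, and removing or adding an entry in `descriptor_embs` is one way to adjust the criteria a class is judged by.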

Videos

Visual Classification via Description from Large Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training