Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition

Ethan Baron; Idan Tankel; Peter Tu; Guy Ben-Yosef

arXiv:2412.13947 · cs.CV · December 19, 2024

Open Access

TL;DR

This paper explores zero-shot classification by description using CLIP, addressing its limitations in recognizing detailed object attributes without class names, and introduces new datasets, methods, and architectural modifications to improve part-attribute detection.

Contribution

It introduces a new zero-shot classification task based on descriptions, a dataset for evaluation, and architectural enhancements to CLIP for better part-attribute recognition.

Findings

01. Improved CLIP performance on fine-grained attribute detection

02. New datasets for zero-shot description-based classification

03. Enhanced CLIP architecture with multi-resolution approach

Abstract

In this study, we define and tackle zero-shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) like CLIP to classify objects based solely on descriptive attributes, excluding object class names. This approach highlights the current limitations of VLMs in understanding intricate object descriptions, pushing these models beyond mere object recognition. To facilitate this exploration, we introduce a new challenge and release description data for six popular fine-grained benchmarks, which omit object names to encourage genuine zero-shot learning within the research community. Additionally, we propose a method to enhance CLIP's attribute detection capabilities through targeted training using ImageNet21k's diverse object categories, paired with rich attribute descriptions generated by large language models. Furthermore, we…
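To make the task in the abstract concrete, here is a minimal sketch of classification by description: an image embedding is compared against embeddings of attribute descriptions that never mention the class name, and the class whose descriptions match best is predicted. The toy vectors stand in for the outputs of CLIP's image and text encoders. The names `cosine`, `classify_by_description`, `class_a`, and `class_b` are hypothetical, and averaging the per-description similarities is an assumed aggregation rule, not necessarily the one used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_by_description(image_emb, class_descriptions):
    """Score each class by the mean similarity between the image embedding
    and that class's description embeddings (assumed aggregation rule).

    class_descriptions maps a class label to a list of embeddings, one per
    attribute description. The descriptions are assumed to omit the class name.
    Returns the best-scoring label and the full score dictionary.
    """
    scores = {}
    for label, desc_embs in class_descriptions.items():
        sims = [cosine(image_emb, d) for d in desc_embs]
        scores[label] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for encoder outputs (illustrative values only).
image_emb = [0.9, 0.1, 0.2]
class_descriptions = {
    "class_a": [[1.0, 0.0, 0.1], [0.8, 0.2, 0.3]],
    "class_b": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.0]],
}

predicted, scores = classify_by_description(image_emb, class_descriptions)
print(predicted, scores)
```

In a real setup the embeddings would come from a pretrained vision-language model rather than hand-written lists. The class labels serve only as dictionary keys for bookkeeping and never enter the text that gets embedded.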

Taxonomy

Topics: Image Processing and 3D Reconstruction

Methods: Contrastive Language-Image Pre-training