CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification

Marcos V. Conde; Kerem Turgutlu

arXiv:2204.14244·cs.CV·May 2, 2022

TL;DR

This paper introduces CLIP-Art, a novel approach leveraging CLIP's contrastive pre-training to improve fine-grained artwork classification and retrieval, especially in scenarios with limited annotated data.

Contribution

It is among the first to apply CLIP to artwork images and text, enabling zero-shot fine-grained recognition without extensive labeled datasets.

Findings

01. Achieved competitive results on the iMet Dataset using self-supervision.

02. Demonstrated effective zero-shot artwork attribute recognition.

03. Improved instance retrieval in artwork datasets.
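
As a rough illustration of what instance retrieval over a shared embedding space can look like, the sketch below ranks a small gallery by cosine similarity to a query embedding. It is not the authors' implementation: the function names (`retrieve`, `cosine_similarity`), the 3-dimensional vectors, and `top_k` are all hypothetical stand-ins for the outputs of a trained image or text encoder.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def retrieve(query, gallery, top_k=3):
    """Return (index, similarity) pairs for the top_k gallery items closest to the query."""
    scored = [(i, cosine_similarity(query, item)) for i, item in enumerate(gallery)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings (hypothetical values) standing in for encoded artworks.
gallery = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

results = retrieve(query, gallery, top_k=2)
for index, score in results:
    print(f"gallery item {index}: similarity {score:.3f}")
```

Because images and text share one embedding space in a CLIP-style model, the query in such a setup could in principle be either an encoded image or an encoded description.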

Abstract

Existing computer vision research on artwork struggles with fine-grained attribute recognition and with the lack of curated annotated datasets, which are costly to create. To the best of our knowledge, ours is one of the first methods to use CLIP (Contrastive Language-Image Pre-Training) to train a neural network on a variety of artwork image and text description pairs. CLIP is able to learn directly from free-form art descriptions or, if available, curated fine-grained labels. The model's zero-shot capability allows it to predict an accurate natural language description for a given image without directly optimizing for the task. Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset, which we consider the largest annotated artwork dataset. On this benchmark we achieved competitive results using only…
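
The contrastive objective and the zero-shot step mentioned in the abstract can be sketched in a few lines. This is a minimal, dependency-free illustration of a CLIP-style symmetric contrastive loss over a batch of image/text pairs, followed by zero-shot labelling via the best-matching text prompt. It is not the paper's training code: the function names, the fixed `temperature` value, the 2-dimensional embeddings, and the labels "portrait" and "landscape" are assumptions made for the example.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarity between every image and every text."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy terms."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

def zero_shot_label(image_emb, prompt_embs, labels):
    """Pick the label whose prompt embedding is most similar to the image embedding."""
    scores = similarity_logits([image_emb], prompt_embs)[0]
    best = max(range(len(labels)), key=lambda j: scores[j])
    return labels[best]

# Toy batch (hypothetical values): image i is paired with text i.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(image_embs, matched_text)
loss_swapped = contrastive_loss(image_embs, swapped_text)
predicted = zero_shot_label([0.9, 0.2], matched_text, ["portrait", "landscape"])
print(loss_matched < loss_swapped, predicted)
```

The loss is low when each image is most similar to its own description and high when the pairs are mismatched, which is what lets the same similarity scores be reused at inference time to choose among candidate text labels without task-specific training.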

Taxonomy

Methods: Contrastive Language-Image Pre-training