TL;DR
This paper introduces CLIP-Art, a novel approach leveraging CLIP's contrastive pre-training to improve fine-grained artwork classification and retrieval, especially in scenarios with limited annotated data.
Contribution
It is among the first to apply CLIP to artwork images and text, enabling zero-shot fine-grained recognition without extensive labeled datasets.
Findings
Achieved competitive results on the iMet Dataset using self-supervision.
Demonstrated effective zero-shot artwork attribute recognition.
Improved instance retrieval in artwork datasets.
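Instance retrieval with a CLIP-style model is commonly done by ranking gallery embeddings by cosine similarity to a query embedding. The sketch below illustrates that ranking step only. It assumes the embeddings have already been produced by some image encoder; the random vectors, the function names, and the value of `k` are stand-ins chosen for illustration, not the paper's pipeline.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def retrieve_top_k(query_emb, gallery_embs, k=3):
    """Return indices and cosine similarities of the k closest gallery items."""
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    sims = g @ q                      # cosine similarity to every gallery item
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]

# Toy data (assumption): 5 gallery "artworks" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 8))
# The query is a slightly perturbed copy of gallery item 2.
query = gallery[2] + 0.01 * rng.normal(size=8)

indices, scores = retrieve_top_k(query, gallery, k=3)
print(indices, scores)
```

In practice the gallery embeddings would be computed once and cached, so each query costs one matrix-vector product.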
Abstract
Existing computer vision research on artwork struggles with fine-grained attribute recognition and with a lack of curated annotated datasets, which are costly to create. To the best of our knowledge, ours is one of the first methods to use CLIP (Contrastive Language-Image Pre-Training) to train a neural network on a variety of artwork image and text description pairs. CLIP is able to learn directly from free-form art descriptions or, if available, curated fine-grained labels. The model's zero-shot capability allows it to predict an accurate natural-language description for a given image without being directly optimized for the task. Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset, which we consider the largest annotated artwork dataset. On this benchmark we achieved competitive results using only…
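The zero-shot recognition described in the abstract can be sketched as scoring an image embedding against the embeddings of candidate text prompts and taking a softmax over the scaled cosine similarities. This is a minimal illustration under stated assumptions: the prompt wording, the attribute names, the temperature value, and the random embeddings are all hypothetical stand-ins for the outputs of trained image and text encoders.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (CLIP-style scoring)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature
    logits = logits - logits.max()    # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical attribute prompts; a real setup would encode these with the text encoder.
prompts = [
    "an artwork made of bronze",
    "an artwork made of porcelain",
    "an artwork made of silk",
]

# Toy embeddings (assumption): the image embedding lies close to prompt 1.
rng = np.random.default_rng(1)
text_embs = rng.normal(size=(3, 16))
image_emb = text_embs[1] + 0.05 * rng.normal(size=16)

probs = zero_shot_probs(image_emb, text_embs)
best = prompts[int(np.argmax(probs))]
print(best, probs)
```

Because the candidate labels enter only as text, new attributes can be tested by adding prompts, with no retraining of a classification head.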
Taxonomy
Methods: Contrastive Language-Image Pre-training
