Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Shaunak Halbe; Junjiao Tian; K J Joseph; James Seale Smith; Katherine Stevo; Vineeth N Balasubramanian; Zsolt Kira

arXiv:2412.04429·cs.CV·December 6, 2024

TL;DR

This paper introduces GRAIN, a pretraining strategy that aligns image and text representations at multiple levels, significantly improving zero-shot visual recognition, especially for fine-grained and unseen concepts, by leveraging synthetic annotations from large language models.

Contribution

GRAIN is a novel pretraining method that jointly grounds textual descriptions in image regions and aligns global image and caption representations, enhancing zero-shot recognition capabilities.

Findings

01

Outperforms state-of-the-art on 11 image classification datasets.

02

Demonstrates strong recognition of novel concepts on Products-2023 dataset.

03

Learns higher-quality representations that improve downstream tasks such as retrieval.

Abstract

Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels…
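The zero-shot procedure the abstract describes (pick the category whose text representation is most similar to the image) and the test-time use of category descriptions can be sketched as follows. This is a minimal illustration, not GRAIN or CLIP code: the 3-dimensional embeddings are made-up stand-ins for encoder outputs, the function names are hypothetical, and averaging similarity over descriptions is one common aggregation choice assumed here for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class whose text embedding is most similar to the image."""
    scores = {name: cosine(image_emb, emb) for name, emb in class_text_embs.items()}
    return max(scores, key=scores.get), scores

def description_classify(image_emb, class_desc_embs):
    """Score each class by mean similarity over its description embeddings.

    Assumption: mean aggregation; other aggregation rules are possible.
    """
    scores = {
        name: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for name, descs in class_desc_embs.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings (hypothetical values standing in for real encoder outputs).
image = [0.9, 0.1, 0.3]
class_names = {"sparrow": [1.0, 0.0, 0.0], "finch": [0.8, 0.6, 0.0]}
class_descriptions = {
    "sparrow": [[1.0, 0.0, 0.0], [0.9, 0.0, 0.4]],
    "finch": [[0.8, 0.6, 0.0], [0.0, 1.0, 0.0]],
}

name_pred, name_scores = zero_shot_classify(image, class_names)
desc_pred, desc_scores = description_classify(image, class_descriptions)
print(name_pred, desc_pred)
```

In a real pipeline the vectors would come from the image and text encoders of a pretrained VLM. The abstract's argument is that description-based scoring only helps when those encoders were trained so that descriptions and image content are well aligned.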

Code & Models

Repositories

shaunak27/grain-clip
PyTorch · Official

Taxonomy

Topics: AI in cancer detection · COVID-19 diagnosis using AI · Medical Imaging Techniques and Applications

Methods: Contrastive Language-Image Pre-training