Grounding Descriptions in Images informs Zero-Shot Visual Recognition
Shaunak Halbe, Junjiao Tian, K J Joseph, James Seale Smith, Katherine Stevo, Vineeth N Balasubramanian, Zsolt Kira

TL;DR
This paper introduces GRAIN, a pretraining strategy that aligns image and text representations at multiple levels, significantly improving zero-shot visual recognition, especially for fine-grained and unseen concepts, by leveraging synthetic annotations from large language models.
Contribution
GRAIN is a novel pretraining method that jointly grounds textual descriptions in image regions and aligns global image and caption representations, enhancing zero-shot recognition capabilities.
Findings
Outperforms state-of-the-art methods on 11 image classification datasets.
Demonstrates strong recognition of novel concepts on the Products-2023 dataset.
Learns higher-quality representations that also improve downstream tasks such as retrieval.
Abstract
Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels…
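The zero-shot procedure the abstract describes (pick the category whose text representation is most similar to the query image) can be sketched in a few lines. This is a minimal illustration, not CLIP's or GRAIN's actual code: the embeddings are made-up 3-d toy vectors rather than encoder outputs, `zero_shot_classify` is a hypothetical helper name, and averaging the similarity over several per-class descriptions is one assumed way of integrating category descriptions at test time.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, class_text_embs):
    """Return (best_class, scores) for one image embedding.

    class_text_embs maps a class name to a list of text embeddings:
    a single entry for a plain class prompt, or several entries when
    the class is represented by multiple descriptions. Averaging the
    similarities is an assumption made for this sketch.
    """
    scores = {}
    for name, embs in class_text_embs.items():
        scores[name] = sum(cosine(image_emb, e) for e in embs) / len(embs)
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings, invented for illustration only.
image_emb = [0.9, 0.1, 0.2]
class_text_embs = {
    "sparrow": [[1.0, 0.0, 0.1], [0.8, 0.2, 0.3]],  # two descriptions
    "airplane": [[0.0, 1.0, 0.0]],                   # one class prompt
}

prediction, scores = zero_shot_classify(image_emb, class_text_embs)
print(prediction)  # sparrow
```

In a real pipeline the vectors would come from the model's image and text encoders. The sketch only shows the selection rule, which works well only when image and description embeddings are well aligned, the property the abstract says GRAIN's pretraining targets.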
Taxonomy
Topics: AI in cancer detection · COVID-19 diagnosis using AI · Medical Imaging Techniques and Applications
Methods: Contrastive Language-Image Pre-training
