Improving Zero-shot Generalization and Robustness of Multi-modal Models

Yunhao Ge; Jie Ren; Andrew Gallagher; Yuxiao Wang; Ming-Hsuan Yang; Hartwig Adam; Laurent Itti; Balaji Lakshminarayanan; Jiaping Zhao

arXiv:2212.01758·cs.CV·May 26, 2023

TL;DR

This paper traces the low top-1 accuracy of multi-modal models such as CLIP to ambiguity in the text prompts, proposes a post-hoc method for detecting uncertain predictions, and improves accuracy on those predictions by leveraging the WordNet hierarchy, all without extra training.

Contribution

The paper introduces a simple, scalable method to identify uncertain predictions and improves accuracy on them by incorporating semantic hierarchy into prompts, addressing the gap between top-1 and top-5 accuracy in zero-shot models.
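As a rough illustration of what "incorporating semantic hierarchy into prompts" could look like, the sketch below enriches a class name with its parent and children before building prompts. The toy hierarchy, the prompt template, and the name `hierarchy_prompts` are assumptions made for this example; they stand in for WordNet and are not taken from the authors' code.

```python
# Hypothetical toy hierarchy standing in for WordNet:
# class name -> (parent name, list of child names).
TOY_HIERARCHY = {
    "tusker": ("elephant", []),
    "boxer": ("dog", []),
    "crane": ("wading bird", ["whooping crane"]),
}


def hierarchy_prompts(class_name, hierarchy=TOY_HIERARCHY):
    """Return text prompts for class_name, enriched with parent and child names."""
    parent, children = hierarchy.get(class_name, (None, []))
    # Plain prompt with no hierarchy information (assumed template).
    prompts = [f"a photo of a {class_name}"]
    if parent is not None:
        # Disambiguate the class by naming its parent concept.
        prompts.append(f"a photo of a {class_name}, which is a kind of {parent}")
    for child in children:
        # More specific child concepts also count toward the class.
        prompts.append(f"a photo of a {child}, which is a kind of {class_name}")
    return prompts
```

An ambiguous name such as "boxer" gains a prompt that mentions "dog", which is the kind of disambiguation the hierarchy is meant to supply. Classes missing from the hierarchy fall back to the plain prompt.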

Findings

01 Improved top-1 accuracy by 17.13% on uncertain images

02 Enhanced overall accuracy by 3.6% on the ImageNet validation set

03 The method outperforms the max logit baseline in selective prediction
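Selective prediction, mentioned in the last finding, can be sketched as follows: rank examples by a confidence score, keep only the most confident fraction, and measure accuracy on what is kept. The max logit baseline uses the largest logit as that score. The function names and the `coverage` parameter are illustrative assumptions, and the paper's exact evaluation protocol may differ.

```python
def max_logit_confidence(logits):
    """Max logit baseline: confidence is the largest logit of each example."""
    return [max(row) for row in logits]


def selective_accuracy(confidences, correct, coverage):
    """Accuracy on the `coverage` fraction of examples with highest confidence.

    confidences: one confidence score per example (higher = more confident).
    correct:     one 0/1 flag per example, 1 if the top-1 prediction is right.
    coverage:    fraction of examples to keep, in (0, 1].
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, round(coverage * len(confidences)))
    # Sort example indices from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    kept = order[:n_keep]
    return sum(correct[i] for i in kept) / n_keep
```

Under this setup, one confidence score is better than another if it gives higher accuracy at the same coverage, because it is more successful at pushing mistakes into the rejected portion.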

Abstract

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such…
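The consistency check described in the abstract can be pictured with the minimal sketch below. It assumes the model has already been run on several variants of one image (different prompts, different image transformations) and flags the image as uncertain when the variants disagree on the top-1 class. Requiring full agreement is an assumption of this example, as are the function names; the abstract does not state the exact criterion.

```python
def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def is_uncertain(scores_per_variant):
    """Flag an image whose top-1 prediction changes across variants.

    scores_per_variant: list of per-class score lists, one list per
    (prompt, image transformation) variant of the same image.
    """
    top1_predictions = {argmax(scores) for scores in scores_per_variant}
    # More than one distinct top-1 class means the variants disagree.
    return len(top1_predictions) > 1
```

Because the check only compares predictions from the existing model, it needs no additional training, which matches the post-hoc, zero-shot framing in the abstract.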

Code & Models

Repositories

gyhandy/hierarchy-clip · JAX · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training