Improving Zero-shot Generalization and Robustness of Multi-modal Models
Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao

TL;DR
This paper investigates why multi-modal models like CLIP have low top-1 accuracy, proposes a post-hoc method to detect uncertain predictions, and uses the WordNet hierarchy to correct them, improving top-1 accuracy by 17.13% on uncertain images and by 3.6% overall on the ImageNet validation set without extra training.
Contribution
It introduces a simple, scalable method to identify uncertain predictions and improves accuracy by incorporating semantic hierarchy into prompts, addressing the top-1 gap in zero-shot models.
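As an illustration of what "incorporating semantic hierarchy into prompts" could look like, here is a minimal sketch. It assumes a toy parent/child lookup in place of real WordNet queries, and the prompt wording ("which is a kind of") is an assumption made for this example, not necessarily the template the authors use. The names `TOY_PARENTS`, `TOY_CHILDREN`, and `hierarchy_prompts` are hypothetical.

```python
# Toy hierarchy: a hypothetical stand-in for WordNet parent/child lookups.
TOY_PARENTS = {"tabby": "cat", "tiger cat": "cat", "boxer": "dog"}
TOY_CHILDREN = {"cat": ["tabby", "tiger cat"], "dog": ["boxer"]}


def hierarchy_prompts(class_name, parents, children, template="a photo of a {}"):
    """Build text prompts for a class, enriched with its parent and children.

    The first prompt is the plain class name; the rest add hierarchy context.
    The phrasing of the enriched prompts is an assumption for illustration.
    """
    prompts = [template.format(class_name)]
    parent = parents.get(class_name)
    if parent is not None:
        # Disambiguate the class by naming its parent concept.
        prompts.append(template.format(f"{class_name}, which is a kind of {parent}"))
    for child in children.get(class_name, []):
        # Cover more specific variants of the class.
        prompts.append(template.format(f"{child}, which is a kind of {class_name}"))
    return prompts


tabby_prompts = hierarchy_prompts("tabby", TOY_PARENTS, TOY_CHILDREN)
cat_prompts = hierarchy_prompts("cat", TOY_PARENTS, TOY_CHILDREN)
```

In a zero-shot classifier, the text embeddings of such prompts would typically be combined per class before comparing against the image embedding; how they are combined is left open here.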
Findings
Improved top-1 accuracy by 17.13% on uncertain images
Enhanced overall accuracy by 3.6% on ImageNet validation set
Method outperforms max logit baseline in selective prediction
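For readers unfamiliar with selective prediction, the evaluation can be sketched as follows: rank predictions by a confidence score, keep only the most confident fraction (the coverage), and measure accuracy on what is kept. The max logit baseline uses the largest logit as that score. The function names and the toy numbers below are assumptions for illustration and are not taken from the paper.

```python
def max_logit_confidence(logits):
    """Max logit baseline: the confidence is simply the largest logit."""
    return max(logits)


def selective_accuracy(confidences, correct, coverage):
    """Accuracy on the most confident `coverage` fraction of examples.

    confidences: one score per example (higher means more confident).
    correct: one bool per example, True if the top-1 prediction was right.
    coverage: fraction of examples to keep, in (0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one example")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Sort example indices from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    keep = max(1, int(coverage * len(order)))
    kept = order[:keep]
    return sum(1 for i in kept if correct[i]) / keep


# Toy data: four examples with made-up logits and correctness flags.
toy_logits = [[2.0, 0.1], [0.3, 0.2], [1.5, 0.4], [0.6, 0.5]]
toy_correct = [True, False, True, False]
toy_conf = [max_logit_confidence(row) for row in toy_logits]
acc_half = selective_accuracy(toy_conf, toy_correct, coverage=0.5)
acc_full = selective_accuracy(toy_conf, toy_correct, coverage=1.0)
```

A better confidence score is one that yields higher accuracy at the same coverage, which is the sense in which one scoring method can outperform another on this task.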
Abstract
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such…
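The first step described in the abstract, measuring consistency of predictions across multiple prompts and image transformations, can be sketched roughly as below. This is a minimal illustration, not the authors' scoring rule: it assumes the top-1 label has already been computed for each (prompt set, image transform) variant, and it flags an image as uncertain when the variants do not agree often enough. The function names and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def consistency_check(predictions):
    """Return the majority label and the fraction of variants that agree with it.

    predictions: list of top-1 labels, one per (prompt set, image transform)
    variant of the same image.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


def is_uncertain(predictions, min_agreement=1.0):
    """Flag the image as uncertain if agreement falls below the threshold.

    The default of 1.0 (all variants must agree) is an assumption chosen
    for this sketch.
    """
    _, agreement = consistency_check(predictions)
    return agreement < min_agreement


# Toy example: four variants of one image, three of which agree.
label, agreement = consistency_check(["cat", "cat", "dog", "cat"])
```

Because the check only needs predictions the model can already produce, it requires no extra training, which matches the post-hoc, zero-shot framing in the abstract.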
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
