Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification
Siqi Yin, Lifan Jiang

TL;DR
This paper presents a multi-method zero-shot image classification framework that leverages ChatGPT, DALL-E, CLIP, and DINO with confidence-based weighting to improve accuracy on standard datasets.
Contribution
It introduces a novel integration framework combining multiple models and alignment strategies with adaptive confidence weighting for enhanced zero-shot learning performance.
Findings
Achieves over 96% AUROC on CIFAR-10, CIFAR-100, and TinyImageNet.
Surpasses 99% AUROC on CIFAR-10.
Significantly outperforms single-model approaches.
Abstract
This paper introduces a novel framework for zero-shot learning (ZSL), i.e., recognizing new categories that are unseen during training, using a multi-model and multi-alignment integration method. Specifically, we propose three strategies to improve the model's ZSL performance: 1) Utilizing the extensive knowledge of ChatGPT and the powerful image generation capabilities of DALL-E to create reference images that precisely describe unseen categories and classification boundaries, thereby alleviating the information bottleneck issue; 2) Integrating the text-image and image-image alignment results from CLIP with the image-image alignment results from DINO to achieve more accurate predictions; 3) Introducing an adaptive weighting mechanism based on confidence levels to aggregate the outcomes of the different prediction methods. Experimental results on…
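The third strategy, confidence-based aggregation, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how confidence is measured, so the code below assumes a method's confidence is its top softmax probability and that weights are the normalized confidences. The function names and the similarity scores are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Convert raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_fusion(method_scores, temperature=1.0):
    """Fuse per-class scores from several prediction methods.

    ASSUMPTION: a method's confidence is its highest softmax probability,
    and fusion weights are the confidences normalized to sum to 1.
    The paper's actual weighting rule may differ.
    """
    probs = [softmax(scores, temperature) for scores in method_scores]
    confidences = [max(p) for p in probs]
    total_conf = sum(confidences)
    weights = [c / total_conf for c in confidences]
    n_classes = len(probs[0])
    fused = [
        sum(w * p[k] for w, p in zip(weights, probs))
        for k in range(n_classes)
    ]
    return fused, weights


# Hypothetical per-class similarity scores for one test image, 3 classes.
clip_text_image = [4.0, 1.0, 0.5]   # CLIP text-image alignment
clip_image_image = [2.0, 1.8, 1.5]  # CLIP image-image alignment
dino_image_image = [3.0, 0.5, 2.5]  # DINO image-image alignment

fused, weights = confidence_weighted_fusion(
    [clip_text_image, clip_image_image, dino_image_image]
)
predicted = max(range(len(fused)), key=fused.__getitem__)
print("weights:", [round(w, 3) for w in weights])
print("predicted class index:", predicted)
```

Under these assumptions, a method with a sharply peaked distribution (here the CLIP text-image scores) contributes more to the fused prediction than one whose scores are nearly flat.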
Taxonomy
Topics: Image Processing Techniques and Applications · Advanced Image Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Dense Connections · Residual Connection · Softmax · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training
