CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification
Qijie Wang, Guandu Liu, Bin Wang

TL;DR
CapS-Adapter introduces a caption-based multimodal support set method that leverages image and caption features to significantly improve zero-shot classification accuracy across diverse datasets without additional training.
Contribution
This work presents CapS-Adapter, a zero-shot classification approach that uses caption-based support sets to generalize better and classify more accurately than existing training-free methods.
Findings
Achieves 2.19% higher accuracy than previous state-of-the-art methods.
Demonstrates robust generalization across 19 benchmark datasets.
Effectively utilizes multimodal large models for support set construction.
Abstract
Recent advances in vision-language foundational models, such as CLIP, have demonstrated significant strides in zero-shot classification. However, the extensive parameterization of models like CLIP necessitates a resource-intensive fine-tuning process. In response, TIP-Adapter and SuS-X have introduced training-free methods aimed at bolstering the efficacy of downstream tasks. While these approaches incorporate support sets to maintain data distribution consistency between knowledge cache and test sets, they often fall short in terms of generalization on the test set, particularly when faced with test data exhibiting substantial distributional variations. In this work, we present CapS-Adapter, an innovative method that employs a caption-based support set, effectively harnessing both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios.…
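To make the "support set as knowledge cache" idea concrete, here is a minimal sketch of training-free, cache-based classification. It uses the affinity form popularized by TIP-Adapter, exp(-beta * (1 - cosine similarity)), and simply averages an image-feature cache with a caption-feature cache. The function names, the `mix` weight, and the hyperparameter values are assumptions made for illustration. This is not CapS-Adapter's published formulation, and random vectors stand in for CLIP features.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def cache_logits(query, keys, onehot_labels, beta):
    """Aggregate support-set labels, weighted by similarity to each cached key."""
    # TIP-Adapter-style affinity: 1 for an identical key, decaying with distance.
    affinity = np.exp(-beta * (1.0 - query @ keys.T))
    return affinity @ onehot_labels


def training_free_logits(query, text_weights, image_keys, caption_keys,
                         onehot_labels, alpha=1.0, beta=5.5, mix=0.5):
    """Zero-shot logits plus a residual term from a multimodal support cache.

    `mix` (hypothetical) balances the image cache against the caption cache.
    """
    zero_shot = 100.0 * query @ text_weights.T
    image_term = cache_logits(query, image_keys, onehot_labels, beta)
    caption_term = cache_logits(query, caption_keys, onehot_labels, beta)
    return zero_shot + alpha * (mix * image_term + (1.0 - mix) * caption_term)


# Toy usage: random unit vectors stand in for CLIP image/text embeddings.
rng = np.random.default_rng(0)
num_classes, per_class, dim = 3, 2, 8
text_weights = l2_normalize(rng.normal(size=(num_classes, dim)))
image_keys = l2_normalize(rng.normal(size=(num_classes * per_class, dim)))
caption_keys = l2_normalize(rng.normal(size=(num_classes * per_class, dim)))
onehot_labels = np.repeat(np.eye(num_classes), per_class, axis=0)
queries = l2_normalize(rng.normal(size=(4, dim)))

logits = training_free_logits(queries, text_weights, image_keys,
                              caption_keys, onehot_labels)
predictions = logits.argmax(axis=1)
```

No parameters are learned here: the support-set features and their labels are stored once, and `alpha` and `beta` only control how strongly the cache adjusts the zero-shot prediction.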
Taxonomy
Topics: Speech Recognition and Synthesis · Anomaly Detection Techniques and Applications · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
