Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score
Eman Ali, Sathira Silva, Chetan Arora, Muhammad Haris Khan

TL;DR
This paper introduces FAIR, a novel method for fine-grained unsupervised adaptation of CLIP that dynamically aligns image features with text descriptions, improving pseudo-label accuracy and overall performance.
Contribution
FAIR presents a new adaptive alignment score and interaction refinement technique for better fine-grained adaptation of vision-language models.
Findings
Achieves an overall gain of 2.78% over SOTA methods across 13 datasets.
Improves pseudo-label quality through dynamic cross-modal interactions.
Enhances fine-grained classification accuracy in unsupervised settings.
Abstract
Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs either rely on fixed alignment scores that cannot capture evolving, subtle class distinctions or use computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an innovative approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the…
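The abstract describes scoring localized image features against class description anchors to obtain pseudo-labels. A minimal NumPy sketch of that general idea follows. It is not the authors' implementation: the function name `alignment_pseudo_labels`, the softmax-weighted pooling over patches, the use of a single anchor per class, and the temperature of 0.07 are all assumptions made for illustration, since this summary does not specify how FAIR computes or trains its alignment score.

```python
import numpy as np


def _softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def _l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_pseudo_labels(patch_feats, anchors, temperature=0.07):
    """Illustrative pseudo-labeling from patch-to-anchor similarities.

    patch_feats: (N, P, D) array of localized image features
                 (hypothetical patch tokens from an image encoder).
    anchors:     (C, D) array, one text embedding per class
                 (a hypothetical stand-in for class description anchors).
    Returns (labels, probs) with shapes (N,) and (N, C).
    """
    patches = _l2_normalize(patch_feats)
    anchors = _l2_normalize(anchors)

    # Cosine similarity of every patch with every class anchor: (N, P, C).
    sims = patches @ anchors.T

    # Assumed pooling: for each class, weight patches by how well they match
    # that class, so the most relevant regions dominate the class score.
    weights = _softmax(sims / temperature, axis=1)
    scores = (weights * sims).sum(axis=1)  # (N, C)

    probs = _softmax(scores / temperature, axis=1)
    return probs.argmax(axis=1), probs
```

In an unsupervised adaptation loop, labels of this kind would typically serve as training targets for the unlabeled images. How FAIR refines the score during training is not covered by this sketch.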
Taxonomy
Topics: Subtitles and Audiovisual Media · Cancer-related molecular mechanisms research
