MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning
Yue Wang, Shuai Xu, Xuelin Zhu, Yicong Li

TL;DR
This paper introduces MSCI, a multi-stage model that enhances CLIP's ability to recognize unseen combinations by leveraging intermediate visual features and adaptive attention mechanisms, improving fine-grained perception in compositional zero-shot learning.
Contribution
MSCI is the first to utilize intermediate-layer information from CLIP's visual encoder with adaptive aggregation and interaction mechanisms for improved CZSL performance.
Findings
MSCI outperforms existing methods on three datasets.
The model effectively captures fine-grained local features.
Adaptive attention improves recognition accuracy.
Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies rely primarily on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architecture and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception…
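To make the multi-stage idea more concrete, here is a minimal NumPy sketch of how text embeddings might first attend to aggregated low-level features and then to aggregated high-level features. It is not the authors' implementation. The softmax layer weighting in `aggregate`, the single-head residual attention in `cross_attend`, the split into the first and last four layers, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(layer_feats, layer_logits):
    """Hypothetical aggregator: fuse several layers with softmax weights."""
    weights = softmax(layer_logits)                 # (num_layers,)
    stacked = np.stack(layer_feats, axis=0)         # (num_layers, num_patches, dim)
    return np.tensordot(weights, stacked, axes=1)   # (num_patches, dim)

def cross_attend(text, visual):
    """Residual single-head cross-attention: text queries attend to visual patches."""
    dim = text.shape[-1]
    attn = softmax(text @ visual.T / np.sqrt(dim), axis=-1)  # (num_classes, num_patches)
    return text + attn @ visual

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Random stand-ins for per-layer patch features and state-object text embeddings.
rng = np.random.default_rng(0)
num_layers, num_patches, dim, num_classes = 12, 16, 8, 5
layers = [rng.standard_normal((num_patches, dim)) for _ in range(num_layers)]
text = rng.standard_normal((num_classes, dim))

# Assumed split: first four layers are "low-level", last four are "high-level".
# Zero logits give uniform weights; in a trained model they would be learned.
low = aggregate(layers[:4], np.zeros(4))
high = aggregate(layers[-4:], np.zeros(4))

# Stage 1 injects local detail into the text features, stage 2 injects global context.
text_stage1 = cross_attend(text, low)
text_stage2 = cross_attend(text_stage1, high)

# Score each composition by cosine similarity with a pooled image feature.
image = layers[-1].mean(axis=0)
logits = normalize(text_stage2) @ normalize(image)
prediction = int(np.argmax(logits))
```

In a real model the layer weights and attention projections would be trained, and the features would come from CLIP's encoders rather than a random generator.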
Taxonomy
Topics: Radiology practices and education · COVID-19 diagnosis using AI · Infectious Diseases and Tuberculosis
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
