CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement

Pan Yang; Cheng Deng; Jing Yang; Han Zhao; Yun Liu; Yuling Chen; Xiaoli Ruan; Yanping Chen

arXiv:2511.16378·cs.CV·November 21, 2025

TL;DR

CAMS introduces a novel approach for compositional zero-shot learning by extracting fine-grained semantic features through gated cross-attention and multi-space disentanglement, significantly improving generalization to unseen attribute-object pairs.

Contribution

The paper proposes CAMS, a method that enhances CZSL by extracting detailed semantic features and disentangling attribute and object representations in multiple spaces, surpassing previous methods.

Findings

1. Achieves state-of-the-art results on MIT-States, UT-Zappos, and C-GQA benchmarks.

2. Effectively disentangles attribute and object semantics for better unseen composition recognition.

3. Demonstrates improved generalization in both closed-world and open-world CZSL settings.
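
For readers new to the two evaluation settings named above, the difference lies in the candidate label space at test time. The sketch below follows the usual CZSL convention (closed-world restricts candidates to the known seen and unseen pairs, open-world admits every attribute-object combination). The toy vocabulary and pair sets are hypothetical and are not taken from the paper or its benchmarks.

```python
from itertools import product

# Hypothetical toy vocabulary, for illustration only.
attributes = ["wet", "dry", "old"]
objects = ["dog", "road"]

# Pairs observed during training, and unseen pairs known to occur at test time.
seen_pairs = {("wet", "dog"), ("dry", "road"), ("old", "dog")}
unseen_pairs = {("wet", "road"), ("old", "road")}

# Closed-world: the candidate set is the seen pairs plus the known unseen pairs.
closed_world = seen_pairs | unseen_pairs

# Open-world: every attribute-object combination is a candidate,
# including pairs that never occur in the data.
open_world = set(product(attributes, objects))

print(len(closed_world), len(open_world))
print(sorted(open_world - closed_world))
```

The open-world label space grows with the product of the attribute and object vocabularies, so a model must also reject many combinations that never occur in the data.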

Abstract

Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while…
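
The abstract is truncated before it specifies how the gate works, so the following is only a generic sketch of gated cross-attention, not the authors' design. It assumes that a small set of latent units act as queries over patch-level features, and that a sigmoid gate blends the attended features with the original latents. All names, shapes, and the gating form (`gated_cross_attention`, `Wq`, `Wk`, `Wv`, `Wg`) are hypothetical, and random arrays stand in for CLIP features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(latents, patches, Wq, Wk, Wv, Wg):
    """Assumed form: latents (m, d) query patch features (n, d).

    Returns the updated latents (m, d) and the attention map (m, n).
    """
    q = latents @ Wq
    k = patches @ Wk
    v = patches @ Wv
    # Scaled dot-product attention from latent units to patches.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    attended = attn @ v
    # Hypothetical gate: a sigmoid over the concatenated latent and attended features.
    gate_in = np.concatenate([latents, attended], axis=-1) @ Wg
    gate = 1.0 / (1.0 + np.exp(-gate_in))
    # The gate decides, per dimension, how much attended signal to admit.
    return gate * attended + (1.0 - gate) * latents, attn

rng = np.random.default_rng(0)
d, m, n = 8, 4, 16  # feature width, latent units, image patches (toy sizes)
latents = rng.normal(size=(m, d))
patches = rng.normal(size=(n, d))  # stand-in for features from an image encoder block
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Wg = rng.normal(size=(2 * d, d)) * 0.1

out, attn = gated_cross_attention(latents, patches, Wq, Wk, Wv, Wg)
print(out.shape, attn.shape)
```

Each latent unit produces one attention distribution over the patches, so the number of latent units controls how many fine-grained feature slots are extracted per image.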

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Generative Adversarial Networks and Image Synthesis