Learning Visual Proxy for Compositional Zero-Shot Learning
Shiyu Zhang, Cheng Yan, Yang Liu, Chenchen Jing, Lei Zhou, Wenjun Wang

TL;DR
This paper introduces Visual Proxy Learning and Cross-Modal Joint Learning (CMJL) for compositional zero-shot learning (CZSL). Together they reduce the modality gap and capture fine-grained visual cues, yielding state-of-the-art results on closed-world benchmarks and competitive results in the open-world setting.
Contribution
The paper proposes a visual proxy learning method and a cross-modal joint learning framework to enhance compositional generalization in CZSL.
Findings
Achieves state-of-the-art performance on closed-world CZSL benchmarks.
Demonstrates competitive results in open-world CZSL scenarios.
Effectively reduces modality gaps and captures fine-grained cues for better discrimination.
Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions by leveraging knowledge from seen compositions. Current methods align textual prototypes with visual features via Vision-Language Models (VLMs), but suffer from two limitations: (1) modality gaps hinder the discrimination of semantically similar pairs, and (2) single-modal textual prototypes lack fine-grained visual cues. In this paper, we introduce Visual Proxy Learning, a method that reduces modality gaps and enhances compositional generalization. We initialize visual proxies for attributes, objects, and their compositions using text representations and optimize the visual space to capture fine-grained cues, improving visual representations. Additionally, we propose Cross-Modal Joint Learning (CMJL), which imposes cross-modal constraints between the text-image and fine-grained visual spaces,…
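The proxy idea in the abstract (initialize visual proxies from text representations, then optimize them in the visual space) can be illustrated with a toy sketch. This is not the authors' implementation. The dot-product logits, the cross-entropy loss, the plain SGD update, the temperature, the learning rate, the three-dimensional embeddings, and the names `proxy_logits` and `proxy_step` are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_logits(feature, proxies, temperature):
    """Dot product between the normalized image feature and each proxy."""
    f = l2_normalize(feature)
    return [sum(a * b for a, b in zip(f, p)) / temperature for p in proxies]

def proxy_step(feature, label, proxies, lr=0.1, temperature=0.5):
    """One SGD step on the cross-entropy loss with respect to the proxies.

    Hypothetical training rule: the source does not specify the loss
    or the optimizer.
    """
    f = l2_normalize(feature)
    probs = softmax(proxy_logits(feature, proxies, temperature))
    loss = -math.log(probs[label])
    updated = []
    for k, proxy in enumerate(proxies):
        # d(loss)/d(proxy_k) = (p_k - 1[k == label]) * f / temperature
        coeff = (probs[k] - (1.0 if k == label else 0.0)) / temperature
        updated.append([w - lr * coeff * x for w, x in zip(proxy, f)])
    return updated, loss

# Toy text embeddings for two compositions (made-up values).
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Visual proxies start as copies of the text representations.
visual_proxies = [l2_normalize(t) for t in text_embeddings]

# A toy image feature that belongs to composition 0.
image_feature = [0.6, 0.5, 0.4]
label = 0

losses = []
for _ in range(20):
    visual_proxies, loss = proxy_step(image_feature, label, visual_proxies)
    losses.append(loss)
```

In this sketch the text embeddings stay fixed while the proxies drift toward the image features of their class, which is one simple way a proxy could pick up visual detail that its text initialization lacks.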
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Orthopedic Infections and Treatments
Methods: ALIGN
