Learning Visual Proxy for Compositional Zero-Shot Learning

Shiyu Zhang; Cheng Yan; Yang Liu; Chenchen Jing; Lei Zhou; Wenjun Wang

arXiv:2501.13859 · cs.CV · September 3, 2025

TL;DR

This paper introduces Visual Proxy Learning and Cross-Modal Joint Learning (CMJL) to improve compositional zero-shot learning (CZSL) by reducing modality gaps and capturing fine-grained visual cues, achieving state-of-the-art results on closed-world CZSL benchmarks.
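
The TL;DR refers to the modality gap between image and text embeddings. One commonly used way to quantify such a gap is the Euclidean distance between the centroid of the L2-normalized image embeddings and the centroid of the L2-normalized text embeddings. This measure is an assumption here, not necessarily the metric used in the paper. The minimal sketch below uses toy vectors in place of real VLM outputs, and the function names (`l2_normalize`, `centroid`, `modality_gap`) are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def modality_gap(image_embs, text_embs):
    """Distance between the centroids of normalized image and text embeddings
    (one common gap measure; assumed, not taken from the paper)."""
    img_c = centroid([l2_normalize(v) for v in image_embs])
    txt_c = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_c, txt_c)))

# Toy embeddings: images point along the first axis, texts along the second.
images = [[2.0, 0.0], [3.0, 0.0]]
texts = [[0.0, 5.0]]
print(modality_gap(images, texts))  # ≈ 1.4142 (sqrt(2))
```

A smaller value means the two modalities occupy closer regions of the shared embedding space; identical embedding sets give a gap of zero.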

Contribution

The paper proposes Visual Proxy Learning, which learns visual proxies for attributes, objects, and their compositions, together with a Cross-Modal Joint Learning (CMJL) framework, to enhance compositional generalization in CZSL.

Findings

01

Achieves state-of-the-art performance on closed-world CZSL benchmarks.

02

Demonstrates competitive results in open-world CZSL scenarios.

03

Reduces modality gaps and captures fine-grained visual cues, improving discrimination between semantically similar compositions.
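
The findings distinguish closed-world and open-world CZSL. Under the usual protocol, a closed-world model ranks only the compositions known to occur in the dataset (the seen training pairs plus the unseen test pairs), whereas an open-world model must rank every attribute-object pair, including infeasible ones. The minimal sketch below builds both candidate sets from toy labels; the function name `candidate_compositions` and the data are hypothetical.

```python
from itertools import product

def candidate_compositions(attributes, objects, seen_pairs, unseen_test_pairs, open_world):
    """Return the sorted (attribute, object) labels a model must rank."""
    if open_world:
        # Open world: the full Cartesian product, feasible or not.
        return sorted(product(attributes, objects))
    # Closed world: only compositions known to occur (seen + unseen test pairs).
    return sorted(set(seen_pairs) | set(unseen_test_pairs))

attributes = ["old", "sliced", "wet"]
objects = ["apple", "car"]
seen = [("old", "car"), ("sliced", "apple")]
unseen_test = [("wet", "car")]

closed = candidate_compositions(attributes, objects, seen, unseen_test, open_world=False)
open_ = candidate_compositions(attributes, objects, seen, unseen_test, open_world=True)
print(len(closed), len(open_))  # 3 6
print(("sliced", "car") in open_, ("sliced", "car") in closed)  # True False
```

The open-world label space grows as (number of attributes) × (number of objects) and contains implausible pairs such as ("sliced", "car"), which is one reason open-world evaluation is harder.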

Abstract

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions by leveraging knowledge from seen compositions. Current methods align textual prototypes with visual features via Vision-Language Models (VLMs), but suffer from two limitations: (1) modality gaps hinder the discrimination of semantically similar pairs, and (2) single-modal textual prototypes lack fine-grained visual cues. In this paper, we introduce Visual Proxy Learning, a method that reduces modality gaps and enhances compositional generalization. We initialize visual proxies for attributes, objects, and their compositions using text representations and optimize the visual space to capture fine-grained cues, improving visual representations. Additionally, we propose Cross-Modal Joint Learning (CMJL), which imposes cross-modal constraints between the text-image and fine-grained visual spaces,…
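
The abstract describes the method only at a high level (and is truncated), so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that composition prototypes are text embeddings from a VLM text encoder (toy vectors here), that visual proxies start as copies of those embeddings and would then be trained by gradient descent (not shown), that scores are temperature-scaled cosine similarities, and that the CMJL cross-modal constraint can be approximated by a symmetric KL divergence between the two spaces' prediction distributions. All function names (`init_visual_proxies`, `joint_loss`, `predict`) and parameters (`temperature`, `lam`, `alpha`) are hypothetical. The paper also learns attribute and object proxies; the same scoring would apply to those label sets and is omitted.

```python
import math

def normalize(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_logits(feature, prototypes, temperature=0.07):
    """Temperature-scaled cosine similarity between one feature and each prototype."""
    f = normalize(feature)
    return [sum(a * b for a, b in zip(f, normalize(p))) / temperature for p in prototypes]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def init_visual_proxies(text_prototypes):
    """Hypothetical: each visual proxy starts as a copy of its text prototype.
    In training these copies would be updated by gradient descent (not shown)."""
    return [list(t) for t in text_prototypes]

def joint_loss(text_feat, visual_feat, text_prototypes, visual_proxies, target, lam=1.0):
    """Cross-entropy in the text-image space and in the visual-proxy space, plus a
    symmetric-KL consistency term standing in for the CMJL cross-modal constraints."""
    t_logits = cosine_logits(text_feat, text_prototypes)
    v_logits = cosine_logits(visual_feat, visual_proxies)
    p_t, p_v = softmax(t_logits), softmax(v_logits)
    consistency = 0.5 * (kl_div(p_t, p_v) + kl_div(p_v, p_t))
    return cross_entropy(t_logits, target) + cross_entropy(v_logits, target) + lam * consistency

def predict(text_feat, visual_feat, text_prototypes, visual_proxies, alpha=0.5):
    """Fuse the two spaces' probabilities (assumed fusion rule); return the top index."""
    p_t = softmax(cosine_logits(text_feat, text_prototypes))
    p_v = softmax(cosine_logits(visual_feat, visual_proxies))
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(p_t, p_v)]
    return max(range(len(fused)), key=fused.__getitem__)

# Toy composition prototypes standing in for VLM text embeddings.
compositions = ["red apple", "green apple", "red car"]
text_protos = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.9, 0.0, 0.4]]
proxies = init_visual_proxies(text_protos)
image_feat = [0.95, 0.15, 0.05]  # the same toy feature is used for both spaces here
print(compositions[predict(image_feat, image_feat, text_protos, proxies)])  # red apple
print(joint_loss(image_feat, image_feat, text_protos, proxies, target=0))
```

Because the proxies start equal to the text prototypes and the same feature is fed to both spaces, the two distributions initially agree and the consistency term is zero; it only contributes once the proxies and the fine-grained visual features diverge from the text-image space during training.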
