ZeroDiff: Solidified Visual-Semantic Correlation in Zero-Shot Learning

Zihan Ye; Shreyank N. Gowda; Xiaowei Huang; Haotian Xu; Yaochu Jin; Kaizhu Huang; Xiaobo Jin

arXiv:2406.02929·cs.CV·February 12, 2025·1 cites

Open Access · 3 Reviews

TL;DR

ZeroDiff introduces a diffusion-based generative framework with contrastive learning to improve zero-shot learning, especially under limited training data, by enhancing visual-semantic correlations and reducing overfitting.

Contribution

The paper proposes ZeroDiff, a novel ZSL method combining diffusion augmentation, supervised-contrastive representations, and multiple discriminators for robust unseen class recognition.
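As a rough illustration of what "supervised-contrastive representations" usually refers to, the sketch below implements the standard supervised contrastive (SupCon) loss in plain Python. It is not taken from the paper: the function names `l2_normalize` and `supcon_loss`, the temperature of 0.1, and the toy embeddings are assumptions made for illustration, and ZeroDiff's actual objective may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    For each anchor i, every other sample with the same label is a positive;
    all other samples (positives and negatives) form the softmax denominator.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Temperature-scaled cosine similarities to every other sample.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy usage: two tight clusters in 2-D.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(toy, [0, 0, 1, 1])     # labels agree with the clusters
misaligned = supcon_loss(toy, [0, 1, 0, 1])  # labels cut across the clusters
```

In this toy setup the loss is lower when the labels agree with the geometric clusters, which is the property that pulls same-class features together during training.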

Findings

01. ZeroDiff outperforms existing ZSL methods on benchmark datasets.

02. It maintains high performance even with scarce training data.

03. Extensive experiments validate the effectiveness of the proposed approach.

Abstract

Zero-shot Learning (ZSL) aims to enable classifiers to identify unseen classes. This is typically achieved by generating visual features for unseen classes based on learned visual-semantic correlations from seen classes. However, most current generative approaches heavily rely on having a sufficient number of samples from seen classes. Our study reveals that a scarcity of seen class samples results in a marked decrease in performance across many generative ZSL techniques. We argue, quantify, and empirically demonstrate that this decline is largely attributable to spurious visual-semantic correlations. To address this issue, we introduce ZeroDiff, an innovative generative framework for ZSL that incorporates diffusion mechanisms and contrastive representations to enhance visual-semantic correlations. ZeroDiff comprises three key components: (1) Diffusion augmentation, which naturally…
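The abstract is cut off before it describes diffusion augmentation, so the following is only a sketch of the generic DDPM-style forward noising step, applied here to feature vectors. It assumes a linear noise schedule with common default values; the helper names (`linear_beta_schedule`, `cumulative_alphas`, `noise_feature`, `augment`) and all parameters are hypothetical and are not taken from the paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def noise_feature(x0, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0
    ]


def augment(features, num_steps=1000, copies=4, seed=0):
    """Return `copies` noised versions of each feature at random timesteps."""
    rng = random.Random(seed)
    alpha_bars = cumulative_alphas(linear_beta_schedule(num_steps))
    out = []
    for x0 in features:
        for _ in range(copies):
            t = rng.randrange(num_steps)
            out.append((t, noise_feature(x0, t, alpha_bars, rng)))
    return out


# Toy usage: two 3-D "visual features" expanded into eight noised samples.
noised = augment([[1.0, 2.0, 3.0], [0.5, -0.5, 0.0]], num_steps=100, copies=4)
```

Each real feature yields several noised copies at different noise levels, which is one way a small seen-class set could be expanded before training a generator and its discriminators.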

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

(1) The proposed method is reasonable and technically solid. Specifically, although addressing the limited-data issue by generating more data is quite straightforward, the details of the structure are still novel and effective. For example, the mutual learning mechanism of the discriminators and the incorporation of the diffusion module are well designed.
(2) The proposed method is effective. As shown in the experiments, the proposed method can achieve better performance than existing methods

Weaknesses

(1) The main manuscript is quite confusing without the appendix. For example, how the feature extractors are fine-tuned and the details of training and testing are not clearly presented in the main manuscript, making it somewhat hard to understand the proposed method when reading the main paper. It would be better if the authors could briefly explain and highlight such key details in the main paper as well as explain them in detail in the appendix.
(2) It would be better to conduct more experiments

Reviewer 02 · Rating 6 · Confidence 5

Strengths

- The analysis of the performance degradation of ZSL due to a spurious visual-semantic correlation learned from a limited number of seen samples is inspiring.
- The proposed diffusion augmentation and dynamic semantics methods are interesting.

Weaknesses

- Identify and highlight the 1-2 most critical components that provide the key insights. The proposed pipeline is quite complex. It might be hard to tune the whole model. Moreover, it is also hard to know what is the key insight of the proposed method, since there are too many components.
- Provide more motivations and justifications on the design choices of the key components. For example, 1) a clear explanation of the complementary benefits of using both CE and SC loss-based features should be

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The paper proposes a novel diffusion-based generative method for ZSL.
2. The experiments are comprehensive.

Weaknesses

1. The paper claims that generative-based methods learn spurious visual-semantic relationships when training data is insufficient. Is the conclusion applicable to non-generative ZSL methods?
2. As shown in Fig. 2, \delta_{adv} increases over the three datasets as the training progresses. However, the paper aims to learn a model with low \delta_{adv} values. Does that mean we are getting a worse model as the training continues? It isn't very clear. The authors need to clarify the relationship betw

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning

Methods: Sparse Evolutionary Training · Diffusion