AIR: Zero-shot Generative Model Adaptation with Iterative Refinement
Guimeng Liu, Milad Abdollahzadeh, Ngai-Man Cheung

TL;DR
This paper introduces AIR, a novel zero-shot generative model adaptation method that iteratively refines image quality by addressing offset misalignment in CLIP embedding space, achieving state-of-the-art results.
Contribution
The paper provides the first empirical analysis of offset misalignment in CLIP space and proposes AIR, a new iterative refinement approach for improved zero-shot domain adaptation.
Findings
Offset misalignment correlates with concept distance in CLIP space.
AIR outperforms existing methods in image quality and domain adaptation.
Qualitative, quantitative, and user studies confirm AIR's superior performance.
Abstract
Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches are directional losses, which apply the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model such as CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets of the two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between the image offset and the text offset in the CLIP embedding space, which degrades the quality of generated images. Our work makes two main contributions. Inspired by the offset misalignment studies in NLP, as…
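The directional loss mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation of one minus the cosine similarity between the image offset and the text offset; the abstract does not state the exact form, so this is an assumption rather than the authors' implementation. The function names `normalize` and `directional_loss` are hypothetical, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np


def normalize(v, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """Mean of (1 - cosine similarity) between image and text offsets.

    Assumed formulation, not taken from the paper:
      image offset = embedding of adapted image - embedding of source image
      text offset  = embedding of target text   - embedding of source text
    A value of 0 means the offsets point in the same direction; larger
    values indicate more misalignment between the two offsets.
    """
    delta_img = normalize(np.asarray(img_tgt) - np.asarray(img_src))
    delta_txt = normalize(np.asarray(txt_tgt) - np.asarray(txt_src))
    cos = np.sum(delta_img * delta_txt, axis=-1)
    return float(np.mean(1.0 - cos))


# Stand-in embeddings (hypothetical): 4 images, 512-dimensional vectors.
rng = np.random.default_rng(0)
img_src = rng.normal(size=(4, 512))
img_tgt = rng.normal(size=(4, 512))
txt_src = rng.normal(size=512)
txt_tgt = rng.normal(size=512)
loss = directional_loss(img_src, img_tgt, txt_src, txt_tgt)
```

Minimizing this quantity pushes the image offset toward the text offset, which is exactly the complete-alignment assumption the abstract identifies as a limitation: if the two offsets are not truly parallel in CLIP space, driving the loss to zero pulls generated images in a slightly wrong direction.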
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The paper introduces a novel approach to Zero-shot Generative Model Adaptation (ZSGM) by addressing the critical issue of offset misalignment between image and text representations in the CLIP embedding space, showcasing originality in its formulation and methodology.
1. The references are somewhat disorganized and have formatting issues; for example, many citations should be formatted as (Smith et al., 2023) rather than Smith et al. (2023). Additionally, there is a lack of coherent context when citing references. 2. The writing of this paper could benefit from some improvement, as it contains several spelling errors (e.g., "Adaptatoin" instead of "Adaptation") and some grammatical inconsistencies. 3. In the Related Work section, this paper assumes that many
1. This paper is well-written and clearly introduces its motivation and research methods. First, it highlights the limitation of previous methods, which simply aligned image offsets with text offsets, and verifies this limitation through experiments. Next, it conducts experiments to validate the hypothesis that addressing these misalignments can lead to improved performance. Finally, based on this analysis, the paper presents its proposed research method. 2. The paper conducts an analysis of the
1. Table 1 and Table 2 present the results of the GAN model and the diffusion model, respectively. However, the evaluation metrics, comparison methods, and adaptations used for the two models are not consistent. 2. Most of the experiments conducted involve adaptation between two concepts with similar images. 3. Line 153 states, "Previous works assume that for two different concepts, α and β." However, Algorithm 1 and Algorithm 2 use two learning rates, also denoted as α and β. This could lead
1. This paper is well-organized and clearly presented. 2. The study on offset misalignment is novel to me. 3. The iterative refinement solution for misalignment is interesting and reasonable.
I did not find any remarkable flaws in this paper. However, I have one question: in the study in Section 3.1, the concept distance is measured between different classes, whereas in the impact study in Section 3.2, the concept distance is constructed using different hand-crafted prompts. Why are these setups misaligned?
1. The paper conducts an empirical study on a large public dataset to analyze offset misalignment in the CLIP embedding space, finding that misalignment increases as the concepts become more distant. 2. Figures 2 and 3 vividly present the misalignment in the CLIP space and illustrate the impact of offset misalignment.
1. The paper lacks theoretical proof or experimental evidence that, after a limited number of adaptation iterations, the adapted generator is already closer to the target domain than the pre-trained generator. 2. There is no sensitivity study for the parameters t_int and t_thresh. Since these parameters play a critical role in introducing the adaptive loss and updating the anchor points, their impact should be analyzed. 3. This paper lacks comparative experiments i
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: Focus · Contrastive Language-Image Pre-training
