DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation
Zhizhong Wang, Tianyi Chu, Zeyi Huang, Nanyang Wang, Kehan Li

TL;DR
DynaIP introduces a dynamic, scalable plugin for zero-shot personalized text-to-image generation that improves concept fidelity, balances prompt following, and enhances multi-subject personalization by leveraging hierarchical features and a decoupling strategy.
Contribution
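The paper proposes DynaIP, a plugin for zero-shot personalized text-to-image (PT2I) generation that improves fine-grained concept fidelity, balances Concept Preservation (CP) against Prompt Following (PF), and scales to multi-subject personalization without test-time fine-tuning. It does so with a dynamic decoupling strategy and hierarchical feature fusion.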
Findings
DynaIP outperforms existing methods in single- and multi-subject PT2I tasks.
The hierarchical features of CLIP effectively capture visual information at multiple granularities.
The dynamic decoupling strategy improves the CP-PF balance and scalability.
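As a rough illustration of the hierarchical-feature finding, the sketch below fuses token features taken from several encoder depths using softmax gate weights, in the spirit of the per-level weights $w_l$ that a reviewer mentions for HMoE-FFM. It is not the paper's Eq. (7), which is not reproduced on this page. The function name `fuse_hierarchical_features`, the mean-pooled linear router `gate_w`, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_hierarchical_features(level_feats, gate_w):
    """Fuse features from L encoder levels with softmax gate weights.

    level_feats: array of shape (L, N, D), i.e. N tokens of width D per level.
    gate_w:      array of shape (D,), a hypothetical linear router (assumption).
    Returns the fused (N, D) features and the (L,) per-level weights.
    """
    pooled = level_feats.mean(axis=1)   # (L, D): one summary vector per level
    logits = pooled @ gate_w            # (L,):   one router score per level
    weights = softmax(logits)           # (L,):   non-negative, sums to 1
    # Weighted sum over the level axis gives fused token features of shape (N, D).
    fused = np.tensordot(weights, level_feats, axes=(0, 0))
    return fused, weights


# Random stand-ins for shallow, middle, and deep encoder features (not real CLIP outputs).
rng = np.random.default_rng(0)
level_feats = rng.normal(size=(3, 4, 8))
gate_w = rng.normal(size=8)

fused, weights = fuse_hierarchical_features(level_feats, gate_w)
print(fused.shape, weights)
```

In a sketch like this, a larger weight on a shallow level would favor low-level detail and a larger weight on a deep level would favor semantics. Whether DynaIP's weights behave this way is exactly the kind of post-hoc analysis the reviewers ask for.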
Abstract
Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: (1) the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), (2) the difficulty in retaining fine-grained concept details in reference images, and (3) the restricted scalability to extend to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiTs inherently exhibit decoupling learning behavior…
Peer Reviews
Decision · Submitted to ICLR 2026
1. Strong OOD performance: The approach shows consistent performance gains across various OOD and compositional reasoning benchmarks. 2. Training-free adaptability: The method improves generalization at test time without requiring fine-tuning of the underlying VLM, making it easy to deploy. 3. Dynamic prompt selection: Unlike static prompt methods, DynaIP adapts to each test example by selecting the most relevant prompt patches, enhancing flexibility.
1. Complexity of policy training: While inference is training-free, the policy model itself must be trained in advance, and the training procedure is not fully detailed. This raises potential concerns regarding reproducibility and scalability. 2. Limited theoretical grounding: The paper lacks a deeper theoretical explanation for why dynamic patch selection improves generalization, especially under significant domain shifts.
- *Achieves strong performance.* The proposed method effectively addresses the key challenges faced by current approaches to personalized text-to-image (PT2I) generation, and its efficacy is convincingly demonstrated through extensive qualitative examples and comprehensive quantitative evaluations. - *Comprehensive comparisons.* The selection of methods for comparison is thorough and well-considered, encompassing a broad spectrum of both open-source and closed-source state-of-the-art
- *Limited post-hoc analysis of key components.* The paper lacks in-depth post-hoc analysis of the proposed Dynamic Decoupling Strategy (DDS) and Hierarchical Mixture-of-Experts Feature Fusion Module (HMoE-FFM). For instance, the decoupling effect of DDS could be visualized through attention maps, and the $w_l$ in Eq. (7) of HMoE-FFM could be analyzed across diverse cases to reveal how granularity control is adaptively achieved. Such analyses would significantly enhance the interpretability and
1. The proposed method is technically sound and effective for personalized image generation. The DDS effectively disentangles concept-specific and concept-agnostic information, leading to a better balance between CP and PF, and the HMoE-FFM module utilizes multi-level CLIP features, providing flexible control over visual granularity. 2. Comprehensive Experiments: The paper includes extensive experiments on both single- and multi-subject datasets, demonstrating the effectiveness of DynaIP across
1. The evaluation details, including the system prompts and the detailed metrics behind Table 1, should be included. To me, the multi-subject prompt following (PF) result of 0.997 is not convincing, as it implies that nearly all user instructions are rendered perfectly by the proposed method. It would be better to include more details to clarify this. Additionally, the reliance on a single Vision-Language Model (VLM) for evaluation could still introduce biases or limitations. 2. In MM-DiT style models, af
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
