Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter
Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, Guanbin Li

TL;DR
This paper introduces Mod-Adapter, a tuning-free method for multi-concept personalization in text-to-image generation, capable of customizing both object and abstract concepts without fine-tuning, using a novel modulation mechanism and vision-language guidance.
Contribution
The paper presents Mod-Adapter, a novel tuning-free approach that leverages a modulation mechanism and vision-language pre-training to personalize multiple concepts, including abstract ones, without test-time fine-tuning.
Findings
Achieves state-of-the-art results in multi-concept personalization.
Supports both object and abstract concept customization.
Outperforms existing methods in quantitative, qualitative, and human evaluations.
Abstract
Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most methods are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pre-trained Diffusion Transformer (DiT) models, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we…
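To make the "modulation space" in the abstract more concrete, here is a minimal sketch of AdaLN-style modulation with per-token offsets. It rests on assumptions: the abstract is truncated and does not give the formula, so the form `(1 + scale) * LayerNorm(x) + shift`, the function names, the toy prompt, and the offset values are all hypothetical illustrations rather than the authors' implementation. The point is only that concept-specific directions can be added to the modulation of individual tokens while the other tokens keep the shared modulation.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate(x, scale, shift):
    """Assumed AdaLN-style modulation: (1 + scale) * LayerNorm(x) + shift."""
    return [(1.0 + s) * v + b for v, s, b in zip(layer_norm(x), scale, shift)]

def per_token_modulation(num_tokens, base_scale, base_shift, concept_offsets):
    """Build one (scale, shift) pair per token.

    concept_offsets maps a token index to a hypothetical (delta_scale, delta_shift)
    pair; tokens without an entry keep the shared base modulation.
    """
    zero = [0.0] * len(base_scale)
    mods = []
    for i in range(num_tokens):
        d_scale, d_shift = concept_offsets.get(i, (zero, zero))
        scale = [a + b for a, b in zip(base_scale, d_scale)]
        shift = [a + b for a, b in zip(base_shift, d_shift)]
        mods.append((scale, shift))
    return mods

def modulate_tokens(tokens, mods):
    """Apply each token's own modulation pair."""
    return [modulate(x, s, b) for x, (s, b) in zip(tokens, mods)]

# Toy prompt "dog in sunlight" with 4-dimensional token features (made-up values).
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, 0.0], [2.0, 2.0, 0.0, 1.0]]
base_scale = [0.1] * 4
base_shift = [0.0] * 4

# Hypothetical offsets: one object concept ("dog", index 0) and one abstract
# concept ("sunlight", index 2). Each is simply added to the base modulation.
concept_offsets = {
    0: ([0.2, 0.0, -0.1, 0.0], [0.05, 0.0, 0.0, 0.0]),
    2: ([0.0, 0.3, 0.0, 0.0], [0.0, 0.0, 0.1, 0.0]),
}

mods = per_token_modulation(len(tokens), base_scale, base_shift, concept_offsets)
out = modulate_tokens(tokens, mods)
```

In this toy setup the token "in" (index 1) is modulated exactly as it would be without any personalization, while the two concept tokens are shifted independently of each other.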
Peer Reviews
Decision: ICLR 2026 Poster
- Achieves tuning-free multi-concept personalization. Unlike previous methods such as TokenVerse, which require fine-tuning small MLPs for each new concept, Mod-Adapter is trained once to be universally applicable to all concepts. During inference, only reference images and concept words need to be provided, without any optimization steps.
- Extensive experiments are conducted on several datasets. The proposed method achieves good performance in multi-concept personalization.
- The proposed Mod-Adapter is dependent on its training data. Mod-Adapter needs to be trained on datasets containing abstract concepts: synthetic data, MVImgNet, and AFHQ. If the user concepts fall outside the training distribution, its generalization ability is questionable.
- The performance degrades when customizing more than three concepts simultaneously, which limits its application in extremely complex scenarios.
- Why, in Table 1, is the performance of Mod-Adapter 0.61, which is lower than pr
- The paper addresses an important and underexplored challenge in multi-concept personalization: the simultaneous customization of both object and abstract concepts without test-time fine-tuning. This represents a significant practical limitation of existing methods that the authors successfully tackle.
- The insight to leverage the localized and semantically meaningful properties of the DiT modulation space is particularly clever. This approach enables the additive combination of multiple conc
- The Mod-Adapter contains 1.67B parameters, which is large for an adapter module. This raises questions about the practical deployment of the method, especially in resource-constrained environments. The paper does not adequately address this concern.
- The paper does not sufficiently explore the generalization capabilities of the method to unseen concept types or combinations. The ablation studies focus on component removal but lack analysis of performance under varying conditions.
- The paper
1. The paper addresses the limitations of existing methods by proposing a tuning-free framework that can handle both object and abstract concepts. The extended benchmark is valuable for further studies.
2. The integration of Vision-Language Cross-Attention and MoE layers in the Mod-Adapter module is well-justified and effective, as shown in the ablation studies.
3. Experimental results underscore the effectiveness of the proposed method.
1. While Mod-Adapter introduces a tuning-free improvement, the core idea of leveraging the DiT modulation space for multi-concept personalization builds on TokenVerse.
2. The paper seems to claim an ability to deal with unseen concepts (used in neither training nor pre-training); however, the experiments are not split based on whether the concept is used in training. Besides, I doubt that some designs of the method, like the number of experts in MoE, may be highly dependent on the pre-trainin
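The deployment concern raised above about the 1.67B-parameter adapter can be put in rough numbers. The sketch below is back-of-the-envelope arithmetic only: the parameter count comes from the review, while the candidate precisions are assumptions (this page does not state which precision the authors use), and the estimate covers the adapter weights alone, not activations or the base DiT.

```python
ADAPTER_PARAMS = 1.67e9  # parameter count cited in the review above

# Assumed storage precisions; the precision actually used is not stated here.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_memory_gib(ADAPTER_PARAMS, nbytes):.2f} GiB")
```

Under these assumptions the adapter weights take roughly 6.2 GiB in fp32 and roughly 3.1 GiB in half precision, on top of whatever the base model requires.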
Taxonomy
Topics: Advanced Text Analysis Techniques · Data Management and Algorithms · Text and Document Classification Technologies
Methods: Diffusion
