User-Friendly Customized Generation with Multi-Modal Prompts
Linhao Zhong, Yan Hong, Wentao Chen, Binglin Zhou, Yiyi Zhang, Jianfu Zhang, Liqing Zhang

TL;DR
This paper introduces a user-friendly multi-modal prompt approach for text-to-image generation that requires only one image and accompanying text per concept, making customization easier for users while supporting more intricate customization.
Contribution
It proposes a novel multi-modal prompt method that reduces user effort and enhances customization capabilities in text-to-image models compared to existing finetuning techniques.
Findings
Outperforms finetune-based methods in user-friendliness
Enables complex object customization with minimal inputs
Facilitates precise scene customization
Abstract
Text-to-image generation models have seen considerable advancement, catering to the increasing interest in personalized image creation. Current customization techniques often require users to provide multiple images (typically 3-5) for each customized object, along with the classification of these objects and descriptive textual prompts for scenes. This paper questions whether the process can be made more user-friendly and the customization more intricate. We propose a method in which users need only provide images along with text for each customization topic, and which requires only a single image per visual concept. We introduce the concept of a "multi-modal prompt", a novel integration of text and images tailored to each customization concept, which simplifies user interaction and facilitates precise customization of both objects and scenes. Our proposed paradigm for customized…
Taxonomy
Topics: Advanced Software Engineering Methodologies
