Instant Preference Alignment for Text-to-Image Diffusion Models
Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue

TL;DR
This paper introduces a training-free, real-time framework for preference-aligned text-to-image (T2I) generation. It uses multimodal large language models (MLLMs) to understand user preferences and to guide diffusion models, enabling interactive and nuanced image creation.
Contribution
The authors present the method as the first to achieve instant, preference-aligned T2I generation without training. It relies on MLLMs for both preference understanding and generation control, and it supports multi-round refinement.
Findings
Outperforms prior methods in quantitative metrics
Achieves superior human evaluation scores
Enables real-time, interactive image generation
Abstract
Text-to-image (T2I) generation has greatly enhanced creative expression, yet achieving preference-aligned generation in a real-time and training-free manner remains challenging. Previous methods often rely on static, pre-collected preferences or fine-tuning, limiting adaptability to evolving and nuanced user intents. In this paper, we highlight the need for instant preference-aligned T2I generation and propose a training-free framework grounded in multimodal large language model (MLLM) priors. Our framework decouples the task into two components: preference understanding and preference-guided generation. For preference understanding, we leverage MLLMs to automatically extract global preference signals from a reference image and enrich a given prompt using structured instruction design. Our approach supports broader and more fine-grained coverage of user preferences than existing…
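The two-component split described in the abstract can be pictured with a minimal sketch. Everything here is hypothetical: the class, the method names, and the stub models are illustrative assumptions, not the authors' code or any library's API. The visible part of the abstract does not say how guidance is applied inside the diffusion model, so this sketch simply forwards the enriched prompt to a frozen generator.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases; the abstract does not define any interface.
MLLM = Callable[[str, str], str]   # (instruction, reference image path) -> text
Generator = Callable[[str], str]   # (prompt) -> generated image identifier


@dataclass
class PreferenceAlignedPipeline:
    """Illustrative two-stage pipeline; no component is trained or fine-tuned."""
    mllm: MLLM
    generator: Generator

    def understand(self, prompt: str, reference_image: str) -> str:
        # Stage 1 (preference understanding): ask the MLLM to read the
        # reference image and enrich the user's prompt accordingly.
        instruction = (
            "Describe the global visual preferences shown in the reference "
            f"image, then rewrite this prompt to reflect them: {prompt}"
        )
        return self.mllm(instruction, reference_image)

    def generate(self, prompt: str, reference_image: str) -> str:
        # Stage 2 (preference-guided generation): assumption, the enriched
        # prompt is passed unchanged to a frozen T2I model.
        enriched = self.understand(prompt, reference_image)
        return self.generator(enriched)


# Stub models so the sketch runs without any weights or network access.
def stub_mllm(instruction: str, image_path: str) -> str:
    original_prompt = instruction.rsplit(": ", 1)[-1]
    return f"{original_prompt}, in the style of {image_path}"


def stub_generator(prompt: str) -> str:
    return f"<image for: {prompt}>"


pipeline = PreferenceAlignedPipeline(stub_mllm, stub_generator)
result = pipeline.generate("a cat on a sofa", "ref.png")
print(result)
```

Swapping the stubs for a real MLLM client and a real diffusion pipeline would leave the control flow unchanged, which is the point of keeping the two stages decoupled. Multi-round refinement could be modeled by calling `generate` again with an updated prompt or reference image.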
