Instant Preference Alignment for Text-to-Image Diffusion Models

Yang Li; Songlin Yang; Xiaoxuan Han; Wei Wang; Jing Dong; Yueming Lyu; Ziyu Xue

arXiv:2508.17718 · cs.CV · August 26, 2025

TL;DR

This paper introduces a training-free, real-time framework for preference-aligned text-to-image generation. It uses multimodal large language models (MLLMs) to understand user preferences and to guide diffusion models, enabling interactive and nuanced image creation.

Contribution

The proposed method is the first to achieve instant, preference-aligned text-to-image (T2I) generation without training. It leverages MLLMs for preference understanding and control, and it supports multi-round refinement.

Findings

01. Outperforms prior methods on quantitative metrics

02. Achieves superior human evaluation scores

03. Enables real-time, interactive image generation

Abstract

Text-to-image (T2I) generation has greatly enhanced creative expression, yet achieving preference-aligned generation in a real-time and training-free manner remains challenging. Previous methods often rely on static, pre-collected preferences or fine-tuning, limiting adaptability to evolving and nuanced user intents. In this paper, we highlight the need for instant preference-aligned T2I generation and propose a training-free framework grounded in multimodal large language model (MLLM) priors. Our framework decouples the task into two components: preference understanding and preference-guided generation. For preference understanding, we leverage MLLMs to automatically extract global preference signals from a reference image and enrich a given prompt using structured instruction design. Our approach supports broader and more fine-grained coverage of user preferences than existing…