PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation
Wenyi Mo, Tianyu Zhang, Yalong Bai, Ligong Han, Ying Ba, Dimitris N. Metaxas

TL;DR
PrefGen introduces a framework that uses multimodal large language models (MLLMs) to encode user preferences for personalized image generation, improving alignment with individual aesthetic choices over the compared baselines.
Contribution
The paper proposes a novel multimodal preference learning approach that leverages MLLMs and alignment techniques to enhance personalized image generation beyond textual prompts.
Findings
Outperforms baselines in image quality and preference alignment
Effectively captures nuanced user preferences through multimodal embeddings
Demonstrates the importance of alignment loss for multimodal compatibility
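The peer reviews on this page describe the alignment loss as MMD-based (maximum mean discrepancy), a distribution-level alternative to point-wise alignment. The following is a minimal sketch of a biased squared-MMD estimate with a Gaussian kernel, not the paper's implementation. The kernel choice, the bandwidth, and the synthetic "embeddings" are all assumptions made for illustration.

```python
import math
import random

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two vector sets."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def sample(mean, n, dim, rng):
    """Draw n synthetic vectors from an isotropic Gaussian (stand-in data)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
# Hypothetical stand-ins: preference embeddings and two candidate target sets.
pref_embeddings = sample(0.0, 50, 4, rng)
matched_targets = sample(0.0, 50, 4, rng)   # same distribution as preferences
shifted_targets = sample(3.0, 50, 4, rng)   # deliberately mismatched distribution

mmd_same = mmd_squared(pref_embeddings, matched_targets)
mmd_shifted = mmd_squared(pref_embeddings, shifted_targets)
print(f"matched: {mmd_same:.4f}  shifted: {mmd_shifted:.4f}")
```

Used as a loss, a quantity like this penalizes a mismatch between two embedding distributions without forcing each individual embedding onto a fixed target, which is the contrast with point-wise losses that the reviews draw.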
Abstract
Preference-conditioned image generation seeks to adapt generative models to individual users, producing outputs that reflect personal aesthetic choices beyond the given textual prompt. Despite recent progress, existing approaches either fail to capture nuanced user preferences or lack effective mechanisms to encode personalized visual signals. In this work, we propose a multimodal framework that leverages multimodal large language models (MLLMs) to extract rich user representations and inject them into diffusion-based image generation. We train the MLLM with a preference-oriented visual question answering task to capture fine-grained semantic cues. To isolate preference-relevant features, we introduce two complementary probing tasks: inter-user discrimination to distinguish between different users, and intra-user discrimination to separate liked from disliked content. To ensure…
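The two probing tasks in the abstract can be illustrated with linear probes on frozen embeddings. This is only a sketch under stated assumptions: the abstract does not specify the probe architecture, so a logistic-regression probe is assumed here, the embeddings are synthetic Gaussian clusters rather than MLLM features, and the names `train_linear_probe` and `probe_accuracy` are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically safe logistic function."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, epochs=100, lr=0.1):
    """Fit a logistic-regression probe with plain SGD (assumed probe form)."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            grad = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * grad * v for w, v in zip(weights, x)]
            bias -= lr * grad
    return weights, bias

def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples whose predicted class matches the label."""
    hits = 0
    for x, y in zip(features, labels):
        logit = sum(w * v for w, v in zip(weights, x)) + bias
        hits += int((logit > 0) == (y == 1))
    return hits / len(labels)

def cluster(center, n, rng):
    """Synthetic embeddings scattered around a center (stand-in data)."""
    return [[c + rng.gauss(0.0, 1.0) for c in center] for _ in range(n)]

def make_task(pos_center, neg_center, n, rng):
    """Binary dataset: label 1 for the first cluster, 0 for the second."""
    feats = cluster(pos_center, n, rng) + cluster(neg_center, n, rng)
    return feats, [1] * n + [0] * n

rng = random.Random(0)
# Intra-user task: one user's liked (1) vs. disliked (0) items.
liked, disliked = [2, 2, 0, 0], [-2, -2, 0, 0]
# Inter-user task: user A (1) vs. user B (0).
user_a, user_b = [0, 0, 2, 2], [0, 0, -2, -2]

results = {}
for name, (pos, neg) in {"intra_user": (liked, disliked),
                         "inter_user": (user_a, user_b)}.items():
    train_x, train_y = make_task(pos, neg, 40, rng)
    test_x, test_y = make_task(pos, neg, 40, rng)
    w, b = train_linear_probe(train_x, train_y)
    results[name] = probe_accuracy(w, b, test_x, test_y)
print(results)
```

High held-out probe accuracy on a given set of features would suggest that those features carry the corresponding signal (who the user is, or what that user likes), which is the kind of evidence a probing task is meant to provide.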
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper introduces an elegant multimodal framework that systematically disentangles and aligns user-specific preference signals from different layers of an MLLM, providing conceptual clarity and technical novelty.
* The proposed MMD-based alignment is a well-motivated alternative to rigid point-wise alignment losses, leading to more stable and generalizable conditioning across diffusion backbones.
* Extensive experiments, including a new benchmark (PREFBENCH) and human evaluations, show cons
* The reliance on a large synthetic agent-generated dataset raises concerns about ecological validity and generalization to truly diverse human preferences, which is only partially addressed by the smaller real-user subset.
* While the method demonstrates strong results, the added complexity of multimodal probing, dual discrimination tasks, and distribution alignment increases implementation burden and may limit reproducibility.
* The paper lacks a deeper theoretical or ablation-based explanatio
1. The problem definition is clear.
2. It uses distribution-alignment losses (MMD) for robust embedding learning.
3. It demonstrates comprehensive comparisons across six personalization baselines.
4. It introduces the large-scale PREFBench dataset with synthetic and real user clusters.
1. From an overall perspective, this paper presents an engineering-oriented work, employing rather straightforward methodologies such as MMD. The motivation behind the study lacks clarity.
2. The dataset over-relies on virtual "user clusters," which may inflate controllability and impact real human diversity. The ratio between synthetic and real data should be adjusted to validate the efficacy.
3. The ablation studies on MMD and disentanglement stability should be strengthened.
4. Aesthetic anal
- Fig 4 has nice examples of potential use cases, i.e., showing that PrefGen can be used for product design or character design, which goes beyond pure image generation.
- The paper is well written and easy to follow.
- Generalization to new (OOD) users/preferred images: The results are evaluated on unseen users from the same distribution as those in the training data. It would be interesting to also see how well the method does with user preferences and corresponding liked/disliked images that differ from what was used during training (generated images), i.e., real images, photographs, sketches, etc.
- Preference history: One way to make the paper stronger could be to consider each user having a p
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
