PreferThinker: Reasoning-based Personalized Image Preference Assessment
Shengqi Xu, Xinpeng Zhou, Yabo Zhang, Ming Liu, Tao Liang, Tianyu Zhang, Yalong Bai, Zuxuan Wu, Wangmeng Zuo

TL;DR
This paper introduces PreferThinker, a reasoning-based framework for personalized image preference assessment that predicts user profiles from limited references and provides interpretable multi-dimensional evaluations, leveraging large-scale data and structured reasoning.
Contribution
It proposes a novel predict-then-assess paradigm with a preference profile predictor, a large-scale CoT-style dataset, and a two-stage training strategy including reinforcement learning.
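As a rough illustration of the predict-then-assess flow, the toy sketch below first infers a profile from a user's liked and disliked reference images, then scores two candidates against it on each dimension. Everything here is an assumption made for illustration: the real framework uses a trained reasoning model on images, while this sketch uses hand-written attribute tags, a simple vote count, and hypothetical names (`DIMENSIONS`, `predict_profile`, `assess`, `prefer`). The five dimension names follow the visual elements mentioned in the review comments on this page.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical profile dimensions (assumption: named after the five visual
# elements mentioned in the reviews; not the paper's actual schema).
DIMENSIONS = ["art_style", "color", "detail", "art_medium", "saturation"]

# Toy stand-in for an image: a dict of attribute tags (assumption).
Image = Dict[str, str]


def predict_profile(liked: List[Image], disliked: List[Image]) -> Dict[str, str]:
    """Step 1 (predict): infer a preference profile from reference images.

    Toy rule: for each dimension, pick the value with the highest
    (liked count - disliked count), provided that margin is positive.
    """
    profile: Dict[str, str] = {}
    for dim in DIMENSIONS:
        votes = Counter(img[dim] for img in liked if dim in img)
        votes.subtract(Counter(img[dim] for img in disliked if dim in img))
        if votes:
            value, score = max(votes.items(), key=lambda kv: kv[1])
            if score > 0:
                profile[dim] = value
    return profile


def assess(profile: Dict[str, str], candidate: Image) -> Dict[str, int]:
    """Step 2 (assess): per-dimension match (1) or mismatch (0)."""
    return {dim: int(candidate.get(dim) == value) for dim, value in profile.items()}


def prefer(profile: Dict[str, str], a: Image, b: Image) -> str:
    """Return "A" or "B" for the candidate that matches the profile better."""
    score_a = sum(assess(profile, a).values())
    score_b = sum(assess(profile, b).values())
    return "A" if score_a >= score_b else "B"


if __name__ == "__main__":
    liked = [
        {"art_style": "watercolor", "color": "warm"},
        {"art_style": "watercolor", "color": "cool"},
    ]
    disliked = [{"art_style": "pixel", "color": "cool"}]
    profile = predict_profile(liked, disliked)
    candidate_a = {"art_style": "pixel", "color": "warm"}
    candidate_b = {"art_style": "watercolor", "color": "warm"}
    print(profile)
    print(assess(profile, candidate_b))
    print(prefer(profile, candidate_a, candidate_b))
```

The per-dimension dictionary returned by `assess` is what makes the final choice inspectable in this toy version: the overall preference is just a sum of dimension-level judgments that a reader can check one by one.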
Findings
Outperforms existing methods in personalized preference assessment.
Effectively predicts user preference profiles from limited references.
Provides interpretable, multi-dimensional image assessments.
Abstract
Personalized image preference assessment aims to evaluate an individual user's image preferences by relying only on a small set of reference images as prior information. Existing methods mainly focus on general preference assessment, training models with large-scale data to tackle well-defined tasks such as text-image alignment. However, these approaches struggle to handle personalized preference because user-specific data are scarce and not easily scalable, and individual tastes are often diverse and complex. To overcome these challenges, we introduce a common preference profile that serves as a bridge across users, allowing large-scale user data to be leveraged for training profile prediction and capturing complex personalized preferences. Building on this idea, we propose a reasoning-based personalized image preference assessment framework that follows a \textit{predict-then-assess}…
Peer Reviews
Decision·ICLR 2026 Poster
* This work proposes an interesting system to predict a user's preference profile for text-to-image generation, and then score generated images for that individual
* This work contributes a new large-scale synthetic preference dataset, PreferImg, created from 80K synthetic user preference profiles with attributes that the authors choose after a real-world user study. This dataset can be a valuable resource for the text-to-image reward modeling community
* The authors demonstrate that their dataset
* The authors argue that *"although each user’s personalized preferences are unique, the key visual elements that shape these preferences are shared"* (L197-198), and they mention discrete attributes that users rank highly as important to them (art style, color, detail, art medium and saturation). I feel that this is a strong assumption to make: what about individual preference differences that are more semantic in nature for a given prompt? Is it possible to discretize real-world user preferences
This paper tackles the problem of personalized preference assessment by using a common preference profile as a bridge between users. This idea is novel. Beyond this, the paper also introduces a new, large-scale dataset for personalized assessment. The experiments are robust, covering seen vs. unseen profiles, single vs. multi-preference users, and robustness to the number of reference images.
The primary weakness is that the main dataset, PreferImg-CoT, is built on simulated user preferences. While the simulation pipeline is well-designed (based on a user study to find 5 key elements), simulated profiles may not capture the full, complex, and sometimes contradictory or hard-to-articulate nature of real human preferences.
This paper addresses personalized image preference assessment from a novel, interpretable perspective based on visual preference profiles. A large-scale Chain-of-Thought (CoT)-style personalized assessment dataset, annotated with diverse user preference profiles and high-quality CoT-style reasoning, is constructed, enabling explicit supervision of structured reasoning. Experiments demonstrate the superiority of the proposed method.
My main concern is how the proposed method generalizes to real-world images. To construct a large-scale CoT-style dataset that provides high-quality reasoning supervision, the authors propose to combine several random profiles with initial prompts and feed them into a text-to-image model to generate each user’s reference images (preferred and non-preferred) and two candidate images. However, the generated images, as shown in the paper and supplementary, lack photorealism, and would also cover a very limt
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Aesthetic Perception and Analysis
