VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Jiazheng Xu; Yu Huang; Jiale Cheng; Yuanming Yang; Jiajun Xu; Yuan Wang; Wenbo Duan; Shen Yang; Qunlin Jin; Shurun Li; Jiayan Teng; Zhuoyi Yang; Wendi Zheng; Xiao Liu; Dan Zhang; Ming Ding; Xiaohan Zhang; Xiaotao Gu; Shiyu Huang; Minlie Huang; Jie Tang; Yuxiao Dong

arXiv:2412.21059 · cs.CV · January 6, 2026

TL;DR

VisionReward introduces a hierarchical, interpretable framework for learning fine-grained human preferences in image and video generation, significantly improving reward accuracy and alignment with human judgments.

Contribution

It presents a novel multi-dimensional, interpretable reward model for visual generative tasks, addressing limitations of black-box approaches and enhancing preference learning.

Findings

01

Surpasses VideoScore by 17.2% in preference prediction accuracy.

02

Text-to-video models optimized with VisionReward achieve a 31.6% higher pairwise win rate than the same models optimized with VideoScore.

03

Demonstrates significant improvements in aligning generated content with human preferences.

Abstract

Visual generative models have achieved remarkable progress in synthesizing photorealistic images and videos, yet aligning their outputs with human preferences across critical dimensions remains a persistent challenge. Although reinforcement learning from human feedback offers promise for preference alignment, existing reward models for visual generation face limitations, including black-box scoring without interpretability and the unexpected biases that can result from it. We present VisionReward, a general framework for learning human visual preferences in both image and video generation. Specifically, we employ a hierarchical visual assessment framework to capture fine-grained human preferences, and leverage linear weighting to enable interpretable preference learning. Furthermore, we propose a multi-dimensional consistent strategy when using VisionReward as a reward model during preference…
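
The linear-weighting idea in the abstract can be illustrated with a small sketch. It is not taken from the VisionReward repository: the checklist questions, the weights, the +1/-1 encoding of yes/no answers, and the logistic comparison of two samples are all assumptions chosen for illustration. The point is only that a linear score over fine-grained judgments can be broken down into one readable contribution per question.

```python
import math

# Hypothetical checklist questions and weights (illustrative only; the real
# questions and learned weights are not given in this listing).
WEIGHTS = {
    "subject_matches_prompt": 1.0,
    "no_visible_artifacts": 0.6,
    "lighting_is_natural": 0.3,
}


def contributions(answers, weights=WEIGHTS):
    """Per-question contribution: +weight for a yes answer, -weight for a no.

    The +1/-1 encoding is an assumption of this sketch.
    """
    return {q: w * (1.0 if answers[q] else -1.0) for q, w in weights.items()}


def reward(answers, weights=WEIGHTS):
    """Linear reward: the sum of the per-question contributions."""
    return sum(contributions(answers, weights).values())


def preference_probability(a, b, weights=WEIGHTS):
    """Logistic (Bradley-Terry style) probability that sample a beats sample b.

    Using a logistic link on the score difference is an assumption here.
    """
    return 1.0 / (1.0 + math.exp(-(reward(a, weights) - reward(b, weights))))


clean = {
    "subject_matches_prompt": True,
    "no_visible_artifacts": True,
    "lighting_is_natural": True,
}
flawed = dict(clean, no_visible_artifacts=False)

# The breakdown shows which question lowered the score of the flawed sample.
print(contributions(flawed))
print(preference_probability(clean, flawed))
```

Because the score is a plain weighted sum, a low reward can be traced to the specific questions that were answered "no", which is the kind of interpretability a single black-box scalar does not provide.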

Code & Models

Repositories

thudm/visionreward
PyTorch · Official

Videos

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods