Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

Ziqi Ni; Yuanzhi Liang; Rui Li; Yi Zhou; Haibin Huang; Chi Zhang; Xuelong Li

arXiv:2511.18719 · cs.CV · May 18, 2026

TL;DR

ViPO enhances visual generative models by bringing pixel-level feedback that is spatially and temporally aware into reinforcement learning, leading to better alignment with human preferences and improved generalization.

Contribution

Introduces ViPO, a GRPO variant that uses pretrained vision backbones to lift scalar rewards into structured, pixel-level advantages for reinforcement learning of visual generators.

Findings

01 ViPO outperforms vanilla GRPO on image and video benchmarks.

02 It improves in-domain alignment with human preferences.

03 It enhances out-of-domain generalization.

Abstract

Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while…
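
The idea of lifting a scalar, group-relative advantage into a per-pixel advantage map can be illustrated with a minimal sketch. This is not the paper's Perceptual Structuring Module: the function names, the `importance` input (a stand-in for whatever a pretrained vision backbone would produce), and the mean-one normalization are all assumptions made for illustration.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: z-score each reward within its group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def structured_advantages(rewards, importance, eps=1e-8):
    """Spread each sample's scalar advantage over a per-pixel map.

    rewards:    shape (G,), one scalar reward per sample in the group.
    importance: shape (G, H, W) for images or (G, T, H, W) for videos,
                non-negative scores. HYPOTHETICAL input: in this sketch it is
                just an array; the paper derives its maps from pretrained
                vision backbones.
    """
    adv = group_relative_advantages(rewards, eps)
    importance = np.asarray(importance, dtype=np.float64)

    # Normalize each sample's map to mean 1 over its spatial(-temporal) axes
    # (an assumption of this sketch), so the map only redistributes weight.
    axes = tuple(range(1, importance.ndim))
    weights = importance / (importance.mean(axis=axes, keepdims=True) + eps)

    # Broadcast the scalar advantage of each sample over its weight map.
    shape = (-1,) + (1,) * (importance.ndim - 1)
    return adv.reshape(shape) * weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = [0.2, 0.5, 0.9, 0.4]                  # made-up group of 4 samples
    importance = rng.uniform(0.1, 1.0, (4, 8, 8))   # placeholder saliency maps
    maps = structured_advantages(rewards, importance)
    print(maps.shape)
```

Because each weight map averages to one in this sketch, the mean of a sample's advantage map equals its scalar GRPO advantage, so the per-sample signal is redistributed across pixels rather than rescaled. Whether ViPO uses this particular normalization is not stated in the abstract.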

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning