Generative RLHF-V: Learning Principles from Multi-modal Human Preference

Jiayi Zhou; Jiaming Ji; Boyuan Chen; Jiapeng Sun; Wenqi Chen; Donghai Hong; Sirui Han; Yike Guo; Yaodong Yang

arXiv:2505.18531·cs.AI·May 27, 2025

TL;DR

Generative RLHF-V is a multi-modal alignment framework that combines generative reward models (GRMs) with reinforcement learning from human feedback (RLHF) to improve the performance and generalization of multi-modal large language models (MLLMs) aligned with human preferences.

Contribution

The paper proposes a two-stage pipeline that integrates generative reward modeling with multi-modal RLHF, with the aim of improving alignment accuracy and out-of-distribution generalization.

Findings

01. Improves 4 MLLMs' performance across 7 benchmarks by 18.1%.

02. Outperforms baseline RLHF, which improves by only 5.3%.

03. Achieves near-linear performance gains with more candidate responses.

Abstract

Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: "multi-modal generative reward modeling from RL", where RL guides GRMs to actively capture human intention, then predict the correct pair-wise scores; and "RL optimization from grouped comparison", which…
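The abstract breaks off before it describes the second stage, so the following is only a minimal sketch of one plausible reading of "grouped comparison": each candidate response in a group is compared pair-wise against every other candidate, and its reward is the mean of the scores it receives. The averaging rule and the names `PairScorer`, `grouped_comparison_rewards`, and `toy_length_scorer` are assumptions made for illustration and are not taken from the paper. The toy scorer stands in for a real GRM.

```python
from itertools import permutations
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not from the paper): a pair-wise
# scorer takes a prompt and two responses and returns one score per response.
PairScorer = Callable[[str, str, str], Tuple[float, float]]


def grouped_comparison_rewards(
    prompt: str, responses: List[str], score_pair: PairScorer
) -> List[float]:
    """Turn pair-wise scores into one scalar reward per response.

    Assumed rule: a response's reward is the mean of the scores it receives
    when placed first in a comparison against every other response.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two candidate responses to compare")
    totals = [0.0] * n
    # Visit every ordered pair (i, j) with i != j.
    for i, j in permutations(range(n), 2):
        score_i, _ = score_pair(prompt, responses[i], responses[j])
        totals[i] += score_i
    # Each response appears first in exactly n - 1 ordered pairs.
    return [total / (n - 1) for total in totals]


def toy_length_scorer(prompt: str, a: str, b: str) -> Tuple[float, float]:
    """Stand-in for a GRM: scores each response by its character count."""
    return float(len(a)), float(len(b))


rewards = grouped_comparison_rewards("q", ["a", "bbb", "cc"], toy_length_scorer)
print(rewards)
```

Under this assumed rule, a group of n candidates requires n(n-1) ordered comparisons, so the cost of scoring grows quadratically with group size.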

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques

Methods: ALIGN