InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Yuhang Zang; Xiaoyi Dong; Pan Zhang; Yuhang Cao; Ziyu Liu; Shengyuan Ding; Shenxi Wu; Yubo Ma; Haodong Duan; Wenwei Zhang; Kai Chen; Dahua Lin; Jiaqi Wang

arXiv:2501.12368·cs.CV·May 21, 2025

TL;DR

InternLM-XComposer2.5-Reward is a straightforward multi-modal reward model that aligns large vision language models with human preferences, improving instruction following, response selection, and data filtering across diverse domains.

Contribution

The paper introduces a simple, effective multi-modal reward model, builds a high-quality preference corpus, provides an open-source implementation, and demonstrates the model's use in reinforcement learning, response selection, and data filtering.
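To make two of the applications named above concrete, here is a minimal Python sketch of response selection (best-of-N) and data filtering driven by a scalar reward. Everything in it is an illustrative assumption rather than the paper's implementation: the `Scorer` type, `select_best`, `filter_pairs`, and especially `toy_score`, which is a word-overlap stand-in and not a real reward model. Image and video inputs are omitted for brevity; a real multi-modal scorer would take them alongside the prompt.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: (prompt, response) -> scalar reward.
# A real multi-modal reward model would also receive image/video inputs.
Scorer = Callable[[str, str], float]


def select_best(prompt: str, candidates: Sequence[str], score: Scorer) -> str:
    """Best-of-N selection: return the candidate with the highest reward."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda response: score(prompt, response))


def filter_pairs(
    pairs: Sequence[Tuple[str, str]], score: Scorer, threshold: float
) -> List[Tuple[str, str]]:
    """Data filtering: keep (prompt, response) pairs scoring at or above threshold."""
    return [(p, r) for p, r in pairs if score(p, r) >= threshold]


def toy_score(prompt: str, response: str) -> float:
    """Stand-in scorer (NOT a reward model): fraction of prompt words echoed."""
    prompt_words = set(prompt.lower().split())
    if not prompt_words:
        return 0.0
    response_words = set(response.lower().split())
    return len(prompt_words & response_words) / len(prompt_words)


prompt = "describe the red car"
candidates = ["a red car", "blue sky", "the red car is parked"]

best = select_best(prompt, candidates, toy_score)
kept = filter_pairs([(prompt, c) for c in candidates], toy_score, threshold=0.5)
```

In practice `toy_score` would be replaced by a call to a trained reward model; the selection and filtering logic stays the same because both only need a scalar score per response.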

Findings

01 Achieves top performance on multi-modal reward benchmarks.

02 Shows competitive results on text-only reward benchmarks.

03 Enhances instruction following and dialogue quality in experiments.

Abstract

Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding.…

Code & Models

Repositories

internlm/internlm-xcomposer
PyTorch · Official

Taxonomy

Topics: Business Process Modeling and Analysis

Methods: Sparse Evolutionary Training