DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen

TL;DR
This paper introduces DT2IT-MRM, a framework that improves multimodal reward models by constructing debiased preference data and iteratively training on curated datasets, reporting state-of-the-art results on VL-RewardBench.
Contribution
The paper proposes a new pipeline and training framework that enhances the quality of multimodal preference datasets for reward modeling.
Findings
Achieves state-of-the-art performance on VL-RewardBench.
Effectively curates noisy preference datasets.
Improves alignment of multimodal models with human preferences.
Abstract
Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward…
