MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

Xuhui Zheng; Kang An; Ziliang Wang; Yuhang Wang; Faqiang Qian; Yichao Wu

arXiv:2512.07203 · cs.CV · December 9, 2025

TL;DR

MMRPT is a reinforcement-learning-based pre-training method for vision-language models that rewards visual reasoning over caption imitation, improving zero-shot performance and robustness under supervised fine-tuning.

Contribution

First integration of reinforcement learning into large vision-language model pre-training to enhance visual grounding and reasoning capabilities.

Findings

01. Consistent zero-shot performance improvements across benchmarks.

02. Significant robustness gains under supervised fine-tuning.

03. Reinforcement-driven masked reasoning enhances model generalization.

Abstract

Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised…
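
The masking step described in the abstract (score each sentence by its attention over visual tokens, then mask the most vision-dependent ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the attention layout (visual tokens first in each row), the mean-share scoring rule, the fixed threshold, and the names `visual_dependency`, `mask_vision_dependent`, and `MASK_TOKEN` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

MASK_TOKEN = "<mask>"  # hypothetical placeholder, not taken from the paper


def visual_dependency(attn_rows: Sequence[Sequence[float]], num_visual: int) -> float:
    """Mean share of attention mass that a sentence's tokens place on visual tokens.

    Assumes each row holds one text token's attention weights and that the
    first `num_visual` entries of every row correspond to visual tokens.
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        shares.append(sum(row[:num_visual]) / total if total > 0 else 0.0)
    return sum(shares) / len(shares) if shares else 0.0


def mask_vision_dependent(
    sentences: Sequence[str],
    attn_per_sentence: Sequence[Sequence[Sequence[float]]],
    num_visual: int,
    threshold: float = 0.5,  # assumed cutoff; the source does not give one
) -> Tuple[List[str], List[float]]:
    """Replace sentences whose visual-dependency score meets the threshold."""
    scores = [visual_dependency(rows, num_visual) for rows in attn_per_sentence]
    masked = [
        MASK_TOKEN if score >= threshold else sentence
        for sentence, score in zip(sentences, scores)
    ]
    return masked, scores


# Toy input: 2 visual tokens followed by 2 text positions in each attention row.
sentences = [
    "A red car is parked by the curb.",
    "Cars are a common mode of transport.",
]
attn = [
    [[0.4, 0.4, 0.1, 0.1], [0.3, 0.5, 0.1, 0.1]],    # attends mostly to the image
    [[0.1, 0.1, 0.4, 0.4], [0.05, 0.05, 0.5, 0.4]],  # attends mostly to the text
]
masked, scores = mask_vision_dependent(sentences, attn, num_visual=2)
print(masked)
print(scores)
```

In this toy input the first sentence places most of its attention on the visual tokens and is masked, while the second is kept. How the model reconstructs the masked spans and how the semantic-visual reward is computed are not specified in the text above, so they are left out of the sketch.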

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling