Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs
Jiaao Yu, Shenwei Li, Mingjie Han, Yifei Yin, Wenzheng Song, Chenghao Jia, Man Lan

TL;DR
This paper introduces a novel fine-tuning task and benchmark for vision-language models that emphasizes integrating visual context and commonsense reasoning, significantly enhancing their generalization in diverse multimodal scenarios.
Contribution
The paper proposes a new masked prediction task and a specialized evaluation benchmark, along with a reinforcement fine-tuning method, to improve reasoning and generalization in vision-language models.
Findings
Enhanced reasoning capabilities in VLMs through the new training task.
Improved out-of-distribution and cross-task generalization.
Reinforcement fine-tuning with prior sampling boosts model performance.
Abstract
Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet a significant gap persists in their adaptation to real-world multimodal scenarios, most notably vision-language tasks, because prior work has focused heavily on single-modal language settings. While efforts to transplant reinforcement learning techniques from NLP to VLMs have emerged, these approaches often remain confined to perception-centric tasks or reduce images to textual summaries. They therefore fail to fully exploit visual context and commonsense knowledge, which constrains the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine-tuning task, Masked Prediction via Context and Commonsense, which forces models to integrate visual context and…
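The abstract is truncated before it says how predictions for the masked content are scored, so the following is only an illustrative sketch of what a "verifiable reward" could look like for a masked-prediction task. It assumes each masked item has a single textual reference answer and that a prediction counts as correct when it matches the reference after simple normalization. The function names (`normalize`, `masked_prediction_reward`) and the exact-match rule are hypothetical and are not taken from the paper.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed normalization)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def masked_prediction_reward(prediction: str, reference: str) -> float:
    """Hypothetical binary reward: 1.0 if the predicted fill for the masked
    content matches the reference after normalization, otherwise 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0
```

A binary, rule-checkable signal of this kind is what makes a reward "verifiable" in the sense the abstract uses; the paper's actual scoring rule may well differ.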
