More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

Hoang Anh Just, Yifei Fan, Handong Zhao, Jiuxiang Gu, Ruiyi Zhang, Simon Jenni, Kushal Kafle, Ruoxi Jia, Jing Shi

TL;DR

This paper introduces PeRL-VL, a framework that enhances vision-language models by separately improving visual perception and logical reasoning, leading to better accuracy and consistency in multimodal tasks.

Contribution

PeRL-VL is a decoupled framework that improves visual extraction and reasoning in VLMs separately, targeting two failure modes of existing RLVR-trained models: inaccurate visual extraction and logically inconsistent chains of thought.

Findings

01 Pass@1 accuracy improved from 63.3% to 68.8%.

02 Outperforms standard RLVR and other baselines.

03 Enhances logical consistency and visual faithfulness.
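
The first finding is a gain of 5.5 percentage points in Pass@1. This page does not say how Pass@1 was computed, so the sketch below only illustrates the common definition: the unbiased pass@k estimator evaluated at k = 1, which reduces to the fraction of correct samples per problem, averaged over problems. The function names and the per-problem counts are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that were correct, k: budget.
    Equals 1 - C(n - c, k) / C(n, k); for k = 1 this is simply c / n.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(counts: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems, given (n, c) pairs (hypothetical data)."""
    return sum(pass_at_k(n, c, 1) for n, c in counts) / len(counts)


# Hypothetical example: two problems, 8 and 4 samples each.
example_counts = [(8, 2), (4, 4)]
example_score = mean_pass_at_1(example_counts)
```

With a single greedy sample per problem (n = 1), the same function returns 0 or 1 per problem, so the average is plain accuracy.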

Abstract

Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical…
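
As a rough illustration of the perception component described above, the following sketch combines a verifiable final-answer reward with a description reward. Everything specific here is an assumption: the abstract does not state how the two signals are combined, so the weighted sum, the `weight` value, the exact-match answer check, and the `description_scorer` callable (a stand-in for the VLM-based judge of faithfulness and sufficiency) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    description: str  # the model's self-generated image description
    answer: str       # the model's final answer


def answer_reward(predicted: str, gold: str) -> float:
    """Verifiable reward: 1.0 on a normalized exact match, else 0.0 (assumed check)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def combined_reward(
    rollout: Rollout,
    gold: str,
    description_scorer: Callable[[str], float],
    weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Answer reward plus a weighted description score clamped to [0, 1]."""
    r_answer = answer_reward(rollout.answer, gold)
    r_description = min(1.0, max(0.0, description_scorer(rollout.description)))
    return r_answer + weight * r_description


# Usage with a stub scorer standing in for the VLM judge.
rollout = Rollout(description="A bar chart with three bars.", answer=" 42 ")
reward = combined_reward(rollout, gold="42", description_scorer=lambda d: 0.8)
```

The point of the sketch is that a rollout with a correct answer but an unfaithful description scores lower than one that gets both right, which is a signal that final-answer supervision alone cannot provide.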

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)