Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Changyuan Tian; Zhicong Lu; Huaxing Liu; Xiang Wang; Shuai Li; Yu Chen; Wenqian Lv; Zichuan Lin; Juncheng Diao; Deheng Ye

arXiv:2605.22072·cs.CL·May 22, 2026

TL;DR

Faithful-MR1 is a training framework for multimodal reasoning that explicitly supervises visual attention and reinforces faithful use of visual evidence, yielding better performance on multimodal benchmarks.

Contribution

The paper proposes a two-stage training method that explicitly anchors and reinforces visual attention to make multimodal reasoning models more faithful to visual evidence.

Findings

01 Outperforms recent baselines on multimodal benchmarks in experiments with Qwen2.5-VL-Instruct models.

02 Achieves better reasoning accuracy with less training data.

03 Effectively aligns visual perception with reasoning processes.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a two-part faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning. Shortfalls on either part lead to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, which exposes a perception-reasoning disconnect in which correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal…
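
For readers unfamiliar with the RLVR setting the abstract builds on, here is a minimal sketch of what a "verifiable reward" can look like: a program checks the model's final answer against a known ground truth and returns a scalar. This is a generic illustration, not the reward used in Faithful-MR1. The function names, the `\boxed{...}` answer convention, and the binary 0/1 scoring are all assumptions made for the example.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text inside the last \\boxed{...} span, or None if absent.

    Assumption: the model is prompted to wrap its final answer in \\boxed{}.
    Nested braces are not handled in this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Hypothetical binary reward: 1.0 on an exact (case-insensitive) match."""
    answer = extract_final_answer(response)
    if answer is None:
        # No parseable final answer, so nothing can be verified.
        return 0.0
    return 1.0 if answer.lower() == ground_truth.strip().lower() else 0.0
```

A reward of this kind checks only the final answer. It carries no signal about whether the reasoning that led there actually used the image, which is the gap between answer correctness and faithful perception and use that the abstract describes.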
