Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye

TL;DR
Faithful-MR1 is a training framework that improves multimodal reasoning by explicitly supervising visual attention and reinforcing faithful use of visual evidence during reasoning, which improves performance on multimodal benchmarks.
Contribution
It proposes a novel two-stage training method that explicitly anchors and reinforces visual attention to enhance faithfulness in multimodal reasoning models.
Findings
Outperforms recent baselines on multimodal benchmarks when built on Qwen2.5-VL-Instruct models.
Achieves better reasoning accuracy with less training data.
Effectively aligns visual perception with reasoning processes.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal…
