Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou

TL;DR
Saliency-R1 enhances vision-language model interpretability and faithfulness by aligning saliency maps with visual evidence, improving reasoning transparency and accuracy.
Contribution
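Introduces a saliency map technique that highlights the image regions contributing to generated tokens without additional computational overhead, together with a reward based on the overlap between those saliency maps and human-annotated bounding boxes, to improve the interpretability and faithfulness of VLM reasoning.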
Findings
Improves reasoning faithfulness and interpretability.
Enhances overall task performance.
Aligns model focus with human-annotated regions.
Abstract
Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly their tendency to lean more on textual cues than on visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward…
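The overlap-based reward mentioned in the abstract can be illustrated with a small sketch. The abstract is truncated here and does not say how the overlap is measured, so the definition below (the fraction of total saliency mass that falls inside the annotated boxes) is an assumption made for illustration, not the authors' formulation. The function name `saliency_box_reward` and the half-open `(x0, y0, x1, y1)` box format are likewise hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixel coordinates, half-open.
Box = Tuple[int, int, int, int]


def saliency_box_reward(saliency: Sequence[Sequence[float]], boxes: List[Box]) -> float:
    """Assumed overlap measure: share of saliency mass inside the union of boxes."""
    total = 0.0
    inside = 0.0
    for y, row in enumerate(saliency):
        for x, value in enumerate(row):
            total += value
            # A pixel counts once, even if several boxes cover it.
            if any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in boxes):
                inside += value
    # An all-zero map carries no evidence, so it earns no reward.
    return inside / total if total > 0 else 0.0


# Toy 3x4 saliency map with most of its mass on pixel (x=1, y=1).
saliency = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
reward = saliency_box_reward(saliency, [(1, 1, 2, 2)])
print(reward)
```

Under this assumed definition the reward lies in [0, 1] and grows as the model's saliency concentrates inside the annotated regions. Other overlap measures, such as an IoU on a thresholded map, would fit the same interface.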
