Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

Siqu Ou; Tianrui Wan; Zhiyuan Zhao; Junyu Gao; Xuelong Li

arXiv:2602.08241·cs.AI·February 10, 2026

TL;DR

This paper identifies weak visual attention in multimodal large language models and introduces SAYO, a reinforcement learning-based approach that improves visual focus and reasoning accuracy.

Contribution

The paper proposes SAYO, a visual reasoning model trained with an RL framework whose region-level visual attention-based reward explicitly aligns optimization signals with visually grounded reasoning steps, strengthening MLLMs' visual focus and reasoning.

Findings

01. SAYO improves performance across multiple benchmarks.

02. Enhanced visual attention leads to better reasoning accuracy.

03. Reinforcement learning effectively aligns visual focus with reasoning.

Abstract

While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more…
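
To make the idea of a region-level visual attention-based reward concrete, here is a minimal sketch. It is not the paper's formulation, which the truncated abstract does not specify. It assumes a hypothetical 2D attention map over image patches and a hypothetical ground-truth box, and scores a reasoning step by the fraction of attention mass that falls inside the box. The names `region_attention_reward`, `attn`, and `box` are illustrative only.

```python
from typing import List, Tuple

# Hypothetical box format: (row_start, col_start, row_end, col_end), end-exclusive.
Box = Tuple[int, int, int, int]


def region_attention_reward(attn: List[List[float]], box: Box) -> float:
    """Return the fraction of total attention mass that lies inside `box`.

    This is an assumed, simplified reward, not the one defined in the paper.
    """
    r0, c0, r1, c1 = box
    total = sum(sum(row) for row in attn)
    if total <= 0:
        # No attention mass at all: give zero reward rather than divide by zero.
        return 0.0
    inside = sum(sum(row[c0:c1]) for row in attn[r0:r1])
    return inside / total


# Toy 4x4 attention map over image patches (values chosen for illustration).
attn = [
    [0.0,   0.0,  0.0,   0.0],
    [0.0,   0.25, 0.25,  0.0],
    [0.0,   0.25, 0.125, 0.0],
    [0.125, 0.0,  0.0,   0.0],
]

# Reward for a step whose relevant region is the central 2x2 block of patches.
reward = region_attention_reward(attn, (1, 1, 3, 3))
print(reward)  # 0.875
```

In an RL setup, a per-step score of this kind could be added to the task reward so that credit reaches the steps where the model attends to the relevant region; how SAYO actually combines these signals is not stated in the text above.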

Code & Models

Models

Craleo/Sayo-Qwen-8B (Hugging Face model; 34 downloads, 1 like)

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Action Observation and Synchronization