From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
Ruilin Luo, Chufan Shi, Yizhen Zhang, Cheng Yang, Songtao Jiang, Tongkun Guan, Ruizhe Chen, Ruihang Chu, Peng Wang, Mingkun Yang, Yujiu Yang, Junyang Lin, Zhibo Yang

TL;DR
This paper introduces a new attention-based metric called Visual Attention Score (VAS) to analyze cold-start training in multimodal models, revealing that attention to visual tokens correlates strongly with reasoning performance and proposing a novel framework, AVAR, to improve multimodal reasoning.
Contribution
The paper presents VAS as a novel metric for understanding cold-start effects and introduces AVAR, a framework that improves multimodal reasoning performance by an average of 7.0% across benchmarks.
Findings
VAS strongly correlates with reasoning performance (r=0.9616)
Multimodal cold-start does not increase VAS, unlike text-only cold-start
AVAR achieves an average 7.0% gain across benchmarks
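To make the first finding concrete, the sketch below computes a correlation coefficient between per-model VAS values and benchmark scores. It assumes that r denotes the Pearson coefficient, which the summary does not state explicitly. The function name `pearson_r` and the numbers in the usage lines are made up for illustration and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (VAS, accuracy) pairs for five models; illustrative only.
vas_values = [0.8, 1.1, 1.5, 1.9, 2.4]
accuracies = [41.0, 44.5, 47.0, 52.5, 55.0]
print(round(pearson_r(vas_values, accuracies), 4))
```

A value close to 1 means the points lie near a rising straight line; it does not by itself show that higher VAS causes better reasoning, which is why the paper adds intervention experiments.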
Abstract
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, yielding performance gains of 12% without any…
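The two mechanisms named in the abstract can be illustrated on a single attention row. The sketch below is a rough approximation under stated assumptions, not the paper's implementation: it takes VAS to be the ratio of attention mass on visual tokens to attention mass on system tokens (as one reviewer paraphrases Eq. 1), and it models the training-free intervention as a simple multiplicative boost followed by renormalisation. The paper's exact formulas, including how layers and heads are aggregated, are not reproduced on this page. The function names, index lists, and `scale` parameter are hypothetical.

```python
def visual_attention_score(attn_row, visual_idx, system_idx):
    """Ratio of attention mass on visual tokens to mass on system tokens.

    attn_row: attention weights of one query over all key positions.
    visual_idx, system_idx: key positions of visual and system tokens
    (hypothetical inputs; a real model would derive them from the prompt).
    """
    visual_mass = sum(attn_row[i] for i in visual_idx)
    system_mass = sum(attn_row[i] for i in system_idx)
    if system_mass == 0:
        raise ValueError("system tokens receive no attention; ratio undefined")
    return visual_mass / system_mass


def boost_visual_attention(attn_row, visual_idx, scale):
    """Multiply weights on visual tokens by `scale`, then renormalise to sum to 1."""
    visual = set(visual_idx)
    scaled = [w * scale if i in visual else w for i, w in enumerate(attn_row)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Toy row: positions 0-1 are system tokens, 2-4 visual tokens, 5 a text token.
row = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]
system_idx, visual_idx = [0, 1], [2, 3, 4]

before = visual_attention_score(row, visual_idx, system_idx)
boosted = boost_visual_attention(row, visual_idx, scale=2.0)
after = visual_attention_score(boosted, visual_idx, system_idx)
print(before, after)
```

Because the renormalisation constant cancels in the ratio, boosting visual weights by a factor `scale` multiplies this toy score by the same factor. The same score would also rise if attention to system tokens shrank while visual attention stayed fixed, which is the ambiguity one of the reviews below raises.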
Peer Reviews
Decision: ICLR 2026 Poster
(1) It is an interesting observation that reasoning performance has an almost linear correlation with attention toward visual tokens. (2) The paper shows that, even with training-free re-weighting, models can achieve better performance when steering their focus to visual tokens. (3) A new data synthesis paradigm is developed, which could potentially benefit the training of subsequent models. (4) The proposed method shows promise in improving the performance of the generic baselines on multiple ben
(1) The variance in attention distribution could also be affected by the design of system prompts used in different models. How do different choices of system prompt affect the visual attention score? Would tuning the prompts, e.g., imposing stronger focus on visual content, help boost the reasoning performance? (2) The paper only experiments with a single baseline (i.e., Qwen2.5-VL-7B, which is not a reasoning-specific model), and it is unclear whether it will generalize. It would be reason
1. The finding that text-only cold-start initialization can be more effective at increasing VAS than multimodal cold-starts is a novel and impressive observation. 2. The training-free intervention (Section 4) provides a compelling causal link between attention allocation and performance, even without full retraining. 3. The final model, AVAR-Thinker, demonstrates a significant performance improvement over the Qwen-2.5-VL-7B baseline, especially on challenging reasoning tasks. 4. The curated V
1. **The validity of VAS as a primary metric for visual grounding is questionable.** The paper defines VAS as a ratio of attention to visual tokens over system tokens (Eq. 1). A high VAS score could be achieved simply by aggressively reducing attention to system tokens, even if the absolute attention to visual tokens remains low or unchanged. The paper does not provide analysis to disentangle these two effects. To truly support the claim that AVAR "attends to visual tokens more," the authors sho
1. Innovative and Logical Metric. The authors introduce VAS, a novel metric that quantifies visual attention and its relationship with reasoning performance. The strong correlation between VAS and model performance provides a fresh perspective on multimodal reasoning (Sec. 3.1-3.2). 2. Sound Motivation. Based on VAS, the paper offers a clear analysis revealing the challenges in MLRM training (Sec. 3.3). The authors discover Lazy Attention Localization, an unexpected phenomenon where multimodal
1. In L184–186, "models initialized with unimodal reasoning data, such as OVR-CS and Revisual-R1-CS, maintain 15–20% higher attention to visual features compared to those trained with multimodal reasoning data such as R1-OneVision and ThinkLite-VL." appears unfair, since these methods differ in multiple factors such as dataset composition, model architecture, and training strategy, not merely in whether multimodal reasoning data are used. 2. For Eq. (4) and Eq. (8), the hyperparameters $\alpha$
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
