Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning

Siqi Yang; Zilve Gao; Haibo Qiu; Fanfan Liu; Peng Shi; Zhixiong Zeng; Qingmin Liao; Lin Ma

arXiv:2512.17227·cs.CV·December 22, 2025

Open Access

TL;DR

This paper introduces a two-stage curriculum framework for multimodal reasoning models that disentangles reasoning and perception skills, improving visual grounding and strategic perception in complex tasks.

Contribution

The paper proposes a disentangled training approach and a reinforcement learning-based policy for timing perception, with the aim of strengthening multimodal reasoning.

Findings

01. Improved visual grounding in long-chain reasoning tasks
02. Enhanced strategic perception through reinforcement learning
03. Disentangled training boosts reasoning robustness

Abstract

Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning ("how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency -- weakening abstract reasoning -- and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Language, Metaphor, and Cognition