Thinking with Images via Self-Calling Agent

Wenxi Yang; Yuzhong Zhao; Fang Wan; Qixiang Ye

arXiv:2512.08511·cs.CV·December 12, 2025

TL;DR

This paper introduces Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates interleaved multimodal reasoning as a language-only process, improving efficiency and performance without relying on scarce high-quality reasoning data.

Contribution

sCoT reformulates multimodal visual reasoning as a language-only process with self-calling, enhancing training efficiency and reasoning performance while reducing computational costs.
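The self-calling idea can be illustrated with a small sketch. It is not the authors' implementation: the `Agent` class, the `Model` callable, and the use of plain strings to stand in for image regions are all hypothetical, and the subtask list is passed in by hand, whereas in sCoT the main agent would produce the decomposition itself. The sketch only shows the two properties the paper names: subagents share the main agent's parameters, and each one works in an isolated context.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for a vision-language model:
# (text prompt, image region description) -> text answer.
Model = Callable[[str, str], str]


@dataclass
class Agent:
    model: Model                                   # shared parameters
    context: List[str] = field(default_factory=list)

    def call_self(self, instruction: str, region: str) -> str:
        # The subagent reuses the same model object (parameter sharing)
        # but starts with an empty context (isolation from the caller).
        subagent = Agent(model=self.model)
        return subagent.model(instruction, region)

    def solve(self, question: str, subtasks: List[Tuple[str, str]]) -> str:
        # Only the textual results of the subtasks return to the main
        # agent, so its own reasoning trace stays language-only.
        for instruction, region in subtasks:
            result = self.call_self(instruction, region)
            self.context.append(f"{instruction} -> {result}")
        prompt = question + " | " + " ; ".join(self.context)
        return self.model(prompt, "full image")
```

With a stub such as `lambda prompt, region: f"[{region}] {prompt}"` as the model, `Agent(stub).solve("What does the sign say?", [("read the sign", "top-left crop")])` makes one subagent call followed by one final call on the full image.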

Findings

1. sCoT improves reasoning accuracy by up to 1.9% on HR-Bench 4K.

2. sCoT reduces GPU hours by approximately 75% compared to baseline methods.

3. sCoT demonstrates effective reasoning without explicit modality interleaving.

Abstract

Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning…
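For readers unfamiliar with group-relative policy optimization, its core step is to score each sampled response against the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step in its commonly described form. The function name and the use of the population standard deviation are assumptions made here for illustration; the paper's reward design and training setup are not reproduced.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption; implementations may use sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that beat their group's average receive a positive advantage and are reinforced; those below it receive a negative one. When all responses in a group earn the same reward, every advantage is zero and the group contributes no gradient signal.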

Code & Models

Models

ywenxi/SubagentVL-7B-Fine-Chart-80 (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning