Thinking with Images via Self-Calling Agent
Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye

TL;DR
This paper introduces Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates interleaved multimodal reasoning as a language-only process, improving efficiency and performance without relying on scarce high-quality reasoning data.
Contribution
sCoT reformulates multimodal visual reasoning as a language-only process with self-calling, enhancing training efficiency and reasoning performance while reducing computational costs.
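To make the self-calling idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical text-in, text-out `model` callable that plays both the main agent and its subagents, which mirrors the parameter sharing described in the abstract. The prompt tags (`[decompose]`, `[subtask]`, `[final]`), the class name, and the toy model are all illustrative assumptions. How subagents actually receive visual input is not covered by this summary, so it is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: one callable serves as main agent and subagent,
# standing in for a single set of shared model parameters.
Model = Callable[[str], str]


@dataclass
class SelfCallingAgent:
    model: Model
    trace: List[str] = field(default_factory=list)  # main agent's text-only chain

    def call_subagent(self, subtask: str) -> str:
        # Isolated context: the subagent sees only its own subtask,
        # never the main agent's accumulated trace.
        return self.model(f"[subtask] {subtask}")

    def solve(self, question: str) -> str:
        self.trace = [f"[question] {question}"]
        # The main agent decomposes the task into atomic subtasks (one per line).
        subtasks = self.model(f"[decompose] {question}").splitlines()
        for subtask in subtasks:
            answer = self.call_subagent(subtask)
            # Only the textual result flows back into the main chain.
            self.trace.append(f"[subtask] {subtask} -> {answer}")
        return self.model("\n".join(self.trace) + "\n[final]")


def toy_model(prompt: str) -> str:
    """Stand-in for a real model, used only to exercise the control flow."""
    if prompt.startswith("[decompose]"):
        return "read the sign on the left\nread the sign on the right"
    if prompt.startswith("[subtask]"):
        return f"done({prompt[len('[subtask] '):]})"
    return "final answer"


agent = SelfCallingAgent(toy_model)
result = agent.solve("Which sign is larger?")
```

In this sketch the main chain stays purely textual: each subagent call returns a string, so nothing from another modality is interleaved into the main agent's trace.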
Findings
sCoT improves reasoning accuracy by up to 1.9% on HR-Bench 4K.
sCoT reduces GPU hours by approximately 75% compared to baseline methods.
sCoT demonstrates effective reasoning without explicit modality interleaving.
Abstract
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT trains effectively and efficiently because it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning…
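The abstract excerpt is cut off before it describes how rewards are defined, so the sketch below covers only the group-relative part of group-relative policy optimization as it is commonly described: several responses are sampled for the same prompt, and each reward is normalized by the group's mean and standard deviation. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean get positive advantages and those below get negative ones. When all responses in a group score the same, every advantage is zero and that group contributes no learning signal.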
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
