Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

Kesen Zhao; Beier Zhu; Junbao Zhou; Xingyu Zhu; Zhongqi Yue; Hanwang Zhang

arXiv:2602.23959 · cs.CV · March 2, 2026

TL;DR

This paper introduces NV-CoT, a novel framework enabling multimodal large language models to perform visual reasoning using continuous numerical coordinates, improving localization and accuracy over existing methods.

Contribution

NV-CoT allows MLLMs to reason over images with continuous coordinates, reducing modality mismatch and architectural complexity, and supports both supervised and reinforcement learning training.

Findings

1. Significantly improves localization precision.
2. Enhances final answer accuracy.
3. Accelerates training convergence.

Abstract

Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates, which cause modality mismatch and semantic fragmentation, or fixed-granularity patches, which both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical…
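To make the idea of a bounding box as a continuous action concrete, here is a minimal sketch. It is not the paper's implementation: the abstract is truncated before it describes the training objective, so the Gaussian policy, the fixed `sigma`, the sigmoid squashing, and every function name below (`box_mean`, `sample_box`, `gaussian_log_prob`, `to_pixels`) are illustrative assumptions. The sketch shows a linear head mapping a hidden state to four normalized coordinates, sampling a box around that mean, and scoring the sample with a log-probability of the kind a policy-gradient method could use.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def box_mean(hidden, weights, bias):
    """Hypothetical linear head: hidden state -> (x1, y1, x2, y2) in (0, 1).

    `weights` holds one row per coordinate; this layout is an assumption.
    """
    return [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weights, bias)
    ]


def sample_box(mean, sigma, rng):
    """Draw a continuous action around the predicted mean (assumed Gaussian)."""
    return [rng.gauss(m, sigma) for m in mean]


def gaussian_log_prob(action, mean, sigma):
    """Log-density of an isotropic Gaussian, summed over the four coordinates."""
    const = math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
    return sum(-0.5 * ((a - m) / sigma) ** 2 - const for a, m in zip(action, mean))


def to_pixels(box, width, height):
    """Clamp normalized coordinates to [0, 1], order corners, scale to pixels."""
    x1, y1, x2, y2 = (min(max(v, 0.0), 1.0) for v in box)
    x1, x2 = min(x1, x2), max(x1, x2)
    y1, y2 = min(y1, y2), max(y1, y2)
    return (x1 * width, y1 * height, x2 * width, y2 * height)


if __name__ == "__main__":
    rng = random.Random(0)
    hidden = [0.5, -1.0, 2.0]  # stand-in for an MLLM hidden state
    weights = [[0.1, 0.2, -0.3], [0.0, 0.5, 0.1], [0.4, -0.2, 0.3], [0.2, 0.1, 0.6]]
    bias = [0.0, 0.0, 0.0, 0.0]

    mean = box_mean(hidden, weights, bias)
    action = sample_box(mean, sigma=0.05, rng=rng)
    print("log-prob of sampled box:", gaussian_log_prob(action, mean, 0.05))
    print("pixel box:", to_pixels(action, width=640, height=480))
```

Unlike textified coordinates, the four numbers here are never split into vocabulary tokens, so a small change in the box corresponds to a small change in the action and in its log-probability.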

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning