Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, and Wentao Zhang

arXiv:2506.07235 · cs.CV · June 10, 2025

Open Access 1 Repo

TL;DR

This paper introduces a dynamic, verifier-guided visual reasoning framework for multi-modal large language models, enabling iterative refinement of visual understanding during inference, which improves accuracy and interpretability.

Contribution

It proposes a novel inference-time visual token scaling method with verifier-guided reasoning, formulated as a Markov Decision Process, and introduces a new dataset for training and evaluation.

Findings

01 Outperforms existing methods on visual reasoning benchmarks
02 Enables iterative, context-aware visual reasoning during inference
03 Provides more interpretable and grounded reasoning processes

Abstract

Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process, involving a reasoner that proposes visual actions and a verifier, which is trained via multi-step Direct…
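
The reasoner-verifier loop described in the abstract can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: every name here (State, run_episode, the toy reasoner, verifier, and executor) is hypothetical and does not come from the paper or the open-dataflow/vts-v repository, and the stopping rule (end the episode when the verifier's score drops below a threshold) is an assumed simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# All names below are hypothetical, chosen for illustration only.


@dataclass
class State:
    """MDP state: the question plus every visual token gathered so far."""
    question: str
    visual_tokens: List[str] = field(default_factory=list)


Action = str                                    # e.g. "crop:left"
Reasoner = Callable[[State], Optional[Action]]  # proposes an action, or None
Verifier = Callable[[State, Action], float]     # scores a proposed action
Executor = Callable[[Action], List[str]]        # returns new visual tokens


def run_episode(question: str, reasoner: Reasoner, verifier: Verifier,
                executor: Executor, threshold: float = 0.5,
                max_steps: int = 8) -> State:
    """Grow the visual context step by step until the verifier objects."""
    state = State(question)
    for _ in range(max_steps):
        action = reasoner(state)
        if action is None:
            break  # the reasoner has nothing more to propose
        if verifier(state, action) < threshold:
            break  # assumption: a low verifier score ends the episode
        state.visual_tokens.extend(executor(action))
    return state


# Toy stand-ins so the loop can run without any model.
def toy_reasoner(state: State) -> Optional[Action]:
    regions = ["crop:left", "crop:right", "crop:background"]
    n = len(state.visual_tokens)
    return regions[n] if n < len(regions) else None


def toy_verifier(state: State, action: Action) -> float:
    return 0.1 if action == "crop:background" else 0.9


def toy_executor(action: Action) -> List[str]:
    return [f"<tok:{action}>"]


final = run_episode("How many cups are on the table?",
                    toy_reasoner, toy_verifier, toy_executor)
print(final.visual_tokens)
```

In this toy run the visual context grows by one token per accepted action, which is the sense in which visual tokens scale at inference time; the third proposal scores below the threshold, so the episode stops with two tokens.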

Code & Models

Repositories

open-dataflow/vts-v · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection

Methods: ADaptive gradient method with the OPTimal convergence rate