V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

Zhiwei Ning; Xuanang Gao; Jiaxi Cao; Gengming Zhang; Shengnan Ma; Wenwen Tong; Hanming Deng; Jie Yang; Wei Liu

arXiv:2605.10172·cs.CV·May 12, 2026

TL;DR

V-ABS introduces an action-observer beam search framework with adaptive weighting and a large-scale dataset, significantly improving multi-step visual reasoning in multimodal large language models.

Contribution

The paper presents an action-observer driven beam search method with entropy-based adaptive weighting, together with a large-scale supervised fine-tuning (SFT) dataset, to improve reasoning stability and accuracy.

Findings

01 Achieves a 19.7% average improvement over the Qwen3-VL-8B baseline.

02 Demonstrates state-of-the-art performance across eight benchmarks.

03 Provides a large-scale dataset for training and evaluation.

Abstract

Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset…
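
To make the idea of entropy-based adaptive weighting concrete, the following is a minimal sketch of one beam-search step that blends a policy prior with observer feedback. It is not the paper's algorithm: the abstract does not give the formula, so the confidence measure (one minus normalized Shannon entropy), the linear blend, and all names (adaptive_weight, beam_step, the candidate tuple layout) are assumptions chosen for illustration only.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def adaptive_weight(prior_probs, observer_probs):
    """Hypothetical weighting rule (an assumption, not from the paper).

    Returns the weight placed on the prior. A source whose distribution
    is more peaked (lower entropy) is treated as more confident and
    receives a larger share of the weight.
    """
    n = len(prior_probs)
    if n < 2:
        return 0.5
    max_h = math.log(n)
    # Confidence = 1 - normalized entropy, clamped against rounding error.
    c_prior = max(0.0, 1.0 - entropy(prior_probs) / max_h)
    c_obs = max(0.0, 1.0 - entropy(observer_probs) / max_h)
    total = c_prior + c_obs
    if total <= 1e-12:
        return 0.5  # both sources are uninformative; split evenly
    return c_prior / total


def beam_step(candidates, beam_width):
    """Score candidate actions and keep the top `beam_width`.

    `candidates` is a list of (action, prior_score, observer_score)
    tuples; this layout is illustrative.
    """
    prior = softmax([c[1] for c in candidates])
    obs = softmax([c[2] for c in candidates])
    w = adaptive_weight(prior, obs)
    combined = [
        (c[0], w * p + (1.0 - w) * o)
        for c, p, o in zip(candidates, prior, obs)
    ]
    combined.sort(key=lambda item: item[1], reverse=True)
    return combined[:beam_width], w
```

In this sketch, when the prior is indifferent between actions but the observer clearly favors one, the weight shifts almost entirely to the observer, which is one plausible way to counteract a mismatch between what the model expected and what execution feedback actually shows.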
