V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
Zhiwei Ning, Xuanang Gao, Jiaxi Cao, Gengming Zhang, Shengnan Ma, Wenwen Tong, Hanming Deng, Jie Yang, Wei Liu

TL;DR
V-ABS introduces an action-observer beam search framework with adaptive weighting and a large-scale dataset, significantly improving multi-step visual reasoning in multimodal large language models.
Contribution
The paper presents an action-observer driven beam search method with entropy-based adaptive weighting, together with a large supervised fine-tuning dataset, to improve reasoning stability and accuracy.
Findings
Achieves a 19.7% average improvement over the Qwen3-VL-8B baseline.
Demonstrates state-of-the-art performance across eight benchmarks.
Provides a large-scale dataset for training and evaluation.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset…
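To make the idea of entropy-based adaptive weighting concrete, here is a minimal sketch of one beam-search selection step. The abstract does not give the actual formula, so everything below is an assumption for illustration: the function names (`adaptive_weight`, `beam_step`), the use of normalized Shannon entropy as an inverse confidence measure, and the linear mixing of prior and observer scores are hypothetical and may differ from what V-ABS does.

```python
import math


def softmax(scores):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    n = len(probs)
    if n <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def adaptive_weight(prior_scores, observer_scores):
    """Weight placed on the policy prior (hypothetical rule, not from the paper).

    A source whose score distribution has lower entropy is treated as more
    confident and receives a larger share of the weight.
    """
    conf_prior = max(0.0, 1.0 - normalized_entropy(softmax(prior_scores)))
    conf_obs = max(0.0, 1.0 - normalized_entropy(softmax(observer_scores)))
    total = conf_prior + conf_obs
    if total < 1e-12:  # both sources are uninformative: split evenly
        return 0.5
    return conf_prior / total


def beam_step(candidates, prior_scores, observer_scores, beam_width):
    """Rank candidate actions by the blended score and keep the top beam_width."""
    w = adaptive_weight(prior_scores, observer_scores)
    prior = softmax(prior_scores)
    observed = softmax(observer_scores)
    blended = [w * p + (1.0 - w) * o for p, o in zip(prior, observed)]
    ranked = sorted(zip(candidates, blended), key=lambda pair: pair[1], reverse=True)
    return ranked[:beam_width], w


# Example: the prior mildly prefers "crop", but the observer strongly favors "zoom".
beam, weight = beam_step(
    candidates=["crop", "zoom", "rotate"],
    prior_scores=[2.0, 1.0, 0.0],
    observer_scores=[0.0, 3.0, 0.0],
    beam_width=2,
)
```

In this toy setup the observer's distribution is sharper than the prior's, so the observer receives the larger weight and "zoom" ranks first even though the prior preferred "crop". This is the kind of correction of prior imagination by execution feedback that the abstract describes, under the assumptions stated above.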
