PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments

Weijie Zhou; Xuantang Xiong; Yi Peng; Manli Tao; Chaoyang Zhao; Honghui Dong; Ming Tang; Jinqiao Wang

arXiv:2510.21111·cs.CV·October 27, 2025

TL;DR

This paper introduces Active Visual Reasoning (AVR), a new task for multimodal large language models that involves actively exploring and reasoning in partially observable, interactive environments, supported by a new benchmark and dataset.

Contribution

The paper proposes the AVR task, creates the CLEVR-AVR benchmark and the AVR-152k dataset, and develops PhysVLM-AVR, a model that advances active visual reasoning in complex environments.

Findings

01 PhysVLM-AVR achieves state-of-the-art results on multiple benchmarks.

02 Current embodied MLLMs struggle with active information acquisition.

03 The dataset enables training models for iterative reasoning and decision-making.

Abstract

Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment (moving, examining, and manipulating objects) to gather information through a closed-loop process integrating perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. AVR requires agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously…
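The three requirements listed in the abstract (acting to acquire information, integrating observations across steps, and adjusting decisions from feedback) can be illustrated with a toy closed loop. The sketch below is not the paper's method and not the CLEVR-AVR environment: `ToyScene`, `lift_top`, and `answer_count` are hypothetical names, and the scene assumes the agent can see how high each visible object sits, which tells it whether something may still be hidden underneath.

```python
from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical scene (an assumption, not CLEVR-AVR): stacks of coloured
    objects where only the top object of each stack is visible."""
    stacks: list  # each stack is a list of colour strings, bottom -> top

    def observe(self):
        # Partial observation: (stack index, height level, colour) of each top object.
        return [(i, len(s) - 1, s[-1]) for i, s in enumerate(self.stacks) if s]

    def lift_top(self, i):
        # Physical action: remove the top object of stack i, exposing the one below.
        self.stacks[i].pop()


def answer_count(scene, colour, max_steps=10):
    """Count objects of `colour`, acting until no stack can hide anything."""
    memory = {}  # (stack, level) -> colour, accumulated across steps
    steps = 0
    while True:
        view = scene.observe()                             # perception
        for i, level, c in view:
            memory[(i, level)] = c                         # integrate observations
        hidden = [i for i, level, _ in view if level > 0]  # reasoning: what is still occluded?
        if not hidden or steps >= max_steps:
            break
        scene.lift_top(hidden[0])                          # action chosen from current feedback
        steps += 1
    return sum(c == colour for c in memory.values()), steps


scene = ToyScene([["red", "blue"], ["green"], ["red", "red", "blue"]])
passive = sum(c == "red" for _, _, c in scene.observe())   # single static view
active, steps = answer_count(scene, "red")
print(passive, active, steps)
```

In this toy scene a single static view counts 0 red objects, while the loop finds all 3 after three lift actions. That gap between passive and active answers is the kind of difference the AVR task is meant to expose.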

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis