VisualActBench: Can VLMs See and Act like a Human?
Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo

TL;DR
This paper introduces VisualActBench, a large-scale benchmark to evaluate vision-language models' ability to reason and act proactively in visual environments, revealing significant gaps compared to human reasoning.
Contribution
It proposes a new task, Visual Action Reasoning, and provides a comprehensive benchmark with annotated videos to assess models' proactive reasoning and decision-making capabilities.
Findings
Frontier models like GPT4o perform relatively well but still lag behind humans.
Current VLMs struggle with complex context interpretation and outcome anticipation.
Significant room for improvement remains in generating proactive, high-priority actions.
Abstract
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in…
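To make the annotation scheme in the abstract concrete, here is a minimal sketch of how one annotated action might be represented and filtered. The field names, the integer APL scale, the "higher APL means higher priority" convention, and the placeholder scenario and video identifiers are all assumptions for illustration; the benchmark's real schema is not given in this summary.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    """Proactive-reactive label described in the abstract."""
    PROACTIVE = "proactive"
    REACTIVE = "reactive"


@dataclass(frozen=True)
class AnnotatedAction:
    """Hypothetical record for one human-annotated action."""
    video_id: str            # placeholder identifier, not a real dataset key
    scenario: str            # one of the four scenarios (names not given here)
    description: str         # free-text action written by the annotator
    apl: int                 # Action Prioritization Level; numeric scale is assumed
    action_type: ActionType


def summarize(actions):
    """Count actions per proactive/reactive type and per APL value."""
    by_type = Counter(a.action_type for a in actions)
    by_apl = Counter(a.apl for a in actions)
    return by_type, by_apl


def high_priority_proactive(actions, min_apl):
    """Select proactive actions at or above an assumed APL threshold."""
    return [
        a for a in actions
        if a.action_type is ActionType.PROACTIVE and a.apl >= min_apl
    ]


# Toy records, invented purely to exercise the functions above.
sample = [
    AnnotatedAction("vid_001", "scenario_a", "move the obstacle", 3, ActionType.PROACTIVE),
    AnnotatedAction("vid_001", "scenario_a", "wait for instructions", 1, ActionType.REACTIVE),
    AnnotatedAction("vid_002", "scenario_b", "alert the person", 2, ActionType.PROACTIVE),
]
by_type, by_apl = summarize(sample)
urgent = high_priority_proactive(sample, min_apl=3)
```

Keeping the APL and the proactive-reactive label as separate fields mirrors the abstract's description of two independent labels per action, so a subset such as "proactive and high-priority" can be selected with a simple filter.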
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Visual Attention and Saliency Detection
