From Perception to Action: An Interactive Benchmark for Vision Reasoning

Yuhao Wu; Maojia Song; Yihuai Lan; Lei Wang; Zhiqiang Hu; Yao Xiao; Heng Zhou; Weihua Zheng; Dylan Raharja; Soujanya Poria; Roy Ka-Wei Lee

arXiv:2602.21015 · cs.CV · February 25, 2026

Open Access

TL;DR

This paper introduces CHAIN, an interactive benchmark for evaluating vision models' ability to understand and act within physical environments, highlighting current models' limitations in reasoning about structure and causality.

Contribution

The paper presents CHAIN, a novel physics-based, interactive benchmark for assessing vision models' reasoning and planning capabilities in dynamic, structured environments.

Findings

01. State-of-the-art models struggle with physical structure understanding.
02. Models have difficulty planning long-horizon actions.
03. Current models cannot reliably translate perception into effective actions.

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Action Observation and Synchronization