ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li

TL;DR
ENACT introduces a novel benchmark for evaluating embodied cognition in vision-language models through world modeling tasks based on egocentric interactions, revealing significant gaps between models and human performance.
Contribution
This paper presents ENACT, a new benchmark for assessing embodied cognition in VLMs via world modeling tasks derived from egocentric interaction data.
Findings
Models perform better on inverse tasks than forward tasks.
Performance gap between models and humans widens with interaction horizon.
Models show anthropocentric biases like right-handed action preference.
Abstract
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning,…
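To make the two reordering tasks in the abstract concrete, here is a minimal sketch that builds one forward item and one inverse item from a toy trajectory. All names (`shuffle_with_key`, `make_forward_item`, `make_inverse_item`) are hypothetical, and keeping the initial observation fixed as an anchor in the forward task is an assumption rather than a detail stated above.

```python
import random


def shuffle_with_key(items, rng):
    """Shuffle items and return (shuffled, key) with shuffled[key[k]] == items[k]."""
    order = list(range(len(items)))
    rng.shuffle(order)
    shuffled = [items[i] for i in order]
    # key[k] is the position in `shuffled` that holds the k-th original item.
    key = [order.index(k) for k in range(len(items))]
    return shuffled, key


def make_forward_item(observations, actions, seed=0):
    """Forward task: actions stay ordered, later observations are shuffled.

    Assumption: the first observation is shown in place as an anchor.
    """
    rng = random.Random(seed)
    shuffled, key = shuffle_with_key(list(observations[1:]), rng)
    item = {
        "initial": observations[0],
        "actions": list(actions),
        "shuffled": shuffled,
    }
    return item, key


def make_inverse_item(observations, actions, seed=0):
    """Inverse task: observations stay ordered, actions are shuffled."""
    rng = random.Random(seed)
    shuffled, key = shuffle_with_key(list(actions), rng)
    item = {"observations": list(observations), "shuffled": shuffled}
    return item, key


if __name__ == "__main__":
    # Toy trajectory: four observations linked by three actions.
    obs = ["o0", "o1", "o2", "o3"]
    acts = ["open fridge", "pick up apple", "close fridge"]
    fwd, fwd_key = make_forward_item(obs, acts, seed=1)
    print(fwd["shuffled"], fwd_key)
```

In both cases the answer is a permutation (`key`), so a model's output can be scored without hand-written distractor options.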
Peer Reviews
Decision: ICLR 2026 Poster
- Compares the ability of VLMs to perform forward vs. inverse world modeling on the Behavior1k dataset with egocentric inputs.
- Provides several novel discussions and insights.
- Demonstrates consistent trends between open- and closed-source baselines.
1. Novelty: How does this approach compare to GVL [1]? GVL also performs a zero-shot reordering task using closed-source VLMs and reports high success rates on existing embodied benchmarks. I wonder if ENACT is just evaluating different models using GVL's proposed methods? Also, a citation is missing.
2. Limited Discussion of Model-Specific Differences: The paper mentions that the tested VLMs are trained on “static” datasets, but it is unclear what “static” means here (image-only/di
- The benchmark design is solid and principled. It uses POMDP framing with two tasks formulated as permutation prediction to avoid hand-crafted distractors. The forward task requires reordering images to match a given action sequence, while the inverse task requires reordering actions to match a given image sequence. An online verifier allows multiple valid permutations, and the metrics include Task Accuracy (exact match) and Pairwise Accuracy (adjacent consistency). The benchmark achieves solid
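The two metrics named in the review above can be sketched as follows. This is one plausible reading, not the benchmark's actual scoring code: the function names, the representation of the verifier's output as an explicit list `valid_orders`, and the definition of pairwise accuracy as the share of adjacent predicted pairs that are also adjacent in an accepted order are all assumptions.

```python
def task_accuracy(pred, valid_orders):
    """Exact match: 1.0 if the prediction equals any accepted order, else 0.0."""
    return float(any(list(pred) == list(v) for v in valid_orders))


def pairwise_accuracy(pred, valid_orders):
    """Best share of adjacent predicted pairs that are adjacent in an accepted order."""
    pred_pairs = list(zip(pred, pred[1:]))
    if not pred_pairs:
        # A sequence of length 0 or 1 has no adjacent pairs; fall back to exact match.
        return task_accuracy(pred, valid_orders)
    best = 0.0
    for order in valid_orders:
        ref_pairs = set(zip(order, order[1:]))
        hits = sum(pair in ref_pairs for pair in pred_pairs)
        best = max(best, hits / len(pred_pairs))
    return best


if __name__ == "__main__":
    # Hypothetical case: the verifier accepts two orders for a four-step item.
    valid = [[0, 1, 2, 3], [0, 2, 1, 3]]
    print(task_accuracy([0, 2, 1, 3], valid))      # accepted alternative order
    print(pairwise_accuracy([0, 1, 3, 2], valid))  # partially consistent order
```

Scoring against a set of accepted orders, rather than a single key, reflects the point that more than one permutation can be valid for the same item.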
- The central research question, in my view, would benefit from clarification. The phrasing “to what extent does embodied cognition emerge from such training?” seems to suggest an over-time evaluation across training stages, yet the current evaluation (on end models; based solely on benchmark performance) cannot adequately address that. VLMs typically undergo multiple distinct training phases (e.g., large-scale language pretraining, multimodal alignment, instruction tuning, reinforcement learnin
### Main Strengths of the ENACT Paper
* **1. Novelty and Comprehensiveness in Evaluation (Consequence-Aware World Modeling)**
  * **Summary**: ENACT elevates the evaluation of embodied AI beyond simple perception or isolated interactions to **consequence-aware world modeling** over **extended time horizons**.
  * **Detail**: It introduces complementary **Forward** (Action $\rightarrow$ State) and **Inverse** (State $\rightarrow$ Action) sequence reordering tasks, providing a holistic test o
### Main Limitations of the ENACT Benchmark (Weaknesses)
* **1. Reliance on Simulation and the Sim-to-Real Gap**
  * **Summary**: The entire ENACT dataset and evaluation are restricted to high-fidelity trajectories within the **BEHAVIOR simulator**.
  * **Detail**: This introduces the inherent **sim-to-real gap** limitation. The model's strong performance (or failures) in the clean, deterministic physics and visuals of the simulation may not directly translate to the stochastic, noisy, and
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Domain Adaptation and Few-Shot Learning
