Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task

Yurui Dong; Ziyue Wang; Shuyun Lu; Dairu Liu; Xuechen Liu; Fuwen Luo; Peng Li; Yang Liu

arXiv:2603.15467 · cs.CV · March 17, 2026

TL;DR

This paper introduces EscapeCraft-4D, a customizable 4D environment for evaluating whether multimodal large models can perceive selectively across modalities and reason with time awareness under time-varying, irreversible conditions.

Contribution

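The paper presents a new environment and benchmark for assessing temporal awareness and cross-modal integration in large models. It addresses a limitation of earlier 2D- and 3D-focused environments, which offer limited support for temporally dependent auditory signals and selective cross-modal integration.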

Findings

1. Models struggle with modality bias.

2. Models show significant gaps in integrating modalities under time constraints.

3. The results offer insights into how modalities interact in complex reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks. They offer limited support for temporally dependent auditory signals and for selective cross-modal integration, where different modalities may provide complementary or interfering information, both of which are essential capabilities for realistic multimodal reasoning. As a result, whether models can actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. To this end, we introduce EscapeCraft-4D, a customizable 4D environment for assessing selective cross-modal perception and time awareness in Omni models. It incorporates trigger-based auditory sources, temporally transient evidence, and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Social Robot Interaction and HRI