EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Mingxian Lin; Wei Huang; Yitang Li; Chengjie Jiang; Kui Wu; Fangwei Zhong; Shengju Qian; Xin Wang; Xiaojuan Qi

arXiv:2507.10548 · cs.CV · July 15, 2025

TL;DR

EmbRACE-3K is a dataset and benchmark for evaluating vision-language models in embodied, interactive environments. It exposes the limitations of current models and shows that supervised fine-tuning and reinforcement learning improve their performance.

Contribution

The paper presents EmbRACE-3K, a new large-scale dataset and benchmark for embodied reasoning in complex environments, and shows how fine-tuning improves model performance.

Findings

01

Models achieve success rates below 20% in zero-shot settings.

02

Fine-tuning with supervised and reinforcement learning improves performance.

03

EmbRACE-3K effectively enables the development of embodied reasoning capabilities.

Abstract

Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including…

Taxonomy

Topics: Semantic Web and Ontologies