EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, and Taku Komura

arXiv:2602.23205 · cs.CV · April 3, 2026

TL;DR

EmbodMocap introduces a portable, dual-iPhone system for in-the-wild 4D human-scene reconstruction, enabling large-scale, scene-consistent data collection without static setups.

Contribution

A novel, affordable dual-iPhone pipeline for metric-scale, scene-aware human motion capture in everyday environments, improving over monocular methods.

Findings

01

The dual-iPhone setup achieves better alignment and reconstruction than a single iPhone or monocular models.

02

Enables training of models for human-scene reconstruction, physics-based animation, and robot motion control.

03

Experiments validate the pipeline's effectiveness and its applications in embodied AI.

Abstract

Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting…
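The idea of placing two RGB-D views in a unified metric world coordinate frame can be illustrated with a minimal sketch. It assumes a pinhole camera model and that each phone's camera-to-world pose is already known; it does not reproduce the paper's joint calibration, and every name in it (`backproject`, `to_world`, `fuse_views`, the intrinsics and poses) is hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat4 = Sequence[Sequence[float]]  # row-major 4x4 camera-to-world transform


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def to_world(p_cam: Vec3, cam_to_world: Mat4) -> Vec3:
    """Apply a rigid 4x4 camera-to-world transform to a camera-frame point."""
    x, y, z = p_cam
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_world[:3]
    )


def fuse_views(views) -> List[Vec3]:
    """Merge (intrinsics, pose, samples) triples into one world-frame point list."""
    points = []
    for intrinsics, cam_to_world, samples in views:
        for u, v, d in samples:
            p_cam = backproject(u, v, d, *intrinsics)
            points.append(to_world(p_cam, cam_to_world))
    return points


# Assumed intrinsics (fx, fy, cx, cy) shared by both phones, for illustration only.
K = (500.0, 500.0, 320.0, 240.0)
# Phone A sits at the world origin; phone B is shifted 1 m along world x.
POSE_A = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
POSE_B = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

# Each phone observes the same physical point at a different pixel.
world_points = fuse_views([
    (K, POSE_A, [(445.0, 240.0, 2.0)]),
    (K, POSE_B, [(195.0, 240.0, 2.0)]),
])
```

In this toy setup both observations map to the same world-frame point, which is the consistency a shared metric frame is meant to provide. With real captures the poses come from calibration, and depth noise would make the two estimates agree only approximately.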

Code & Models

Repositories

wenjiawang0312/EmbodMocap (GitHub)
