EgoTwin: Dreaming Body and View in First Person

Jingqiao Xiu; Fangzhou Hong; Yicong Li; Mengze Li; Wentao Wang; Sirui Han; Liang Pan; Ziwei Liu

arXiv:2508.13013·cs.CV·August 19, 2025

Open Access · 3 Reviews

TL;DR

EgoTwin is a novel framework that jointly generates egocentric videos and human motion by modeling head-centric motion and causal interactions, addressing the underexplored area of first-person view synthesis.

Contribution

The paper introduces EgoTwin, a diffusion transformer-based model with head-centric motion representation and causal interaction mechanisms for egocentric video and motion generation.

Findings

01

EgoTwin effectively aligns camera trajectories with human head motion.

02

The framework captures causal interplay between visual dynamics and motion.

03

Experiments show superior performance on a new real-world dataset.

Abstract

While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, which requires modeling first-person view content along with camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion…
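As a rough illustration of the two challenges named in the abstract, the Python sketch below re-expresses joint positions relative to the head joint and measures how far a camera path drifts from the head path. It is a minimal sketch, not EgoTwin's implementation: the function names, the joint layout, the translation-only re-anchoring (head rotation is ignored), and the mean-distance error are all assumptions made for illustration.

```python
from math import dist
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Frame = List[Vec3]  # one 3D position per joint (hypothetical layout)


def to_head_centric(frames: List[Frame], head_idx: int) -> List[Frame]:
    """Express every joint relative to the head joint of the same frame.

    Assumption: translation only. A fuller version would also rotate
    the joints into the head's orientation.
    """
    out = []
    for joints in frames:
        hx, hy, hz = joints[head_idx]
        out.append([(x - hx, y - hy, z - hz) for (x, y, z) in joints])
    return out


def head_trajectory(frames: List[Frame], head_idx: int) -> List[Vec3]:
    """Collect the head joint position from every frame."""
    return [joints[head_idx] for joints in frames]


def mean_translation_error(camera_path: List[Vec3], head_path: List[Vec3]) -> float:
    """Average Euclidean distance between paired camera and head positions."""
    if len(camera_path) != len(head_path) or not head_path:
        raise ValueError("paths must be non-empty and of equal length")
    total = sum(dist(c, h) for c, h in zip(camera_path, head_path))
    return total / len(head_path)


if __name__ == "__main__":
    # Two frames, two joints: index 0 is the pelvis, index 1 is the head.
    frames = [
        [(0.0, 0.0, 0.0), (0.0, 0.0, 1.6)],
        [(1.0, 0.0, 0.0), (1.0, 0.0, 1.7)],
    ]
    print(to_head_centric(frames, head_idx=1))
    path = head_trajectory(frames, head_idx=1)
    print(mean_translation_error(path, path))
```

In this toy setup the head joint sits at the origin of every re-anchored frame, so a camera attached to the head shares its coordinate frame with the motion data. That is the intuition behind anchoring motion at the head rather than at the body root.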

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The proposed approach, built on a triple-branch diffusion transformer, effectively models text, video, and motion within a unified framework. The cybernetics-inspired interaction mechanism is interesting, which captures the bidirectional dependencies between visual and motion streams.
- The paper is well-written, clearly structured, and easy to follow, with strong motivation and coherent presentation of technical details.
- The evaluation is extensive and thorough, covering diverse quantitat…

Weaknesses

- The approach relies exclusively on the Nymeria dataset for training and evaluation. While Nymeria is a large and well-curated dataset, the generalization of EgoTwin to other egocentric datasets (e.g., EgoExo4D) or to synthetic environments remains untested.
- The text–video component is built on CogVideoX. While this is a strong foundation, it raises the question of whether using a more recent or higher-capacity base model could further improve video quality or multimodal ali…

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The paper is well-written and easy to follow, with the main idea clearly articulated. The implementation details are sufficient for reproduction.
2. The design of head-centric motion tokenization is novel.
3. The interaction mechanism is reasonable; it employs local temporal attention, which not only improves accuracy but also reduces computational load.
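To make the computational point about local temporal attention concrete, here is a minimal Python sketch of a banded attention mask. It is a generic illustration, not the paper's interaction mechanism: the function names, the symmetric window, and the assumption that queries and keys share one time index per step are all hypothetical.

```python
from typing import List


def local_temporal_mask(num_query_steps: int, num_key_steps: int, window: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query step q may attend to key step k.

    Assumption: a query only sees keys within `window` time steps of itself.
    """
    if window < 0:
        raise ValueError("window must be non-negative")
    return [
        [abs(q - k) <= window for k in range(num_key_steps)]
        for q in range(num_query_steps)
    ]


def count_allowed(mask: List[List[bool]]) -> int:
    """Number of query-key pairs that remain after masking."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    local = local_temporal_mask(6, 6, window=1)
    full = local_temporal_mask(6, 6, window=5)  # window covers every step
    print(count_allowed(local), "of", count_allowed(full), "pairs kept")
```

With a fixed window the number of retained pairs grows linearly with sequence length rather than quadratically, which is the usual source of the savings the reviewer mentions.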

Weaknesses

1. While head-centric motion tokenization may enhance head-centric evaluations, the quality of the whole body is not assessed. Can a full-body evaluation be included?
2. A related paper is not cited: https://egoallo.github.io.

Reviewer 03 · Rating 4 · Confidence 2

Strengths

1. To the best of my knowledge, EgoTwin is the first work to explicitly model joint egocentric video and human motion generation.
2. Several useful and reasonable techniques are proposed in this paper to handle this joint generation. The head-centric motion representation directly solves the limitation of root-centric representations. The attention mechanism is adjusted to improve the multi-modal learning and address causal interplay.

Weaknesses

1. The main concern is the unclear motivation. The paper does not sufficiently justify why a complex triple-branch diffusion generation (with specialized motion generation/video interaction modules) is required, especially given recent advances in scaling video foundation models (e.g., Genie3) that can generate high-fidelity, interactive egocentric videos via implicit scene and motion modeling. EgoTwin heavily depends on explicit motion representation and cross-modal modules, and well-labeled da…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Face recognition and analysis