TL;DR
EgoWorld is a framework that reconstructs egocentric views from rich exocentric observations (point clouds, 3D hand poses, and text). It removes earlier methods' dependence on 2D cues and synchronized multi-view setups, and achieves state-of-the-art results on four datasets.
Contribution
EgoWorld introduces a new approach that uses 3D exocentric observations and diffusion models to generate dense egocentric images without relying on 2D cues or synchronized multi-view data.
Findings
Achieves state-of-the-art performance on four datasets (H2O, TACO, Assembly101, and Ego-Exo4D).
Demonstrates strong generalization to new objects, actions, and scenes.
Shows robustness on in-the-wild examples.
Abstract
Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR) and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues, synchronized multi-view settings, and unrealistic assumptions such as the necessity of an initial egocentric frame and relative camera poses during inference. To overcome these challenges, we introduce EgoWorld, a novel framework that reconstructs an egocentric view from rich exocentric observations, including point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the…
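The geometric step the abstract describes (lifting an exocentric depth map to a point cloud and reprojecting it into another view) can be illustrated with a minimal pinhole-camera sketch. This is not the authors' implementation: the function names, the shared intrinsics `fx, fy, cx, cy`, and the assumption that the exo-to-ego rotation `R` and translation `t` are already known are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject(depth: Sequence[Sequence[float]],
              fx: float, fy: float, cx: float, cy: float) -> List[Vec3]:
    """Lift every pixel with positive depth to a 3D point in the source camera frame."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def transform(points: Sequence[Vec3],
              R: Sequence[Sequence[float]], t: Sequence[float]) -> List[Vec3]:
    """Apply the rigid transform p' = R p + t (assumed exo-to-ego) to each point."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]


def project_sparse(points: Sequence[Vec3],
                   fx: float, fy: float, cx: float, cy: float,
                   width: int, height: int) -> List[List[Optional[float]]]:
    """Project points into a target view, keeping the nearest depth per pixel.

    Pixels that no point lands on stay None, so the result is a sparse map.
    """
    sparse: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:  # behind the target camera
            continue
        u = round(fx * x / z + cx)
        v = round(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            if sparse[v][u] is None or z < sparse[v][u]:
                sparse[v][u] = z
    return sparse
```

With a 2×2 depth map of ones, unit focal lengths, and a translation of one unit along x, the reprojected map keeps only its right column and leaves the left column as `None`. Holes of this kind are what a generative completion stage would have to fill.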
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is well-written and easy to follow.
- The idea of leveraging multimodal information (rich textual supervision, 3D geometric cues, and hand pose data) is simple, intuitive, and effective.
- The evaluation is comprehensive, covering four representative ego–exo datasets.
- **Qualitative Results**: Most of the qualitative examples, except for Fig. 4, are drawn from the H2O dataset. It would be beneficial to include more examples from the other three datasets (TACO, Assembly101, and Ego-Exo4D) to showcase a broader range of scenarios beyond desktop activities. This would help readers better understand how well the proposed approach generalizes across diverse environments.
- **Ablation Study**: In Table 3, the presented ablation results are somew
- The concept of using textual cues to enhance visual synthesis across viewpoints is intuitive and promising.
- Combining sparse maps as partial observations with textual descriptions for completion is interesting.
1. It would be better to highlight the differences from 4Diff and Exo2Ego. 4Diff adopts depth maps/point clouds, while Exo2Ego adopts a hand layout. In addition to depth maps and hand poses, the proposed method adds a textual description as a condition for the LDM, which reads like a combination of the two.
2. The introduction of textual cues is key to the proposed method and provides semantic alignment. However, the quality of the text and its contribution to the results are unclear. It would
* The framework utilizes multi-modal information from the exocentric view, such as textual descriptions, in a plausible way.
* The overall method is presented clearly.
* The evaluation compares the proposed framework to novel view synthesis methods, many of which are outdated and not specialized for the egocentric setup. This is particularly problematic given that methods focusing on the egocentric view exist.
* The egocentric view transform is estimated from the egocentric hand pose predicted from the external view. This would depend heavily on the learned camera setup and on how the device appears from the external view. Using ground-truth data in some evaluation f
