EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, Xingang Wang

TL;DR
EgoVid-5M is a large-scale, high-quality dataset of 5 million egocentric video clips with detailed annotations, designed to advance research in egocentric video generation and enable new generative models like EgoDreamer.
Contribution
The paper introduces EgoVid-5M, the first high-quality dataset curated specifically for egocentric video generation, with detailed action annotations, and presents EgoDreamer, a model that generates egocentric videos from text descriptions and kinematic control signals.
Findings
EgoVid-5M contains 5 million annotated egocentric clips.
EgoDreamer can generate realistic egocentric videos from textual and kinematic inputs.
The dataset and model facilitate progress in egocentric video synthesis.
Abstract
Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of egocentric viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and…
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Human Motion and Animation
