TL;DR
HumanNet is a large-scale, richly annotated human-centric video dataset designed to advance embodied intelligence by enabling scalable, interaction-aware learning from diverse real-world human activity videos.
Contribution
The paper introduces HumanNet, a diverse, annotated one-million-hour corpus of human activity video, together with a systematic data curation paradigm for scalable embodied learning.
Findings
In the paper's validation tasks, training on 1,000 hours of egocentric video from HumanNet outperforms training on 100 hours of robot data.
HumanNet enables motion-aware and interaction-aware learning through interaction-centric annotations: captions, motion descriptions, and hand and body-related signals.
The dataset supports various applications including representation learning, activity understanding, and human-robot transfer.
Abstract
Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied…
