X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding
Wenqi Zhou, Kai Cao, Hao Zheng, Yunze Liu, Xinyi Zheng, Miao Liu, Per Ola Kristensson, Walterio Mayol-Cuevas, Fan Zhang, Weizhe Lin, Junxiao Shen

TL;DR
X-LeBench introduces a new benchmark dataset for evaluating the understanding of extremely long egocentric videos, addressing a significant gap in existing datasets and highlighting challenges for current models.
Contribution
The paper presents X-LeBench, a novel dataset with realistic, long-duration egocentric videos and a simulation pipeline for comprehensive long-term activity analysis.
Findings
Baseline systems perform poorly on long videos
Challenges include temporal localization and context reasoning
Results highlight the need for more advanced models for long-form video understanding
Abstract
Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short (e.g., minutes to tens of minutes) to moderately long videos, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset meticulously designed to fill this gap by focusing on tasks requiring a comprehensive understanding of extremely long egocentric video recordings. Our X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of…
