EAGLE: Egocentric AGgregated Language-video Engine
Jing Bi, Yunlong Tang, Luchuan Song, Ali Vosoughi, Nguyen Nguyen, and Chenliang Xu

TL;DR
EAGLE introduces a unified egocentric video understanding framework with a large-scale dataset and a multimodal large language model, enabling comprehensive analysis of human activities and intentions from first-person videos.
Contribution
The paper presents EAGLE, a novel integrated model and the first large-scale instruction-tuning dataset for egocentric videos, advancing holistic understanding across multiple tasks.
Findings
EAGLE outperforms existing models in various egocentric video tasks.
The EAGLE-400K dataset enhances model training and generalization.
Proposed evaluation metrics facilitate comprehensive assessment.
Abstract
The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation across tasks such as action recognition, procedure learning, and moment retrieval, coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In response, we introduce the EAGLE (Egocentric AGgregated Language-video Engine) model and the EAGLE-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. EAGLE-400K, the first large-scale instruction-tuning dataset tailored for egocentric video, features 400K diverse samples to enhance a broad spectrum of tasks from activity recognition to procedure knowledge learning. Moreover, EAGLE, a strong video multimodal large language model…
Taxonomy
Methods: Sparse Evolutionary Training · Fragmentation
