EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang

TL;DR
EgoM2P is a large, efficient, multimodal pretraining framework for egocentric vision that supports multiple perception and synthesis tasks, outperforming specialized models and enabling comprehensive understanding of first-person video data.
Contribution
The paper introduces EgoM2P, a novel masked modeling framework with temporal tokenizers for egocentric multimodal pretraining, supporting diverse tasks with improved speed and performance.
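As a rough illustration of masked modeling over several modalities (not the authors' implementation), the sketch below assumes each modality has already been turned into a list of discrete tokens by some tokenizer. The function name `sample_masks`, the token ids, and the input/target budgets are hypothetical and chosen only for this example. It shows how a random subset of the available tokens can serve as visible input while another subset becomes the prediction target, and how a clip with a missing modality can still be used.

```python
import random


def sample_masks(tokens_by_modality, num_inputs, num_targets, seed=None):
    """Split the available tokens into visible inputs and masked targets.

    tokens_by_modality maps a modality name to a list of token ids,
    or to None when that modality was not recorded for this clip.
    All names and sizes here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Collect (modality, position) pairs, skipping missing modalities.
    positions = [
        (name, i)
        for name, toks in tokens_by_modality.items()
        if toks is not None
        for i in range(len(toks))
    ]
    rng.shuffle(positions)
    # The first slice is shown to the model; the next slice must be predicted.
    inputs = positions[:num_inputs]
    targets = positions[num_inputs:num_inputs + num_targets]
    return inputs, targets


# Hypothetical clip: gaze was not captured by this device.
clip = {
    "rgb": [11, 12, 13, 14],
    "depth": [21, 22, 23, 24],
    "camera": [31, 32],
    "gaze": None,
}
inputs, targets = sample_masks(clip, num_inputs=4, num_targets=3, seed=0)
```

Because the split is drawn only from modalities that are present, no pseudo-labels are needed for the missing one. At inference time, a task such as depth estimation would correspond to making all RGB tokens visible and all depth tokens targets.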
Findings
EgoM2P matches or exceeds specialist models in various tasks.
EgoM2P is an order of magnitude faster than existing models.
The framework effectively supports multitasking in egocentric vision.
Abstract
Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal…
Taxonomy
Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · Advanced Vision and Imaging
Methods: Attentive Walk-Aggregating Graph Neural Network · Sparse Evolutionary Training
