EgoM2P: Egocentric Multimodal Multitask Pretraining

Gen Li; Yutong Chen; Yiqian Wu; Kaifeng Zhao; Marc Pollefeys; Siyu Tang

arXiv:2506.07886 · cs.CV · July 22, 2025

Open Access

TL;DR

EgoM2P is a large, efficient, multimodal pretraining framework for egocentric vision that supports multiple perception and synthesis tasks, outperforming specialized models and enabling comprehensive understanding of first-person video data.

Contribution

The paper introduces EgoM2P, a novel masked modeling framework with temporal tokenizers for egocentric multimodal pretraining, supporting diverse tasks with improved speed and performance.

Findings

01. EgoM2P matches or exceeds specialist models in various tasks.

02. EgoM2P is an order of magnitude faster than existing models.

03. The framework effectively supports multitasking in egocentric vision.

Abstract

Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal…

Taxonomy

Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · Advanced Vision and Imaging

Methods: Attentive Walk-Aggregating Graph Neural Network · Sparse Evolutionary Training