From My View to Yours: Ego-to-Exo Transfer in VLMs for Understanding Activities of Daily Living
Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das

TL;DR
This paper introduces Ego2ExoVLM, a vision-language model that learns to infer egocentric properties from exocentric videos, enhancing understanding of daily activities without requiring egocentric cameras.
Contribution
It proposes a novel training framework with sequence distillation and adaptive tokens, and introduces Ego-in-Exo Perception, a benchmark for egocentric understanding from exocentric videos.
Findings
Achieves state-of-the-art results on the ADL-X benchmark.
Outperforms strong baselines on the Ego-in-Exo Perception benchmark.
Effectively transfers egocentric knowledge to exocentric video understanding.
Abstract
Vision Language Models (VLMs) have achieved strong performance across diverse video understanding tasks. However, their viewpoint-invariant training limits their ability to understand egocentric properties (e.g., human-object interactions) from exocentric video observations. This limitation is critical for many applications, such as Activities of Daily Living (ADL) monitoring, where the understanding of egocentric properties is essential and egocentric cameras are impractical to deploy. To address this limitation, we propose Ego2ExoVLM, a VLM that learns to infer egocentric properties from exocentric videos by leveraging time-synchronized ego-exo videos during training. Ego2ExoVLM accomplishes this through two components: Ego2Exo Sequence Distillation, which transfers knowledge from an egocentric teacher to an exocentric student, and Ego Adaptive Visual Tokens, designed to…
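The abstract describes Ego2Exo Sequence Distillation only as a transfer of knowledge from an egocentric teacher to an exocentric student; the exact objective is not given in this excerpt. As a rough illustration of the general idea, the sketch below computes a generic token-level knowledge-distillation loss: the mean KL divergence between teacher and student output distributions over a sequence. The function names, the temperature parameter, and the choice of KL divergence are all assumptions made for illustration and should not be read as the paper's actual formulation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def sequence_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token KL(teacher || student) over a sequence.

    Hypothetical helper: teacher_logits would come from a model fed the
    egocentric video, student_logits from a model fed the time-synchronized
    exocentric video. Both are lists of per-token logit lists.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row, temperature)  # teacher log-probabilities
        log_q = log_softmax(s_row, temperature)  # student log-probabilities
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Toy example: a two-token sequence over a three-word vocabulary.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[1.0, 1.0, 0.1], [0.2, 0.4, 0.3]]
loss = sequence_distillation_loss(teacher, student)
```

In a training loop, a loss of this kind would typically be minimized with respect to the student only, with the teacher held fixed, so that the exocentric student is pulled toward the predictions the teacher makes from the egocentric view.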
Taxonomy
Topics: Robotics and Automated Systems · Multimodal Machine Learning Applications · Social Robot Interaction and HRI
Methods: Knowledge Distillation
