EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding
Shuhan Tan, Tushar Nagarajan, Kristen Grauman

TL;DR
EgoDistill is a distillation approach that reconstructs heavy egocentric video clip features from a sparse set of video frames combined with lightweight IMU head-motion readings, requiring 200x fewer GFLOPs than equivalent video models.
Contribution
The paper introduces EgoDistill, a method that learns to reconstruct the features of a heavy video model from sparse frames and IMU readings, together with a self-supervised training strategy for IMU feature learning.
Findings
Requires 200x fewer GFLOPs than equivalent video models
Outperforms state-of-the-art efficient video understanding methods
Demonstrated on the Ego4D and EPIC-Kitchens datasets
Abstract
Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight IMU readings. We further devise a novel self-supervised training strategy for IMU feature learning. Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPIC-Kitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
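The distillation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear student, the mean-squared-error objective, the feature dimensions, and every name below (`fuse`, `train_student`, `D_IMG`, `D_IMU`, `D_VID`) are assumptions made for illustration, and random vectors stand in for real frame, IMU, and teacher features. It only shows the general shape of the setup: a cheap student sees one frame feature plus an IMU feature and is trained to match the clip feature of a frozen, expensive video model.

```python
import numpy as np

# Hypothetical feature sizes, chosen only for this illustration.
D_IMG, D_IMU, D_VID = 8, 4, 6


def fuse(img_feat, imu_feat):
    """Concatenate a single-frame feature with an IMU (head-motion) feature."""
    return np.concatenate([img_feat, imu_feat], axis=1)


def mse(pred, target):
    """Mean squared error; the loss choice here is an assumption."""
    return float(np.mean((pred - target) ** 2))


def train_student(img_feat, imu_feat, teacher_feat, lr=0.5, steps=200):
    """Fit a linear student that maps fused (frame, IMU) features to the
    teacher's clip features with plain gradient descent."""
    x = fuse(img_feat, imu_feat)
    w = np.zeros((x.shape[1], teacher_feat.shape[1]))
    b = np.zeros(teacher_feat.shape[1])
    losses = []
    for _ in range(steps):
        pred = x @ w + b
        err = pred - teacher_feat
        losses.append(mse(pred, teacher_feat))
        # Gradients of the mean squared error with respect to w and b.
        w -= lr * (2.0 * x.T @ err / err.size)
        b -= lr * (2.0 * err.sum(axis=0) / err.size)
    return w, b, losses


rng = np.random.default_rng(0)
batch = 32
img = rng.normal(size=(batch, D_IMG))      # stand-in for sparse-frame features
imu = rng.normal(size=(batch, D_IMU))      # stand-in for IMU features
teacher = rng.normal(size=(batch, D_VID))  # stand-in for frozen video-model features

w, b, losses = train_student(img, imu, teacher)
student_feat = fuse(img, imu) @ w + b
```

At inference time only the student path runs, which is where the efficiency gain described in the abstract would come from: the heavy video model is needed only to produce training targets.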
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging
Methods: Contrastive Language-Image Pre-training
