EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

Shuhan Tan; Tushar Nagarajan; Kristen Grauman

arXiv:2301.02217 · cs.CV · January 6, 2023 · 6 cites

TL;DR

EgoDistill is a distillation approach for efficient egocentric video understanding. It combines the semantics of a sparse set of video frames with head motion from lightweight IMU readings, and requires 200x fewer GFLOPs than equivalent video models.

Contribution

The paper introduces EgoDistill, which learns to reconstruct heavy egocentric video clip features from a sparse set of video frames plus IMU readings, together with a self-supervised training strategy for IMU feature learning.

Findings

01 Requires 200x fewer GFLOPs than equivalent video models

02 Outperforms state-of-the-art efficient video understanding methods

03 Demonstrated on the Ego4D and EPICKitchens datasets

Abstract

Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight IMU readings. We further devise a novel self-supervised training strategy for IMU feature learning. Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
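The distillation idea in the abstract can be illustrated with a toy sketch: a lightweight student fuses one frame feature with an IMU feature and is trained to reconstruct the feature that a heavy video model (the teacher) would produce for the whole clip. Everything concrete here is an assumption made for illustration, not the paper's architecture: the feature sizes, the concatenate-then-linear fusion, the mean squared error loss, the learning rate, and the names `student_feature`, `mse`, and `distill_step` are all hypothetical, and the random vectors stand in for real encoder outputs.

```python
import random


def student_feature(frame_feat, imu_feat, W, b):
    """Hypothetical student: concatenate frame and IMU features, then apply y = W x + b."""
    x = frame_feat + imu_feat  # list concatenation
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def mse(pred, target):
    """Mean squared error between the student reconstruction and the teacher feature."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def distill_step(frame_feat, imu_feat, teacher_feat, W, b, lr=0.1):
    """One gradient-descent step on the MSE loss; returns the loss before the update."""
    x = frame_feat + imu_feat
    pred = student_feature(frame_feat, imu_feat, W, b)
    d = len(teacher_feat)
    for i in range(d):
        grad_out = 2.0 * (pred[i] - teacher_feat[i]) / d  # dLoss/dy_i
        for j in range(len(x)):
            W[i][j] -= lr * grad_out * x[j]
        b[i] -= lr * grad_out
    return mse(pred, teacher_feat)


# Stand-in features (assumed sizes): 3-d frame, 2-d IMU, 4-d teacher clip feature.
rng = random.Random(0)
frame_feat = [rng.uniform(-1, 1) for _ in range(3)]
imu_feat = [rng.uniform(-1, 1) for _ in range(2)]
teacher_feat = [rng.uniform(-1, 1) for _ in range(4)]
W = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]
b = [0.0] * 4

losses = [distill_step(frame_feat, imu_feat, teacher_feat, W, b) for _ in range(20)]
final_loss = mse(student_feature(frame_feat, imu_feat, W, b), teacher_feat)
print(f"first loss {losses[0]:.4f}, final loss {final_loss:.4f}")
```

In this sketch the teacher feature is only needed during training; at inference time the student would run on the single frame and the IMU signal alone, which is where a saving in compute of the kind reported above would come from.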

Videos

EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging

Methods: Contrastive Language-Image Pre-training