How You Move Your Head Tells What You Do: Self-supervised Video Representation Learning with Egocentric Cameras and IMU Sensors

Satoshi Tsutsui; Ruta Desai; Karl Ridgeway

arXiv:2110.01680·cs.CV·October 6, 2021

Open Access

TL;DR

This paper introduces a self-supervised learning method that leverages head-motion data from IMU sensors to learn video representations for recognizing activities in egocentric videos, reducing reliance on labeled data.

Contribution

The work presents a self-supervised learning (SSL) approach that uses head-motion data to learn video representations, improving activity recognition without extensive manual annotation.

Findings

01. Effective activity recognition for humans and dogs.
02. Improved representation quality with self-supervised learning.
03. Reduced need for labeled data.

Abstract

Understanding users' activities from head-mounted cameras is a fundamental task for Augmented and Virtual Reality (AR/VR) applications. A typical approach is to train a classifier in a supervised manner using data labeled by humans. This approach has limitations due to the expensive annotation cost and the closed coverage of activity labels. A potential way to address these limitations is to use self-supervised learning (SSL). Instead of relying on human annotations, SSL leverages intrinsic properties of data to learn representations. We are particularly interested in learning egocentric video representations benefiting from the head-motion generated by users' daily activities, which can be easily obtained from IMU sensors embedded in AR/VR devices. Towards this goal, we propose a simple but effective approach to learn video representation by learning to tell the corresponding pairs of…
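
The abstract above is cut off before it names the pairs. As a rough illustration only, the sketch below assumes the pairs are video clips and time-aligned IMU head-motion segments, which is consistent with the title but not confirmed by the truncated text. It shows how training examples for a binary correspondence task could be built: each clip is paired once with its own IMU segment (label 1) and once with the segment of a different clip (label 0). The function name `make_correspondence_pairs` and the list-based data layout are hypothetical and are not taken from the paper.

```python
import random


def make_correspondence_pairs(video_clips, imu_segments, seed=0):
    """Build (video, imu, label) triples for a binary correspondence task.

    Assumption (not from the paper): video_clips[i] and imu_segments[i]
    were recorded at the same time. Label 1 marks a synchronized pair;
    label 0 marks a clip paired with the IMU segment of a different clip.
    """
    if len(video_clips) != len(imu_segments):
        raise ValueError("expected one IMU segment per video clip")
    n = len(video_clips)
    if n < 2:
        raise ValueError("need at least two clips to form a mismatched pair")

    rng = random.Random(seed)
    pairs = []
    for i in range(n):
        # Positive example: the clip with its own head-motion segment.
        pairs.append((video_clips[i], imu_segments[i], 1))
        # Negative example: draw an index uniformly from all indices except i.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        pairs.append((video_clips[i], imu_segments[j], 0))

    # Shuffle so positives and negatives are interleaved.
    rng.shuffle(pairs)
    return pairs


if __name__ == "__main__":
    clips = ["clip_0", "clip_1", "clip_2"]
    motion = ["imu_0", "imu_1", "imu_2"]
    for video, imu, label in make_correspondence_pairs(clips, motion):
        print(video, imu, label)
```

In a setup like this, a video encoder and an IMU encoder would be trained jointly to predict the label, and the video encoder would then be reused for activity recognition. The encoders and the loss are left out here because the truncated abstract does not describe them.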

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications