Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

Chuhan Zhang; Ankush Gupta; Andrew Zisserman

arXiv:2308.07918·cs.CV·August 16, 2023

TL;DR

This paper presents an object-aware decoder for ego-centric videos that enhances spatio-temporal representations by predicting object and hand positions, leading to improved performance in various downstream tasks without requiring explicit object tracking at inference.

Contribution

The paper introduces an object-aware decoder that improves ego-centric video understanding by predicting object and hand positions during training, and that shows strong transfer and grounding capabilities.

Findings

01. Improved zero-shot performance on video-text benchmarks.

02. Enhanced object grounding and bounding box accuracy.

03. Better long-term video understanding in downstream tasks.

Abstract

We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over…
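
The training idea in the abstract (extra position-prediction targets during training, RGB frames only at inference) can be illustrated with a small sketch. This is not the authors' implementation: the L1 box loss, the `box_weight` parameter, and the function names `l1_box_loss` and `training_loss` are all assumptions chosen for illustration, and the main objective is treated as an already computed number.

```python
from typing import Optional, Sequence

# A box is (x_min, y_min, x_max, y_max), assumed normalised to [0, 1].
Box = Sequence[float]


def l1_box_loss(pred: Sequence[Box], target: Sequence[Box]) -> float:
    """Mean absolute error over all coordinates of paired boxes.

    Hypothetical auxiliary loss; the paper may use a different one.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of boxes")
    if not pred:
        return 0.0
    total = sum(abs(p - t) for pb, tb in zip(pred, target) for p, t in zip(pb, tb))
    return total / (4 * len(pred))


def training_loss(
    main_loss: float,
    pred_boxes: Sequence[Box],
    target_boxes: Optional[Sequence[Box]] = None,
    box_weight: float = 1.0,  # assumed weighting, not taken from the paper
) -> float:
    """Add the box term only when hand/object box targets are available."""
    if target_boxes is None:
        # No box supervision for this sample: fall back to the main objective.
        return main_loss
    return main_loss + box_weight * l1_box_loss(pred_boxes, target_boxes)
```

Under these assumptions, the box term only shapes the representation during training; at inference nothing in this sketch needs box inputs, which matches the abstract's statement that the model requires only RGB frames.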

Code & Models

Repositories

chuhanxx/helping_hand_for_egocentric_videos (PyTorch, official)

Videos

Helping Hands: An Object-Aware Ego-Centric Video Recognition Model · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning