Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Chuhan Zhang, Ankush Gupta, Andrew Zisserman

TL;DR
This paper presents an object-aware decoder for ego-centric videos that enhances spatio-temporal representations by predicting hand and object positions during training. It improves performance on a range of downstream tasks while requiring only RGB frames at inference.
Contribution
The paper introduces an object-aware decoder that improves ego-centric video understanding by adding hand and object position prediction as training objectives. The resulting representations show strong transfer and grounding capabilities.
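The idea of adding position prediction as a training-only objective can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the L1 box error, the `aux_weight` parameter, the function names, and the premise that predicted and target boxes are already paired one-to-one. The paper's actual loss and matching procedure are not described in this summary.

```python
def l1_box_loss(pred_boxes, target_boxes):
    """Mean absolute error over box coordinates.

    Boxes are (x1, y1, x2, y2) tuples normalized to [0, 1].
    Assumes pred_boxes[i] is already matched to target_boxes[i].
    """
    total, count = 0.0, 0
    for pred, target in zip(pred_boxes, target_boxes):
        for p, t in zip(pred, target):
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0


def training_loss(main_loss, pred_boxes, target_boxes, aux_weight=1.0):
    """Combine a main loss with a hypothetical auxiliary box-prediction term.

    The auxiliary term is only used during training; nothing here is
    needed at inference, where the model sees RGB frames only.
    """
    return main_loss + aux_weight * l1_box_loss(pred_boxes, target_boxes)


# Toy example: one hand box predicted exactly, one object box slightly off.
pred = [(0.1, 0.2, 0.5, 0.6), (0.4, 0.4, 0.9, 0.8)]
target = [(0.1, 0.2, 0.5, 0.6), (0.5, 0.4, 0.9, 0.9)]

aux = l1_box_loss(pred, target)
total = training_loss(2.0, pred, target, aux_weight=0.5)
print(f"auxiliary box loss: {aux:.4f}, total loss: {total:.4f}")
```

Because the box term only enters the loss, it can be dropped at test time without changing the model's inputs.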
Findings
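Improved zero-shot performance on video-text retrieval and classification benchmarks.
Enhanced object grounding and bounding box accuracy.
Better long-term video understanding when the learned representations are used as input to downstream tasks (e.g. Episodic Memory in Ego4D).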
Abstract
We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over…
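Zero-shot classification, as mentioned in the abstract, is commonly done by comparing a video embedding against text embeddings of the candidate labels and picking the closest one. The sketch below assumes such a shared embedding space; the abstract does not spell out the evaluation protocol, and the function names, labels, and embedding values are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(video_emb, label_embs):
    """Return the label whose text embedding is closest to the video embedding.

    label_embs maps each label string to its (hypothetical) text embedding.
    No training on the target labels is involved; only similarities are compared.
    """
    scores = {label: cosine(video_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-dimensional embeddings standing in for real encoder outputs.
label_embs = {
    "cut onion": [0.9, 0.1, 0.0],
    "wash hands": [0.0, 1.0, 0.2],
    "open drawer": [0.1, 0.0, 1.0],
}
video_emb = [0.8, 0.2, 0.1]

best_label, scores = zero_shot_classify(video_emb, label_embs)
print(best_label)  # prints "cut onion"
```

Video-text retrieval follows the same pattern, except that the similarities are used to rank a pool of candidate captions or videos rather than to pick a single label.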
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
