Is an Object-Centric Video Representation Beneficial for Transfer?
Chuhan Zhang, Ankush Gupta, Andrew Zisserman

TL;DR
This paper introduces a transformer-based object-centric video recognition model that improves transfer to new tasks by learning object-centric summary vectors and training with a novel trajectory contrast loss.
Contribution
The paper presents a new transformer-based object-centric video model, trained with a trajectory contrast loss, that transfers better to tasks other than the action-classification pre-training task.
Findings
Outperforms prior video representations on unseen objects and environments
Improves low-shot learning of novel classes
Enhances linear probe performance on downstream tasks
Abstract
The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets -- SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens -- we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen…
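The abstract names a trajectory contrast loss but this listing does not give its form. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive loss that pulls each object summary vector toward the embedding of its own trajectory and away from the other trajectories in the clip. The function name `info_nce`, the `temperature` value, and the array shapes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def info_nce(summaries, trajectories, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not the paper's exact loss).

    summaries:    (num_objects, dim) array, one summary vector per object.
    trajectories: (num_objects, dim) array, one trajectory embedding per object.
    Row i of `summaries` is treated as the positive match for row i of
    `trajectories`; all other rows act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    s = summaries / np.linalg.norm(summaries, axis=1, keepdims=True)
    t = trajectories / np.linalg.norm(trajectories, axis=1, keepdims=True)

    # Pairwise similarities scaled by the (assumed) temperature.
    logits = s @ t.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-likelihood of the matching (diagonal) pairs.
    return float(-np.mean(np.diag(log_probs)))


rng = np.random.default_rng(0)
trajectories = rng.normal(size=(4, 16))
# Hypothetical summary vectors that are close to their own trajectories.
summaries = trajectories + 0.01 * rng.normal(size=(4, 16))

loss_matched = info_nce(summaries, trajectories)
# Misalign the pairs by shifting trajectories one position.
loss_shuffled = info_nce(summaries, np.roll(trajectories, 1, axis=0))
```

With correctly paired inputs the loss is lower than with misaligned pairs, which is the behaviour a contrastive objective of this kind is meant to reward.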
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
