Is an Object-Centric Video Representation Beneficial for Transfer?

Chuhan Zhang; Ankush Gupta; Andrew Zisserman

arXiv:2207.10075 · cs.CV · October 11, 2022

TL;DR

This paper introduces a transformer-based object-centric video recognition model that improves transfer to novel tasks by learning object-centric summary vectors and training with a trajectory contrast loss.

Contribution

The paper presents a new transformer-based object-centric video model with a trajectory contrast loss, enhancing transferability and performance on various downstream tasks.

Findings

01 Outperforms prior video representations on unseen objects and environments

02 Improves low-shot learning of novel classes

03 Enhances linear probe performance on downstream tasks

Abstract

The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets -- SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens -- we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen…
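The abstract names a trajectory contrast loss but does not give its form here. As a rough illustration only, the sketch below assumes a generic InfoNCE-style objective in which each object summary vector should score highest against the trajectory embedding of the same object, with the other trajectories acting as negatives. The function names, the temperature value, and the toy vectors are all hypothetical, and the loss actually used in the paper may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(summaries, trajectories, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's exact loss).

    summaries[i] and trajectories[i] form the positive pair for object i;
    every other trajectory serves as a negative for summaries[i].
    """
    if not summaries or len(summaries) != len(trajectories):
        raise ValueError("need the same non-zero number of summaries and trajectories")
    s = [l2_normalize(v) for v in summaries]
    t = [l2_normalize(v) for v in trajectories]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarities scaled by the temperature.
        logits = [dot(s_i, t_j) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching trajectory.
        total += log_denom - logits[i]
    return total / len(s)


# Toy example: two objects with 2-D embeddings (made-up values).
toy_summaries = [[1.0, 0.0], [0.0, 1.0]]
toy_matched = [[1.0, 0.0], [0.0, 1.0]]
toy_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(toy_summaries, toy_matched)
loss_swapped = info_nce(toy_summaries, toy_swapped)
```

In this toy setup the loss is small when each summary vector aligns with its own object's trajectory and large when the pairing is swapped, which is the kind of pressure a contrast loss of this family applies.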

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning