Unsupervised Visual Representation Learning by Tracking Patches in Video

Guangting Wang, Yizhou Zhou, Chong Luo, Wenxuan Xie, Wenjun Zeng, and Zhiwei Xiong

arXiv:2105.02545·cs.CV·May 7, 2021·1 cites

TL;DR

This paper introduces Catch-the-Patch (CtP), an unsupervised video representation learning method. Its proxy task, inspired by how visual tracking develops during childhood, is to track image patches through a video, and the resulting features are intended to transfer well to video understanding tasks.

Contribution

It proposes a new pretraining framework using patch tracking in videos, demonstrating superior performance and domain robustness over existing methods.

Findings

01. CtP outperforms other video pretraining methods on benchmarks.

02. Pretrained features are less sensitive to domain gaps.

03. CtP achieves higher action classification accuracy than supervised models on some datasets.

Abstract

Inspired by the fact that human eyes continue to develop tracking ability in early and middle childhood, we propose to use tracking as a proxy task for a computer vision system to learn visual representations. Modelled on the game of catch played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations that would help with video-related tasks. In the proposed pretraining framework, we cut an image patch from a given video and let it scale and move according to a pre-set trajectory. The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame. We discover that using multiple image patches simultaneously brings clear benefits. We further increase the difficulty of the game by randomly making patches invisible. Extensive experiments on mainstream…
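The data-generation step described in the abstract (cut a patch, move and scale it along a pre-set trajectory, and keep the per-frame boxes as regression targets) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `resize_nearest` and `make_ctp_sample`, the linear trajectory, the choice to cut the patch from the first frame, and the `p_invisible` parameter are all assumptions made for illustration, and the caller is assumed to pass boxes that stay inside the frame.

```python
import numpy as np


def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (h, w, c) array (hypothetical helper)."""
    h, w = patch.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows][:, cols]


def make_ctp_sample(video, start_box, end_box, rng=None, p_invisible=0.0):
    """Paste one patch along a linear trajectory (illustrative sketch only).

    video: (T, H, W, C) array. Boxes are (x, y, w, h) in pixels and are
    assumed to lie fully inside the frame.
    Returns the edited video and the (T, 4) integer box targets.
    """
    num_frames = video.shape[0]
    x0, y0, w0, h0 = start_box
    # Assumption: the patch is cut from the first frame at the start box.
    patch = video[0, y0:y0 + h0, x0:x0 + w0].copy()

    # Assumption: position and size are interpolated linearly between boxes.
    alphas = np.linspace(0.0, 1.0, num_frames)[:, None]
    start = np.asarray(start_box, dtype=float)
    end = np.asarray(end_box, dtype=float)
    boxes = np.rint((1.0 - alphas) * start + alphas * end).astype(int)

    out = video.copy()
    for t, (x, y, w, h) in enumerate(boxes):
        # Optionally hide the patch in later frames; the target box is kept,
        # so a model would still have to predict it.
        if t > 0 and rng is not None and rng.random() < p_invisible:
            continue
        out[t, y:y + h, x:x + w] = resize_nearest(patch, h, w)
    return out, boxes
```

A model trained on such samples would take the edited video plus `boxes[0]` as input and regress the remaining rows of `boxes`. Using several patches at once, as the abstract reports to be beneficial, would amount to calling the function repeatedly on the same clip with different boxes.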

Code & Models

Repositories

microsoft/CtP (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning