Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Ziyun Zeng; Yuying Ge; Xihui Liu; Bin Chen; Ping Luo; Shu-Tao Xia; Yixiao Ge

arXiv:2209.15280 · cs.CV · March 14, 2023

TL;DR

This paper introduces TVTS (Turning to Video for Transcript Sorting), a pre-training task that uses natural speech transcripts to make spatiotemporal video representations more transferable. It reports gains over existing methods, including +13.6% over VideoMAE on SSV2 with linear probing.

Contribution

The paper proposes a self-supervised pretext task that sorts shuffled speech (ASR) transcripts by attending to learned video representations, improving spatiotemporal video understanding without relying on descriptive captions.

Findings

01

+13.6% gains over VideoMAE on SSV2 with linear probing

02

Effective use of natural speech transcripts for video pretraining

03

Strong out-of-the-box performance on diverse benchmarks

Abstract

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that it is due to the fact that they only capture pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e.,…
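The transcript-sorting idea in the abstract can be illustrated with a minimal sketch. It only shows how a training sample and its ordering targets might be built, and how predicted positions could be scored. The helper names (`make_sorting_sample`, `sorting_loss`), the per-segment cross-entropy objective, and the example transcript are assumptions made for illustration, not the authors' implementation; in the actual method the scores would come from a model that attends to video features, whereas here they are supplied by hand.

```python
import math
import random


def make_sorting_sample(transcripts, rng):
    """Shuffle chronologically ordered ASR segments (hypothetical helper).

    Returns (shuffled, targets), where targets[i] is the original
    position of shuffled[i] in the chronological order.
    """
    order = list(range(len(transcripts)))
    rng.shuffle(order)
    shuffled = [transcripts[j] for j in order]
    return shuffled, order


def sorting_loss(logits, targets):
    """Mean cross-entropy over segments (an assumed objective).

    logits[i][k] is the score that shuffled segment i belongs at
    chronological position k.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


# Made-up transcript segments in their true temporal order.
segments = [
    "crack two eggs into a bowl",
    "whisk until smooth",
    "pour the mixture into the pan",
    "flip after about a minute",
]
shuffled, targets = make_sorting_sample(segments, random.Random(0))

# Placing each shuffled segment at its target position restores the order.
restored = [None] * len(segments)
for segment, position in zip(shuffled, targets):
    restored[position] = segment

# Uninformative scores (all zeros) versus confident, correct scores.
uniform_logits = [[0.0] * len(segments) for _ in segments]
perfect_logits = [
    [50.0 if k == t else 0.0 for k in range(len(segments))] for t in targets
]
uniform_loss = sorting_loss(uniform_logits, targets)
perfect_loss = sorting_loss(perfect_logits, targets)
```

With uninformative scores the loss equals the log of the number of segments, and it approaches zero as the scores concentrate on the correct positions. The pretext task relies on that gap: a model can only close it by using the video to work out when each sentence was spoken.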

Code & Models

Repositories

tencentarc/tvts
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Human Pose and Action Recognition