Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

TL;DR
This paper introduces TVTS (Turning to Video for Transcript Sorting), a pretraining task that uses natural speech transcripts to improve the transferability of spatiotemporal video representations. It reports a +13.6% linear-probing gain over VideoMAE on SSV2.
Contribution
It proposes a new self-supervised pretext task using natural speech transcripts to enhance spatiotemporal video understanding without relying on descriptive captions.
Findings
+13.6% gain over VideoMAE on SSV2 with linear probing
Effective use of natural speech transcripts for video pretraining
Strong out-of-the-box performance on diverse benchmarks
Abstract
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that it is due to the fact that they only capture pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e.,…
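The transcript-sorting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_sorting_example`, `cross_attend`, `ordering_loss`), the single-head dot-product attention, and the treatment of ordering as a per-segment K-way classification are all assumptions made for illustration, and the toy vectors stand in for learned video and text features.

```python
import math
import random


def make_sorting_example(transcripts, rng):
    """Shuffle consecutive ASR segments and record each one's true position."""
    order = list(range(len(transcripts)))
    rng.shuffle(order)
    shuffled = [transcripts[i] for i in order]
    # order[i] is the original (chronological) index of shuffled[i],
    # so it doubles as the classification target for that segment.
    return shuffled, order


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_vecs, video_vecs):
    """Hypothetical single-head attention: text queries attend to video tokens."""
    dim = len(video_vecs[0])
    scale = math.sqrt(dim)
    attended = []
    for q in text_vecs:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in video_vecs]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, video_vecs)) for d in range(dim)]
        )
    return attended


def ordering_loss(position_logits, targets):
    """Mean K-way cross-entropy, one position prediction per shuffled segment."""
    losses = []
    for logits, target in zip(position_logits, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy usage: four made-up transcript segments from one clip.
rng = random.Random(0)
segments = ["pour the flour", "add two eggs", "whisk it all", "bake for an hour"]
shuffled, targets = make_sorting_example(segments, rng)
```

In a real model, the position logits would come from a prediction head applied to the attended text features, so the ordering loss can only be reduced by making the video representation informative about what happens when.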
Taxonomy
Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Human Pose and Action Recognition
