TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Ziyun Zeng; Yixiao Ge; Zhan Tong; Xihui Liu; Shu-Tao Xia; Ying Shan

arXiv:2305.14173 · cs.CV · May 24, 2023 · 2 cites

TL;DR

TVTSv2 introduces a scalable, out-of-the-box video representation model that maintains generalization by freezing shallow text encoder layers, achieving state-of-the-art results without fine-tuning.

Contribution

The paper proposes a degradation-free pre-training strategy for video models that preserves text encoder generalization, enabling effective zero-shot performance.
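
As a rough illustration of what freezing the shallow layers of a text encoder can look like in PyTorch, the sketch below uses a toy stack of Transformer layers. It is not the TVTSv2 code: the helper `freeze_shallow_layers`, the parameter `num_frozen`, the layer sizes, and the choice of how many layers count as "shallow" are all assumptions made for illustration.

```python
import torch
from torch import nn


def freeze_shallow_layers(layers: nn.ModuleList, num_frozen: int) -> None:
    """Disable gradients for the first `num_frozen` layers (hypothetical helper)."""
    for index, layer in enumerate(layers):
        trainable = index >= num_frozen
        for param in layer.parameters():
            param.requires_grad = trainable


# Toy stand-in for a text encoder: 6 Transformer layers with arbitrary sizes.
text_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=32, nhead=4, dim_feedforward=64) for _ in range(6)]
)

# Assumption: treat the first 4 layers as "shallow" and keep them fixed.
freeze_shallow_layers(text_layers, num_frozen=4)

# Only parameters that still require gradients are handed to the optimizer.
trainable_params = [p for p in text_layers.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

# Count layers in which no parameter is trainable.
frozen_count = sum(
    1 for layer in text_layers if not any(p.requires_grad for p in layer.parameters())
)
```

Passing only the trainable parameters to the optimizer keeps the frozen layers out of the update step entirely, so their weights stay exactly as they were at initialization or loading time.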

Findings

01 Achieves state-of-the-art results on multiple video benchmarks.

02 Outperforms recent models like ImageBind and InternVideo.

03 Maintains high performance with a frozen backbone.
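
For context, out-of-the-box (zero-shot) use of a frozen video-text model is commonly done by comparing a video embedding against text embeddings of candidate labels. The sketch below shows that generic, CLIP-style recipe in plain Python. The paper's exact evaluation protocol is not given here, so the function names, the labels, and the three-dimensional embeddings are made up for illustration.

```python
import math
from typing import Dict, List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(
    video_embedding: List[float], label_embeddings: Dict[str, List[float]]
) -> str:
    """Return the label whose text embedding is most similar to the video embedding."""
    video = l2_normalize(video_embedding)
    scores = {}
    for label, text_embedding in label_embeddings.items():
        text = l2_normalize(text_embedding)
        scores[label] = sum(v * t for v, t in zip(video, text))
    return max(scores, key=scores.get)


# Hypothetical embeddings; a real model would produce these from frozen encoders.
label_embeddings = {
    "playing guitar": [0.9, 0.1, 0.0],
    "swimming": [0.0, 0.2, 0.9],
}
video_embedding = [0.8, 0.3, 0.1]

predicted = zero_shot_classify(video_embedding, label_embeddings)
print(predicted)
```

No parameter is updated at any point in this procedure, which is why the quality of the frozen embeddings determines the result.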

Abstract

The ultimate goal for foundation models is to be task-agnostic, i.e., to support out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it is still challenging for video models to reach this goal due to the increasing uncertainty of spatiotemporal signals. To ease training, existing works leverage image foundation models' prior knowledge and equip them with efficient temporal modules. Despite their satisfactory fine-tuning performance, we empirically find that they fall short of out-of-the-box usage: under zero-shot and linear protocols, their performance is even degraded compared to their baseline counterparts. In this work, we analyze the factor that leads to this degradation from the perspective of language supervision distortion. We argue that tuning a text encoder end-to-end, as done in previous…

Code & Models

Repositories

tencentarc/tvts (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: InternVideo: General Video Foundation Models via Generative and Discriminative Learning