Learning Long-form Video Prior via Generative Pre-Training

Jinheng Xie; Jiajun Feng; Zhaoxu Tian; Kevin Qinghong Lin; Yawen Huang; Xi Xia; Nanxu Gong; Xu Zuo; Jiaqi Yang; Yefeng Zheng; Mike Zheng Shou

arXiv:2404.15909·cs.CV·April 25, 2024

TL;DR

This paper explores using generative pre-training on tokenized visual location data from long-form videos to learn their implicit priors, introducing a new dataset and demonstrating promising results.

Contribution

It applies generative pre-training (GPT) to tokenized visual locations, such as bounding boxes and keypoints, to model long-form video, and introduces the new Storyboard20K dataset.

Findings

01 Effective learning of long-form video prior demonstrated

02 Dataset enables better modeling of complex video concepts

03 Generative pre-training shows promising results for video understanding

Abstract

Concepts involved in long-form videos, such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and remain challenging to learn comprehensively. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling any kind of text content, even visual locations. Can this approach work for learning a long-form video prior? Instead of operating in pixel space, it is efficient to employ visual locations such as bounding boxes and keypoints to represent key information in videos, which can be simply discretized and then tokenized for consumption by GPT. Due to the scarcity of suitable data, we create a new dataset called Storyboard20K from movies to serve as a representative. It includes synopses, shot-by-shot keyframes, and fine-grained annotations of film sets and characters with…
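
The discretize-then-tokenize step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`quantize`, `box_to_tokens`, `tokens_to_box`), the `<loc_k>` token format, and the choice of 256 bins are all assumptions made for illustration. It maps each bounding-box coordinate to an integer bin relative to the frame size and emits one token per coordinate.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def quantize(value: float, extent: float, num_bins: int) -> int:
    """Map a coordinate in [0, extent] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / extent, 0.0), 1.0)  # clamp to the frame
    return min(int(ratio * num_bins), num_bins - 1)


def box_to_tokens(box: Box, width: float, height: float,
                  num_bins: int = 256) -> List[str]:
    """Turn one box into four hypothetical location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width, num_bins),
        quantize(y0, height, num_bins),
        quantize(x1, width, num_bins),
        quantize(y1, height, num_bins),
    ]
    return [f"<loc_{b}>" for b in bins]


def tokens_to_box(tokens: List[str], width: float, height: float,
                  num_bins: int = 256) -> Box:
    """Invert box_to_tokens, returning the centre of each bin in pixels."""
    bins = [int(t[len("<loc_"):-1]) for t in tokens]
    centres = [(b + 0.5) / num_bins for b in bins]
    return (centres[0] * width, centres[1] * height,
            centres[2] * width, centres[3] * height)


if __name__ == "__main__":
    tokens = box_to_tokens((160.0, 90.0, 480.0, 270.0), width=640, height=360)
    print(tokens)
    print(tokens_to_box(tokens, width=640, height=360))
```

Under these assumptions the round trip loses at most half a bin width per coordinate, so the bin count trades vocabulary size against localization precision. Keypoints could be handled the same way, with two tokens per point.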

Code & Models

Repositories

showlab/long-form-video-prior
PyTorch · Official

Datasets

Silin1590/VinaBench
dataset · 106 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation

Methods: Attention Is All You Need · Sparse Evolutionary Training · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Dense Connections · Adam · Layer Normalization · Attention Dropout