Learning Long-form Video Prior via Generative Pre-Training
Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou

TL;DR
This paper explores using generative pre-training on tokenized visual location data from long-form videos to learn their implicit priors, introducing a new dataset and demonstrating promising results.
Contribution
It proposes a novel approach of applying GPT to tokenized visual location data for long-form video modeling and introduces the new Storyboard20K dataset.
Findings
Effective learning of long-form video prior demonstrated
Dataset enables better modeling of complex video concepts
Generative pre-training shows promising results for video understanding
Abstract
Concepts involved in long-form videos, such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and continue to pose challenges to be comprehensively learned. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling any kind of text content, even visual locations. Can this manner work for learning long-form video prior? Instead of operating on pixel space, it is efficient to employ visual locations like bounding boxes and keypoints to represent key information in videos, which can be simply discretized and then tokenized for consumption by GPT. Due to the scarcity of suitable data, we create a new dataset called Storyboard20K from movies to serve as a representative. It includes synopses, shot-by-shot keyframes, and fine-grained annotations of film sets and characters with…
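The abstract's step of discretizing visual locations and tokenizing them can be illustrated with a minimal sketch. This is not the authors' implementation: the bin count (`NUM_BINS = 1000`), the `<bin_k>` token format, and the function names `quantize`, `box_to_tokens`, and `tokens_to_box` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumption: the abstract does not state a bin count


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image bounds
    return min(int(ratio * num_bins), num_bins - 1)


def box_to_tokens(
    box: Sequence[float], width: float, height: float, num_bins: int = NUM_BINS
) -> List[str]:
    """Turn an (x0, y0, x1, y1) pixel box into four hypothetical location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width, num_bins),
        quantize(y0, height, num_bins),
        quantize(x1, width, num_bins),
        quantize(y1, height, num_bins),
    ]
    return [f"<bin_{b}>" for b in bins]


def tokens_to_box(
    tokens: Sequence[str], width: float, height: float, num_bins: int = NUM_BINS
) -> Tuple[float, ...]:
    """Invert box_to_tokens, returning each bin's center in pixels."""
    bins = [int(t[len("<bin_"):-1]) for t in tokens]
    sizes = (width, height, width, height)
    return tuple((b + 0.5) / num_bins * s for b, s in zip(bins, sizes))


if __name__ == "__main__":
    tokens = box_to_tokens((100, 200, 300, 400), width=1920, height=1080)
    print(tokens)
    print(tokens_to_box(tokens, width=1920, height=1080))
```

Under these assumptions, each box becomes four tokens that can be interleaved with ordinary text tokens in a sequence, and decoding recovers the coordinates up to one bin width of quantization error. Keypoints could be handled the same way with two tokens per point.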
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
Methods: Attention Is All You Need · Sparse Evolutionary Training · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Dense Connections · Adam · Layer Normalization · Attention Dropout
