Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators
Wentao Zhang, Junliang Guo, Tianyu He, Li Zhao, Linli Xu, Jiang Bian

TL;DR
This paper demonstrates that autoregressive Transformers trained on videos can zero-shot learn and imitate unseen tasks by understanding demonstration videos, enabling in-context task execution without additional training.
Contribution
It shows that autoregressive Transformers trained on video with a self-supervised objective can interpret and imitate video demonstrations in a zero-shot manner, positioning visual signals as an interface through which models interact with the environment.
Findings
Models can generate semantically aligned videos based on demonstration videos.
Imitation capacity improves with model scaling.
The approach enables zero-shot task execution from videos.
Abstract
People interact with the real world largely through visual signals, which are ubiquitous and provide detailed demonstrations. In this paper, we explore utilizing visual signals as a new interface for models to interact with the environment. Specifically, we choose videos as a representative visual signal. By training autoregressive Transformers on video datasets with a self-supervised objective, we find that the model acquires a zero-shot capability to infer the semantics of a demonstration video and to imitate those semantics in an unseen scenario. This allows the models to perform unseen tasks by watching the demonstration video in an in-context manner, without further fine-tuning. To validate the imitation capacity, we design various evaluation metrics including both objective and subjective measures. The results show that our models can generate high-quality video clips that…
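The in-context setup described in the abstract can be pictured as a minimal sketch: demonstration frames and query frames are flattened into one token sequence, and the continuation is generated one token at a time. Everything below is an illustrative assumption rather than the authors' implementation: the names `build_prompt`, `generate`, and `toy_next_token` are hypothetical, frames are assumed to be already converted to discrete token ids, and the toy predictor merely stands in for a trained Transformer.

```python
from typing import Callable, List, Sequence

# Assumption: each frame is already tokenized into a fixed-length list of ids.
Frame = List[int]


def build_prompt(demo_frames: Sequence[Frame], query_frames: Sequence[Frame]) -> List[int]:
    """Flatten demonstration frames, then query frames, into one token sequence."""
    tokens: List[int] = []
    for frame in list(demo_frames) + list(query_frames):
        tokens.extend(frame)
    return tokens


def generate(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    tokens_per_frame: int,
    num_new_frames: int,
) -> List[Frame]:
    """Greedy autoregressive rollout: append one token at a time, then regroup into frames."""
    context = list(prompt)
    for _ in range(tokens_per_frame * num_new_frames):
        context.append(next_token(context))
    new_tokens = context[len(prompt):]
    return [
        new_tokens[i:i + tokens_per_frame]
        for i in range(0, len(new_tokens), tokens_per_frame)
    ]


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a trained model (assumes 4 tokens per frame):
    returns the token at the same position in the previous frame, plus one, mod 16."""
    return (context[-4] + 1) % 16


# The demonstration shows every token increasing by one between frames;
# the query starts from a different frame and is continued in the same way.
demo = [[0, 1, 2, 3], [1, 2, 3, 4]]
query = [[8, 9, 10, 11]]
prompt = build_prompt(demo, query)
continuation = generate(toy_next_token, prompt, tokens_per_frame=4, num_new_frames=2)
```

No weights are updated at any point in this sketch: the demonstration influences the output only by being part of the context, which is the sense in which the task is performed in-context without fine-tuning.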
Taxonomy
Topics: Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Contrastive Language-Image Pre-training · Byte Pair Encoding · Layer Normalization · ALIGN · Label Smoothing
