Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators

Wentao Zhang; Junliang Guo; Tianyu He; Li Zhao; Linli Xu; Jiang Bian

arXiv:2407.07356 · cs.CV · March 20, 2025

Open Access

TL;DR

This paper demonstrates that autoregressive Transformers trained on videos can zero-shot learn and imitate unseen tasks by understanding demonstration videos, enabling in-context task execution without additional training.

Contribution

It introduces a novel approach where autoregressive Transformers learn to interpret and imitate video demonstrations in a zero-shot manner, expanding the capabilities of visual signal-based interaction.

Findings

01. Models can generate semantically aligned videos based on demonstration videos.

02. Imitation capacity improves with model scaling.

03. The approach enables zero-shot task execution from videos.

Abstract

People interact with the real world largely through visual signals, which are ubiquitous and convey detailed demonstrations. In this paper, we explore using visual signals as a new interface for models to interact with the environment. Specifically, we choose videos as a representative visual signal. By training autoregressive Transformers on video datasets with a self-supervised objective, we find that the model acquires an emergent zero-shot capability to infer the semantics of a demonstration video and to imitate those semantics in an unseen scenario. This allows the models to perform unseen tasks by watching the demonstration video in an in-context manner, without further fine-tuning. To validate the imitation capacity, we design various evaluation metrics including both objective and subjective measures. The results show that our models can generate high-quality video clips that…
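The in-context setup described in the abstract can be illustrated with a toy sketch. It assumes that video clips have already been converted into sequences of discrete tokens (the text above does not say how), and it replaces the trained Transformer with a hypothetical stand-in, `toy_next_token`. The helper names `build_prompt` and `generate` are illustrative only and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def build_prompt(demo_tokens: Sequence[int], query_tokens: Sequence[int]) -> List[int]:
    """Concatenate demonstration tokens and query tokens into one context."""
    return list(demo_tokens) + list(query_tokens)


def generate(
    next_token: Callable[[Sequence[int]], int],
    prompt: Sequence[int],
    num_new: int,
) -> List[int]:
    """Greedy autoregressive decoding: append one predicted token at a time."""
    seq = list(prompt)
    for _ in range(num_new):
        seq.append(next_token(seq))
    # Return only the newly generated continuation.
    return seq[len(prompt):]


def toy_next_token(seq: Sequence[int], period: int = 3) -> int:
    """Hypothetical stand-in for a trained model (assumption, not the paper's).

    It repeats the token seen `period` positions earlier, so the
    continuation follows the pattern present in the context.
    """
    return seq[-period]


demo = [1, 2, 3, 1, 2, 3]   # tokens of the demonstration clip (made up)
query = [1, 2, 3]           # tokens of the query clip (made up)
prompt = build_prompt(demo, query)
continuation = generate(toy_next_token, prompt, num_new=4)
print(continuation)
```

No weights are updated at any point: the demonstration influences the output only by being part of the context, which is what "without further fine-tuning" refers to.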


Taxonomy

Topics: Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Contrastive Language-Image Pre-training · Byte Pair Encoding · Layer Normalization · ALIGN · Label Smoothing