Video Prediction Models as General Visual Encoders

James Maier; Nishanth Mohankumar

arXiv:2405.16382·cs.CV·May 28, 2024

Open Access

TL;DR

This paper investigates using open-source video prediction models as general visual encoders for downstream tasks like instance segmentation, leveraging their ability to encode spatial and temporal information effectively.

Contribution

It introduces a novel approach of employing video prediction models as encoders, inspired by human vision principles, to improve scene understanding and segmentation.

Findings

01. Pre-trained video generative models can be adapted for segmentation tasks.

02. The latent spaces of video models encode meaningful motion and spatial information.

03. The results are promising for using generative models in downstream computer vision tasks.

Abstract

This study explores the potential of open-source conditional video generation models as encoders for downstream tasks, focusing on instance segmentation using the BAIR Robot Pushing Dataset. The researchers propose using video prediction models as general visual encoders, leveraging their ability to capture the spatial and temporal information that is essential for tasks such as instance segmentation. Inspired by human vision studies, particularly the Gestalt principle of common fate, the approach aims to develop a latent space representative of motion from images to effectively discern foreground from background information. The researchers utilize a 3D Vector-Quantized Variational Autoencoder (3D VQVAE) video generative encoder model conditioned on an input frame, coupled with downstream segmentation tasks. Experiments involve adapting pre-trained video generative models, analyzing…
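The abstract names a 3D VQVAE but does not describe its internals here. As a rough illustration only, the sketch below shows the generic vector-quantization lookup that VQ-VAE-style models rely on: each latent vector in a time × height × width grid is replaced by the index of its nearest codebook entry. The function names, the toy codebook, and the grid shape are all hypothetical and are not taken from the paper.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def quantize(latents, codebook):
    """Quantize a T x H x W grid of latent vectors (nested lists)
    into a T x H x W grid of codebook indices."""
    return [
        [[nearest_code(vec, codebook) for vec in row] for row in frame]
        for frame in latents
    ]


# Toy codebook with three 2-D entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Toy latent grid: T=2 frames, H=1 row, W=2 columns, 2-D vectors.
latents = [
    [[[0.1, -0.1], [0.9, 1.2]]],
    [[[-0.8, 0.7], [0.4, 0.4]]],
]

indices = quantize(latents, codebook)
print(indices)  # [[[0, 1]], [[2, 0]]]
```

In a segmentation setting like the one the abstract describes, a grid of this kind (or the continuous vectors behind it) would be what a downstream head consumes; how the authors wire that up is not specified in the text above.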

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Anomaly Detection Techniques and Applications