MOSO: Decomposing MOtion, Scene and Object for Video Prediction

Mingzhen Sun; Weining Wang; Xinxin Zhu; Jing Liu

arXiv:2303.03684 · cs.CV · March 17, 2023 · 1 citation

TL;DR

MOSO is a two-stage framework that decomposes a video into motion, scene, and object components for video prediction. It also extends to unconditional video generation and frame interpolation, and reports state-of-the-art results on five benchmarks.

Contribution

The paper presents MOSO, a decomposition-based framework that pairs a VQVAE (MOSO-VQVAE) with a Transformer (MOSO-Transformer) for video prediction and generation.

Findings

01 Achieves state-of-the-art performance on five benchmarks.

02 Effectively decomposes videos into meaningful components.

03 Enables realistic video synthesis by combining different objects and scenes.

Abstract

Motion, scene and object are three primary visual components of a video. In particular, objects represent the foreground, scenes represent the background, and motion traces their dynamics. Based on this insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO) for video prediction, consisting of MOSO-VQVAE and MOSO-Transformer. In the first stage, MOSO-VQVAE decomposes a previous video clip into the motion, scene and object components, and represents them as distinct groups of discrete tokens. Then, in the second stage, MOSO-Transformer predicts the object and scene tokens of the subsequent video clip based on the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens. Our framework can be easily extended to unconditional video generation and video frame interpolation tasks. Experimental results demonstrate that…
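
To make the first stage more concrete, the following is a minimal sketch of how component features could be turned into "distinct groups of discrete tokens" by nearest-neighbour lookup in a codebook, which is the standard vector-quantization step in a VQVAE. It is not the authors' implementation. The feature vectors are random stand-ins for encoder outputs, the use of one separate codebook per component is an assumption, and the names `quantize`, `make_codebook`, `DIM` and `CODEBOOK_SIZE` are hypothetical.

```python
import random


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry
    under squared Euclidean distance (the usual VQ lookup)."""
    tokens = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def make_codebook(size, dim, rng):
    """Build a random codebook; in a trained VQVAE these entries are learned."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(size)]


rng = random.Random(0)
DIM = 4             # hypothetical feature dimension
CODEBOOK_SIZE = 8   # hypothetical number of codes per component

# Assumption: one codebook per component, so each token group is distinct.
codebooks = {
    name: make_codebook(CODEBOOK_SIZE, DIM, rng)
    for name in ("motion", "scene", "object")
}

# Random stand-ins for the per-component features an encoder would produce.
features = {
    name: [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(5)]
    for name in codebooks
}

# Each component becomes its own sequence of integer token ids.
tokens = {name: quantize(features[name], codebooks[name]) for name in codebooks}

for name, ids in tokens.items():
    print(name, ids)
```

In the framework described above, a second-stage model would then operate on such token sequences rather than on pixels; that stage is not sketched here.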

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training