MOSO: Decomposing MOtion, Scene and Object for Video Prediction
Mingzhen Sun, Weining Wang, Xinxin Zhu, Jing Liu

TL;DR
MOSO is a two-stage framework that decomposes videos into motion, scene, and object components for video prediction, generation, and interpolation, achieving state-of-the-art results on five benchmarks.
Contribution
The paper presents MOSO, a decomposition-based framework that combines a VQVAE (MOSO-VQVAE) with a Transformer (MOSO-Transformer) for video prediction and generation.
Findings
Achieves state-of-the-art performance on five benchmarks.
Effectively decomposes videos into meaningful components.
Enables realistic video synthesis by combining different objects and scenes.
Abstract
Motion, scene and object are three primary visual components of a video. In particular, objects represent the foreground, scenes represent the background, and motion traces their dynamics. Based on this insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO) for video prediction, consisting of MOSO-VQVAE and MOSO-Transformer. In the first stage, MOSO-VQVAE decomposes a previous video clip into the motion, scene and object components, and represents them as distinct groups of discrete tokens. Then, in the second stage, MOSO-Transformer predicts the object and scene tokens of the subsequent video clip based on the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens. Our framework can be easily extended to unconditional video generation and video frame interpolation tasks. Experimental results demonstrate that…
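The first stage described above represents each component as its own group of discrete tokens. As a rough illustration only, and not the authors' implementation, the sketch below shows the nearest-codebook lookup that VQ-VAE-style models generally use to turn continuous features into token indices. The three separate codebooks, the 2-D toy vectors, and the helper names `quantize` and `tokenize_components` are assumptions made for this example.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). Hypothetical helper for illustration."""
    tokens = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def tokenize_components(
    features: Dict[str, List[Vector]],
    codebooks: Dict[str, List[Vector]],
) -> Dict[str, List[int]]:
    """Quantize each component with its own codebook, yielding one
    distinct group of discrete tokens per component."""
    return {name: quantize(vecs, codebooks[name]) for name, vecs in features.items()}


# Toy codebooks and features (assumed values, not taken from the paper).
codebooks = {
    "motion": [[0.0, 0.0], [1.0, 1.0]],
    "scene": [[0.0, 1.0], [1.0, 0.0]],
    "object": [[-1.0, -1.0], [2.0, 2.0]],
}
features = {
    "motion": [[0.1, -0.1], [0.9, 1.2]],
    "scene": [[0.2, 0.8]],
    "object": [[1.5, 1.6], [-0.7, -1.2]],
}

tokens = tokenize_components(features, codebooks)
print(tokens)
```

In a pipeline of this shape, a second-stage sequence model would operate on such token groups rather than on pixels; how MOSO-Transformer does this in detail is not covered by the truncated abstract.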
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image Processing Techniques · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training
