Compositional Video Synthesis by Temporal Object-Centric Learning
Adil Kaan Akan, Yucel Yemez

TL;DR
This paper introduces an object-centric framework for video synthesis that produces high-quality, temporally coherent videos with editable object components, advancing the state of the art in controllable video generation.
Contribution
It extends the authors' earlier image model, SlotAdapt, to video by learning temporally consistent object-centric slots and pairing them with pretrained diffusion models for synthesis and editing.
Findings
Sets new benchmarks in video quality and temporal coherence.
Enables intuitive object editing like insertion, deletion, and replacement.
Maintains consistent object identities across frames.
Abstract
We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations, extending our previous work, SlotAdapt, from images to video. While existing object-centric approaches either lack generative capabilities entirely or treat video sequences holistically, thus neglecting explicit object-level structure, our approach explicitly captures temporal dynamics by learning pose-invariant object-centric slots and conditioning them on pretrained diffusion models. This design enables high-quality, pixel-level video synthesis with superior temporal coherence, and offers intuitive compositional editing capabilities such as object insertion, deletion, or replacement, maintaining consistent object identities across frames. Extensive experiments demonstrate that our method sets new benchmarks in video generation quality and temporal…
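The compositional edits named in the abstract (insertion, deletion, replacement) can be pictured as operations on a per-frame set of slot vectors. The following is a minimal sketch of that idea only, not the authors' implementation: the array layout `(T, K, D)` (frames, slots, slot dimension), the function names, and the random stand-in data are all assumptions made for illustration, and the diffusion decoder that would turn edited slots back into pixels is omitted.

```python
import numpy as np

def delete_slot(slots, k):
    """Drop object k from every frame. slots has shape (T, K, D)."""
    return np.delete(slots, k, axis=1)

def insert_slot(slots, new_slot):
    """Append one object track of shape (T, D) to every frame."""
    return np.concatenate([slots, new_slot[:, None, :]], axis=1)

def replace_slot(slots, k, new_slot):
    """Swap object k for another object track of shape (T, D)."""
    out = slots.copy()  # leave the input untouched
    out[:, k, :] = new_slot
    return out

# Hypothetical stand-in data: 4 frames, 3 objects, 8-dimensional slots.
rng = np.random.default_rng(0)
T, K, D = 4, 3, 8
slots = rng.normal(size=(T, K, D))
donor = rng.normal(size=(T, D))  # an object track taken from another video

removed = delete_slot(slots, 1)
added = insert_slot(slots, donor)
swapped = replace_slot(slots, 0, donor)
print(removed.shape, added.shape, swapped.shape)  # (4, 2, 8) (4, 4, 8) (4, 3, 8)
```

Because each edit acts on the same slot index in every frame, the edited object keeps one identity across the clip, which is the property the abstract describes as consistent object identities across frames.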
