Conditional Object-Centric Learning from Video
Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff

TL;DR
This paper introduces a weakly-supervised extension of Slot Attention that leverages video dynamics and initial object location hints to improve object segmentation and tracking in realistic synthetic scenes, enabling better generalization and interaction.
Contribution
It proposes a sequential Slot Attention model that is conditioned on simple object location hints and uses optical flow as a training signal, improving object segmentation and tracking in complex video data with minimal supervision.
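To make the idea of hint-conditioned slots concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes each frame is already a list of small feature vectors and that each hint is a vector in the same space. It omits the learned projections, the recurrent slot update, and the optical flow objective of the real model. The names `slot_attention_step` and `track` are hypothetical. The sketch shows only two things: slots compete for inputs through a softmax over slots, and slots initialized from hints are carried from frame to frame.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified attention round (no learned weights, an assumption)."""
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[i][k]: for each input i, a softmax over slots k, so slots compete.
    attn = [softmax([scale * dot(x, s) for s in slots]) for x in inputs]
    new_slots = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        total = sum(weights) + eps
        # Each slot becomes the attention-weighted mean of the inputs.
        new_slots.append([
            sum(w * x[d] for w, x in zip(weights, inputs)) / total
            for d in range(dim)
        ])
    return new_slots, attn


def track(frames, hints, iters=2):
    """Initialize slots from first-frame hints, then reuse them per frame."""
    slots = [list(h) for h in hints]
    masks = []
    for inputs in frames:
        for _ in range(iters):
            slots, attn = slot_attention_step(slots, inputs)
        # Hard assignment: index of the slot with the highest attention.
        masks.append([max(range(len(slots)), key=lambda k: row[k])
                      for row in attn])
    return slots, masks


# Two toy "objects" whose features drift slightly between two frames.
frames = [
    [[4.0, 0.0], [4.2, 0.2], [0.0, 4.0], [0.2, 4.2]],
    [[4.0, 1.0], [4.2, 1.2], [1.0, 4.0], [1.2, 4.2]],
]
hints = [[4.0, 0.0], [0.0, 4.0]]  # one hint per object, first frame only
slots, masks = track(frames, hints)
print(masks)
```

Because each slot starts from a hint, slot index `k` stays tied to the object that hint pointed at, which is what allows a specific object to be queried by supplying its hint.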
Findings
Conditioning on object hints improves segmentation accuracy.
Model generalizes to new objects and backgrounds.
Initial-state conditioning enables querying for specific objects during inference.
Abstract
Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for any supervision. However, such fully-unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly-supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects…
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
