Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos
Gautam Singh, Yi-Fu Wu, Sungjin Ahn

TL;DR
STEVE is a simple yet effective unsupervised object-centric learning model for complex, naturalistic videos. It improves significantly on previous object-centric methods without adding architectural complexity or supervision.
Contribution
The paper introduces a straightforward transformer-based architecture for object-centric learning in videos that handles complex scenes without additional supervision.
Findings
Outperforms previous methods on complex naturalistic videos
Uses a simple architecture without extra supervision
Achieves significant improvements in object-centric learning
Abstract
Unsupervised object-centric learning aims to represent the modular, compositional, and causal structure of a scene as a set of object representations and thereby promises to resolve many critical limitations of traditional single-vector representations such as poor systematic generalization. Although there have been many remarkable advances in recent years, one of the most critical problems in this direction has been that previous methods work only with simple and synthetic scenes but not with complex and naturalistic images or videos. In this paper, we propose STEVE, an unsupervised model for object-centric learning in videos. Our proposed model makes a significant advancement by demonstrating its effectiveness on various complex and naturalistic videos unprecedented in this line of research. Interestingly, this is achieved by neither adding complexity to the model architecture nor…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning · Advanced Vision and Imaging
