Bridging the Gap to Real-World Object-Centric Learning
Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, Francesco Locatello

TL;DR
This paper introduces DINOSAUR, an unsupervised object-centric learning model that reconstructs features from models trained in a self-supervised manner, enabling it to scale from simulated data to real-world datasets such as COCO and PASCAL VOC.
Contribution
DINOSAUR demonstrates that feature reconstruction from self-supervised models is sufficient for unsupervised object-centric learning, surpassing existing methods on real-world datasets.
Findings
Outperforms existing image-based object-centric learning models on simulated data
First unsupervised object-centric model to scale to real-world datasets such as COCO and PASCAL VOC
Achieves performance competitive with more complex pipelines
Abstract
Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way. Our approach, DINOSAUR, significantly outperforms existing image-based object-centric learning models on simulated data and is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC. DINOSAUR is…
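Illustrative sketch
The abstract's central idea is that the training target is a set of features from a self-supervised model rather than raw pixels. The sketch below shows one way such a feature reconstruction loss could look. It is a minimal illustration, not the authors' implementation: the function names, the softmax-weighted mixing of per-slot predictions, and the mean squared error are all assumptions made here, and the frozen self-supervised encoder is replaced by a hand-written list of target feature vectors.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def combine_slots(slot_features, slot_logits):
    """Mix per-slot feature predictions into one prediction per patch.

    slot_features[k][n] is the feature vector slot k predicts for patch n.
    slot_logits[k][n] is the unnormalized weight of slot k at patch n.
    Both the layout and the softmax mixing are assumptions for illustration.
    Returns (reconstruction, masks), where masks[n][k] is the weight of
    slot k at patch n.
    """
    num_slots = len(slot_features)
    num_patches = len(slot_features[0])
    dim = len(slot_features[0][0])
    reconstruction, masks = [], []
    for n in range(num_patches):
        # Slots compete for each patch: weights sum to 1 across slots.
        weights = softmax([slot_logits[k][n] for k in range(num_slots)])
        masks.append(weights)
        reconstruction.append([
            sum(weights[k] * slot_features[k][n][d] for k in range(num_slots))
            for d in range(dim)
        ])
    return reconstruction, masks


def feature_reconstruction_loss(reconstruction, target):
    """Mean squared error between reconstructed and target patch features."""
    total, count = 0.0, 0
    for recon_vec, target_vec in zip(reconstruction, target):
        for r, t in zip(recon_vec, target_vec):
            total += (r - t) ** 2
            count += 1
    return total / count


# Toy example: 2 slots, 2 patches, 2-dimensional features.
# The target stands in for the output of a frozen self-supervised encoder.
target = [[1.0, 0.0], [0.0, 1.0]]
slot_features = [
    [[1.0, 0.0], [0.0, 0.0]],  # slot 0 explains patch 0
    [[0.0, 0.0], [0.0, 1.0]],  # slot 1 explains patch 1
]
confident_logits = [[50.0, -50.0], [-50.0, 50.0]]
uniform_logits = [[0.0, 0.0], [0.0, 0.0]]

recon_confident, masks_confident = combine_slots(slot_features, confident_logits)
recon_uniform, masks_uniform = combine_slots(slot_features, uniform_logits)
loss_confident = feature_reconstruction_loss(recon_confident, target)
loss_uniform = feature_reconstruction_loss(recon_uniform, target)
```

In this toy setup the loss is lower when each slot takes responsibility for its own patch than when both slots share every patch equally. The per-patch mixing weights can then be read as a soft assignment of patches to slots.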
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
