Boosting Object Representation Learning via Motion and Object Continuity
Quentin Delfosse, Wolfgang Stammer, Thomas Rothenbacher, Dwarak Vittal, Kristian Kersting

TL;DR
This paper introduces a Motion and Object Continuity (MOC) scheme that enhances unsupervised multi-object detection by leveraging object motion and continuity, leading to better object representations and improved performance in downstream tasks like Atari game playing.
Contribution
The paper proposes a flexible MOC scheme that integrates optical flow and a contrastive loss to improve object representations without requiring new architectures.
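To make the optical-flow part of the contribution concrete, here is a minimal sketch of how a dense flow field could be turned into a rough location prior. It is an illustration under stated assumptions, not the paper's implementation: the function names `motion_prior` and `bounding_box`, the magnitude threshold, and the use of a single bounding box are all hypothetical choices made for this example.

```python
import math

def motion_prior(flow, threshold=0.5):
    """Binary mask marking pixels whose flow magnitude exceeds `threshold`.

    `flow` is an H x W grid (list of rows) of (dx, dy) displacement pairs.
    The threshold value is an assumption chosen for illustration.
    """
    return [
        [1 if math.hypot(dx, dy) > threshold else 0 for dx, dy in row]
        for row in flow
    ]

def bounding_box(mask):
    """Return (top, left, bottom, right) enclosing all moving pixels, or None."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None  # nothing moved, so the flow gives no location hint
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))

# Toy 4 x 4 flow field: two vertically adjacent pixels move one pixel right.
flow = [[(0.0, 0.0)] * 4 for _ in range(4)]
flow[1][2] = (1.0, 0.0)
flow[2][2] = (1.0, 0.0)

mask = motion_prior(flow)
box = bounding_box(mask)
```

In a real pipeline the flow would come from an optical flow estimator, and a prior of this kind would guide the detector toward moving regions rather than replace its own localization.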
Findings
Significant improvements in object discovery and convergence speed.
Enhanced latent object representations for downstream tasks.
Better performance in Atari game playing scenarios.
Abstract
Recent unsupervised multi-object detection models have shown impressive performance improvements, largely attributed to novel architectural inductive biases. Unfortunately, they may produce suboptimal object encodings for downstream tasks. To overcome this, we propose to exploit object motion and continuity, i.e., objects do not pop in and out of existence. This is accomplished through two mechanisms: (i) providing priors on the location of objects through integration of optical flow, and (ii) a contrastive object continuity loss across consecutive image frames. Rather than developing an explicit deep architecture, the resulting Motion and Object Continuity (MOC) scheme can be instantiated using any baseline object detection model. Our results show large improvements in the performances of a SOTA model in terms of object discovery, convergence speed and overall latent object…
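The second mechanism, a contrastive object continuity loss across consecutive frames, can be sketched as follows. This is a hedged illustration only: it swaps in a generic InfoNCE-style objective with cosine similarity, which is not necessarily the formulation used in the paper. The names `continuity_loss` and `cosine`, the temperature value, and the nearest-position matching rule are assumptions introduced for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def continuity_loss(enc_t, enc_next, pos_t, pos_next, temperature=0.1):
    """InfoNCE-style continuity loss between two consecutive frames.

    For each object in frame t, the spatially nearest object in frame t+1
    is treated as the positive (assumed to be the same object); all other
    objects in frame t+1 act as negatives.
    """
    if not enc_t or not enc_next:
        return 0.0
    total = 0.0
    for z, p in zip(enc_t, pos_t):
        # Index of the positive: closest object position in the next frame.
        j_pos = min(range(len(pos_next)), key=lambda j: math.dist(p, pos_next[j]))
        logits = [cosine(z, z_next) / temperature for z_next in enc_next]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[j_pos]
    return total / len(enc_t)

# Two objects that barely move between frames.
pos_t = [(0.0, 0.0), (10.0, 10.0)]
pos_next = [(1.0, 0.0), (10.0, 11.0)]
enc_t = [[1.0, 0.0], [0.0, 1.0]]

# Encodings that stay consistent over time vs. encodings that swap.
loss_consistent = continuity_loss(enc_t, [[1.0, 0.0], [0.0, 1.0]], pos_t, pos_next)
loss_swapped = continuity_loss(enc_t, [[0.0, 1.0], [1.0, 0.0]], pos_t, pos_next)
```

The loss is low when an object keeps a similar encoding from one frame to the next and high when encodings are exchanged between objects, which is the behavior a continuity objective is meant to reward.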
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
