Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos
Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin

TL;DR
This paper introduces a self-supervised method that combines semantic and temporal-correspondence cues to improve object-centric learning in videos, achieving state-of-the-art results on dense label propagation and promising results in unsupervised video object discovery.
Contribution
It proposes a semantic-aware masked slot attention mechanism that operates on fused semantic features and temporal correspondence maps to better identify object instances.
Findings
Effective identification of multiple object instances with semantic structure
State-of-the-art performance on dense label propagation tasks
Promising results in unsupervised video object discovery
Abstract
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random sampling based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages with a set of shared learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate…
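The abstract contrasts two ways of initializing slots from a shared set of learnable Gaussians: using the mean vectors directly (query slots) and drawing random samples. The following is a minimal NumPy sketch of that distinction, not the authors' implementation. It uses a simplified slot attention update (a weighted mean, with no GRU, MLP, or learned projections), and the names `init_slots`, `slot_attention`, `mu`, and `log_sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_slots(mu, log_sigma, mode, rng):
    """Initialize K slots from K Gaussians with parameters (mu, log_sigma).

    mode="mean":   use the mean vectors directly (deterministic query slots).
    mode="sample": draw mu + sigma * noise (random-sampling initialization).
    """
    if mode == "mean":
        return mu.copy()
    if mode == "sample":
        return mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)
    raise ValueError(f"unknown mode: {mode}")

def slot_attention(features, slots, n_iters=3, eps=1e-8):
    """Simplified slot attention (assumption: no GRU/MLP/projections).

    features: (N, D) array of per-location features.
    slots:    (K, D) array of initial slots.
    Returns the updated slots (K, D) and the attention map (N, K)
    computed in the last iteration.
    """
    d = features.shape[1]
    for _ in range(n_iters):
        logits = features @ slots.T / np.sqrt(d)   # (N, K)
        attn = softmax(logits, axis=1)             # slots compete for each feature
        # Normalize per slot so each slot update is a weighted mean of features.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        slots = weights.T @ features               # (K, D)
    return slots, attn
```

With `mode="mean"` the same input always yields the same slots, which suits a fixed decomposition into semantic components. With `mode="sample"` each call perturbs the slots around the shared means, the kind of random initialization the abstract associates with instance identification.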
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
