Masked Multi-Query Slot Attention for Unsupervised Object Discovery
Rishav Pramanik, José-Fabian Villa-Vásquez, Marco Pedersoli

TL;DR
This paper introduces a masked multi-query slot attention method for unsupervised object discovery that improves object localization by focusing on salient regions and learning multiple slot sets, tested on PASCAL-VOC 2012.
Contribution
It proposes a novel masking scheme and multi-query slot attention extension that enhance unsupervised object discovery performance.
Findings
Improved object localization accuracy on PASCAL-VOC 2012.
Masking background regions enhances focus on salient objects.
Multi-query approach yields more stable and accurate masks.
Abstract
Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query…
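The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it omits the learned projections, GRU, and MLP of standard slot attention, and it uses random vectors as stand-ins for DINO ViT patch features. The boolean `keep` mask is assumed to be given, since the abstract does not say how background regions are selected, and how the multiple slot sets are combined is likewise left open. The function names `slot_attention_step` and `multi_query_slots` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_step(slots, feats, keep, eps=1e-8):
    """One simplified slot-attention update (no learned weights).

    slots: (K, D) current slot vectors
    feats: (N, D) patch features (stand-ins for DINO ViT features)
    keep:  (N,) bool, True where a feature is kept (assumed given)
    """
    d = feats.shape[1]
    logits = slots @ feats.T / np.sqrt(d)   # (K, N) slot-feature similarity
    attn = softmax(logits, axis=0)          # slots compete for each feature
    attn = attn * keep[None, :]             # masked features contribute nothing
    # Each slot becomes the attention-weighted mean of the kept features.
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return weights @ feats, attn


def multi_query_slots(feats, keep, num_queries=3, num_slots=4, iters=3, seed=0):
    """Run several independently initialised slot sets over the same features."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(num_queries):
        # Assumption: each query set starts from its own random initialisation.
        slots = rng.normal(size=(num_slots, feats.shape[1]))
        for _ in range(iters):
            slots, attn = slot_attention_step(slots, feats, keep)
        results.append((slots, attn))
    return results
```

Calling `multi_query_slots(feats, keep)` with a `(16, 8)` feature array and a mask that drops the first four features returns three `(slots, attn)` pairs, where each `attn` has zero columns for the dropped features and the remaining columns sum to one across slots.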
Taxonomy
Topics: Data Management and Algorithms · Web Data Mining and Analysis · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Layer Normalization · Multi-Head Attention · Dense Connections · Residual Connection · Softmax · Vision Transformer · self-DIstillation with NO labels
