Unsupervised Multi-object Segmentation Using Attention and Soft-argmax
Bruno Sauvalle, Arnaud de La Fortelle

TL;DR
This paper presents an unsupervised architecture for multi-object segmentation that leverages attention mechanisms and a transformer encoder to improve detection and segmentation accuracy in complex scenes.
Contribution
The proposed architecture combines a translation-equivariant attention mechanism, a transformer encoder, and a convolutional autoencoder for unsupervised multi-object detection and segmentation, and outperforms previous methods on complex synthetic benchmarks.
Findings
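Significantly outperforms the state of the art on complex synthetic benchmarks
Translation-equivariant attention predicts object coordinates and associates a feature vector with each object
Transformer encoder handles occlusions and redundant detections
Convolutional autoencoder reconstructs the background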
Abstract
We introduce a new architecture for unsupervised object-centric representation learning and multi-object detection and segmentation, which uses a translation-equivariant attention mechanism to predict the coordinates of the objects present in the scene and to associate a feature vector to each object. A transformer encoder handles occlusions and redundant detections, and a convolutional autoencoder is in charge of background reconstruction. We show that this architecture significantly outperforms the state of the art on complex synthetic benchmarks.
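The soft-argmax named in the title can be illustrated with a small sketch. The code below is not the authors' implementation: it is a minimal NumPy example, under the assumption that object coordinates are read from a 2D attention (score) map as the softmax-weighted mean of pixel positions. The function name `soft_argmax_2d` and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_argmax_2d(scores, temperature=1.0):
    """Differentiable estimate of the peak location of a 2D score map.

    scores: array of shape (H, W), e.g. one attention map (assumed layout).
    Returns (x, y): the softmax-weighted mean of the pixel coordinates.
    """
    h, w = scores.shape
    # Numerically stable softmax over all H*W positions.
    logits = (scores - scores.max()) / temperature
    weights = np.exp(logits)
    weights /= weights.sum()
    # Coordinate grids: ys varies along rows, xs along columns.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return float((weights * xs).sum()), float((weights * ys).sum())


# A score map with one sharp peak at row 2, column 5.
scores = np.zeros((8, 8))
scores[2, 5] = 50.0
x, y = soft_argmax_2d(scores)

# Shifting the map one pixel to the right shifts the estimate by one pixel.
x_shifted, y_shifted = soft_argmax_2d(np.roll(scores, 1, axis=1))
```

Unlike a hard argmax, this weighted mean is differentiable with respect to the scores, so a coordinate prediction of this kind can be trained by gradient descent. For a sharp peak away from the image border, translating the input map translates the predicted coordinates by the same amount, which is the sense of translation equivariance illustrated here.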
Videos
Unsupervised multi-object segmentation using attention and soft-argmax (YouTube)
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
