Unlocking Slot Attention by Changing Optimal Transport Costs
Yan Zhang, David W. Zhang, Simon Lacoste-Julien, Gertjan J. Burghouts, Cees G. M. Snoek

TL;DR
This paper introduces MESH, a novel cross-attention module that enhances slot attention by integrating optimal transport techniques, enabling better handling of dynamic object counts in videos and improving performance on object-centric benchmarks.
Contribution
It establishes a connection between slot attention and optimal transport, and proposes MESH, a new method that combines unregularized and regularized optimal transport for improved object modeling.
Findings
MESH significantly outperforms standard slot attention on multiple benchmarks.
The method effectively handles videos with a dynamic number of objects.
MESH improves tie-breaking in object-centric modeling.
Abstract
Slot attention is a powerful method for object-centric modeling in images and videos. However, its set-equivariance limits its ability to handle videos with a dynamic number of objects because it cannot break ties. To overcome this limitation, we first establish a connection between slot attention and optimal transport. Based on this new perspective we propose MESH (Minimize Entropy of Sinkhorn): a cross-attention module that combines the tiebreaking properties of unregularized optimal transport with the speed of regularized optimal transport. We evaluate slot attention using MESH on multiple object-centric learning benchmarks and find significant improvements over slot attention in every setting.
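The tie problem described in the abstract can be illustrated with a small sketch of entropy-regularized optimal transport. This is not the authors' implementation of MESH. It only shows, under assumed uniform marginals and a hypothetical 2x2 cost matrix, that Sinkhorn iterations return a uniform (tied) plan when two slots have identical costs, and that the entropy of the plan, the quantity MESH's name says it minimizes, drops once the costs differ and the regularization is small. The function names, the `eps` values, and the perturbed costs are illustrative assumptions.

```python
import math

def sinkhorn(cost, eps=1.0, n_iters=200):
    """Entropy-regularized OT with uniform marginals (illustrative sketch)."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: a lower cost gives a larger initial weight.
    plan = [[math.exp(-c / eps) for c in row] for row in cost]
    for _ in range(n_iters):
        # Rescale each row so that it sums to 1/n.
        for i in range(n):
            s = sum(plan[i])
            plan[i] = [p / (s * n) for p in plan[i]]
        # Rescale each column so that it sums to 1/m.
        for j in range(m):
            s = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] /= s * m
    return plan

def entropy(plan):
    """Shannon entropy of a transport plan (natural log)."""
    return -sum(p * math.log(p) for row in plan for p in row if p > 0)

# Two slots with identical costs: the plan cannot prefer either assignment.
tied_cost = [[1.0, 1.0], [1.0, 1.0]]
plan_tied = sinkhorn(tied_cost)

# Hypothetical perturbed costs plus a small eps: the plan becomes sharper.
perturbed_cost = [[1.0, 1.2], [1.2, 1.0]]
plan_sharp = sinkhorn(perturbed_cost, eps=0.05)

print("tied plan:", plan_tied, "entropy:", entropy(plan_tied))
print("sharp plan:", plan_sharp, "entropy:", entropy(plan_sharp))
```

In the tied case every entry of the plan is equal, so the two slots receive the same assignment. In the second case the plan concentrates on the diagonal and its entropy is lower, which is the direction an entropy-minimizing objective would push.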
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
Methods: Concatenated Skip Connection · Softmax · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
