MUFASA: A Multi-Layer Framework for Slot Attention

Sebastian Bock; Leonie Schüßler; Krishnakant Singh; Simone Schaub-Meyer; Stefan Roth

arXiv:2602.07544 · cs.CV · February 10, 2026

Open Access

TL;DR

MUFASA enhances unsupervised object segmentation by leveraging multi-layer semantic information from vision transformers, improving accuracy and convergence with minimal overhead.

Contribution

Introduces MUFASA, a lightweight plug-and-play framework that computes slot attention on semantically rich features from multiple ViT layers and fuses the resulting slots into a single object-centric representation for better object segmentation.

Findings

01. Achieves state-of-the-art segmentation results on multiple datasets.

02. Improves training convergence speed.

03. Adds minimal inference overhead.

Abstract

Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their…
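The multi-layer idea described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the slot attention is stripped down (no learned projections, GRU, or MLP), the ViT features are random stand-ins, and the function names `slot_attention` and `multi_layer_slots` are hypothetical. In particular, the fusion step is a plain average over layers, used only as a placeholder; the abstract does not specify the paper's fusion strategy.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, slots, n_iters=3, eps=1e-8):
    """Simplified slot attention (assumption: no learned q/k/v, GRU, or MLP).

    features: (N, D) patch tokens from one encoder layer
    slots:    (K, D) initial slot vectors
    Returns the updated slots (K, D) and the last attention map (K, N).
    """
    d = features.shape[1]
    for _ in range(n_iters):
        logits = slots @ features.T / np.sqrt(d)   # (K, N) slot-token similarity
        attn = softmax(logits, axis=0)             # slots compete for each token
        # Normalize per slot so each update is a weighted mean of tokens.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ features                 # (K, D)
    return slots, attn

def multi_layer_slots(layer_features, init_slots):
    """Run slot attention on each layer's features, then fuse the slots.

    Assumption: fusion is a simple mean over layers (placeholder only).
    Sharing init_slots across layers keeps slot indices roughly comparable.
    """
    per_layer = [slot_attention(f, init_slots)[0] for f in layer_features]
    return np.mean(np.stack(per_layer), axis=0)    # (K, D)

# Stand-in data: 4 encoder layers, 196 patch tokens, 64-dim features, 7 slots.
rng = np.random.default_rng(0)
layer_features = [rng.normal(size=(196, 64)) for _ in range(4)]
init_slots = rng.normal(size=(7, 64))

fused = multi_layer_slots(layer_features, init_slots)
print(fused.shape)
```

In a real pipeline the entries of `layer_features` would be intermediate token features from a pre-trained ViT, and the per-token attention maps returned by `slot_attention` would be reshaped to the patch grid to obtain segmentation masks.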

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications