What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
Ryota Yoshihashi, Masahiro Kada, Satoshi Ikehata, Rei Kawakami, Ikuro Sato

TL;DR
The paper introduces the What-Where Transformer (WWT), a vision transformer backbone that explicitly separates object appearance and location representations, improving performance on object discovery and localization tasks.
Contribution
The paper proposes a slot-centric, multi-stream architecture that treats tokens as representations of appearance (what) and attention maps as representations of location (where), enabling better localization learning and emergent object discovery.
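To make the tokens-as-what, attention-maps-as-where idea concrete, here is a minimal NumPy sketch of plain single-head cross-attention between a few slot tokens and a grid of patch features. It is an illustration only, not the authors' architecture: the function name `slot_cross_attention`, the sizes, the random inputs, and the absence of learned projections are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_cross_attention(slots, patches):
    """Hypothetical helper: standard scaled dot-product cross-attention.

    slots:   (K, D) slot tokens, standing in for "what" representations
    patches: (N, D) patch features from an image grid
    Returns updated slot tokens (K, D) and attention maps (K, N).
    """
    d = slots.shape[-1]
    logits = slots @ patches.T / np.sqrt(d)  # (K, N) slot-to-patch similarity
    attn = softmax(logits, axis=-1)          # each slot's weights over patches sum to 1
    new_slots = attn @ patches               # (K, D) attention-weighted patch average
    return new_slots, attn

# Assumed toy sizes: 4 slots, 16 channels, an 8x8 patch grid.
rng = np.random.default_rng(0)
K, D, H, W = 4, 16, 8, 8
slots = rng.normal(size=(K, D))
patches = rng.normal(size=(H * W, D))

what, where = slot_cross_attention(slots, patches)
where_maps = where.reshape(K, H, W)  # one spatial map per slot, read as "where"
```

In this reading, each row of `what` is an appearance vector for one slot, and the matching slice of `where_maps` is a spatial distribution over the patch grid that could be inspected directly as a localization map.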
Findings
Emergent multiple object discovery from raw attention maps.
Superior zero-shot object discovery performance.
Enhanced weakly supervised semantic segmentation.
Abstract
Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as…
