Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?
Bill Psomas, Ioannis Kakogeorgiou, Konstantinos Karantzalos, Yannis Avrithis

TL;DR
This paper introduces SimPool, a universal attention-based pooling method that enhances supervised and self-supervised vision transformers by improving performance and generating high-quality attention maps without architectural modifications.
Contribution
The paper develops a generic pooling framework and proposes SimPool, a simple attention-based pooling mechanism that improves transformer performance and attention map quality across supervision types.
Findings
SimPool improves performance on pre-training and downstream tasks.
SimPool generates high-quality attention maps in supervised transformers.
SimPool is effective for both convolutional and transformer encoders.
Abstract
Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different? As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem? In this work, we develop a generic pooling framework and then we formulate a number of existing methods as instantiations. By discussing the properties of each group of methods, we derive SimPool, a simple attention-based pooling mechanism as a replacement of the default one for both convolutional and transformer encoders. We find that, whether supervised or self-supervised, this improves performance on pre-training and downstream tasks and provides attention maps delineating object…
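To make the idea of attention-based pooling at the end of the network concrete, here is a minimal NumPy sketch. It is a generic illustration under stated assumptions, not necessarily SimPool's exact formulation: the function name `attention_pool`, the projection matrices `w_q` and `w_k`, the use of the global average as the initial query, and the scaling by the square root of the feature dimension are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, w_q, w_k):
    """Pool patch features into one vector with a single attention step.

    features: (n_patches, dim) array of encoder outputs.
    w_q, w_k: (dim, dim) projection matrices (hypothetical names).
    Returns the pooled vector (dim,) and the attention weights (n_patches,).
    """
    # Assumption: the query is initialised by global average pooling.
    gap = features.mean(axis=0)                      # (dim,)
    q = gap @ w_q                                    # (dim,)
    k = features @ w_k                               # (n_patches, dim)
    scores = k @ q / np.sqrt(features.shape[1])      # (n_patches,)
    attn = softmax(scores)                           # non-negative, sums to 1
    pooled = attn @ features                         # weighted average of patches
    return pooled, attn

# Toy input: a 14x14 grid of patches with 64-dimensional features.
rng = np.random.default_rng(0)
n_patches, dim = 196, 64
features = rng.normal(size=(n_patches, dim))
w_q = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)

pooled, attn = attention_pool(features, w_q, w_k)
attn_map = attn.reshape(14, 14)  # the spatial attention map obtained as a by-product
```

With both projections set to zero, every patch receives the same weight and the result reduces to plain global average pooling, which is one way to see average pooling as a special case of this kind of mechanism.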
Code & Models
- 🤗 billpsomas/vits_dino_simpool_ep100 (model) · ♡ 1
- 🤗 billpsomas/vits_dino_simpool_no_gamma_ep100 (model)
- 🤗 billpsomas/vits_dino_official_ep100 (model) · 1 dl
- 🤗 billpsomas/resnet50_dino_official_ep100 (model)
- 🤗 billpsomas/convnext_small_dino_official_ep100 (model)
- 🤗 billpsomas/resnet50_dino_simpool_no_gamma_ep100 (model)
- 🤗 billpsomas/convnext_small_dino_simpool_no_gamma_ep100 (model)
- 🤗 billpsomas/resnet50_dino_simpool_ep100 (model)
- 🤗 billpsomas/convnext_small_dino_simpool_ep100 (model)
- 🤗 billpsomas/vits_dino_simpool_ep300 (model)
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Memory and Neural Computing
