Spatial Entropy as an Inductive Bias for Vision Transformers
Elia Peruzzo, Enver Sangineto, Yahui Liu, Marco De Nadai, Wei Bi, Bruno Lepri, Nicu Sebe

TL;DR
This paper introduces a regularization method for Vision Transformers that uses spatial entropy as an auxiliary self-supervised task. The task encourages attention maps to develop a semantic segmentation structure and improves accuracy, especially when training data is limited.
Contribution
It proposes a new spatial entropy-based regularization technique that enhances Vision Transformers without altering their architecture, leveraging self-supervised learning to induce a local spatial bias.
Findings
The regularization improves Vision Transformer accuracy when training on small and medium-sized datasets.
Method matches or exceeds performance of architecture-based local bias methods.
Spatial entropy regularization enhances semantic clustering in attention maps.
Abstract
Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards the development of uniform architectures, shared, e.g., by both the Computer Vision and the Natural Language Processing areas. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence…
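To make the idea of spatial entropy on an attention map concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one common formulation: binarize the map, find connected regions, and compute the Shannon entropy of the region-size distribution. The function names, the mean-value threshold, and the 4-connectivity rule are all illustrative assumptions, and this version is not differentiable, so it could not be used directly as a training loss.

```python
import math
from collections import deque


def connected_component_sizes(mask):
    """Return the sizes of the 4-connected regions of True cells in a 2D grid."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited active cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def spatial_entropy(attention, threshold=None):
    """Entropy of the region-size distribution of a thresholded attention map.

    Assumption: the threshold defaults to the mean attention value.
    """
    flat = [v for row in attention for v in row]
    if threshold is None:
        threshold = sum(flat) / len(flat)
    mask = [[v > threshold for v in row] for row in attention]
    sizes = connected_component_sizes(mask)
    total = sum(sizes)
    if total == 0:
        return 0.0
    return -sum((s / total) * math.log(s / total) for s in sizes)


# One compact blob versus four isolated activations on a 4x4 patch grid.
clustered = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
scattered = [[1, 0, 1, 0], [0, 0, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]]
print(spatial_entropy(clustered), spatial_entropy(scattered))
```

Under these assumptions, a map whose active patches form a single region scores lower than one whose activations are scattered, so minimizing such a quantity would favor spatially grouped attention.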
Taxonomy
Topics: Infrared Target Detection Methodologies · Remote-Sensing Image Classification · Image Processing Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Dense Connections · Absolute Position Encodings · Linear Layer · Label Smoothing · Dropout
