Maximizing the Position Embedding for Vision Transformers with Global Average Pooling
Wonjun Lee, Bumsub Ham, Suhyun Kim

TL;DR
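This paper introduces MPVG, a method that maximizes the effectiveness of position embeddings (PE) in layer-wise vision transformers that use global average pooling (GAP) instead of a class token. It builds on the observation that PE counterbalances token embedding values at each layer, and it is reported to outperform existing approaches.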
Contribution
Proposes MPVG, a technique that maximizes position embedding effectiveness in layer-wise vision transformers with GAP, resolving the conflicting result that arises when GAP replaces the class token in such structures.
Findings
MPVG improves vision transformer performance across multiple tasks.
In a layer-wise structure, position embeddings counterbalance token embedding values at each layer.
Maximizing PE effectiveness leads to significant accuracy gains.
Abstract
In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, in vision transformer structures, there is a limitation in the expressiveness of PE due to the structure where position embedding is simply added to the token embedding. A layer-wise method that delivers PE to each layer and applies independent Layer Normalizations for token embedding and PE has been adopted to overcome this limitation. In this paper, we identify the conflicting result that occurs in a layer-wise structure when using the global average pooling (GAP) method instead of the class token. To overcome this problem, we propose MPVG, which maximizes the effectiveness of PE in a layer-wise structure with GAP. Specifically, we identify that PE counterbalances token embedding values at each layer in a layer-wise structure. Furthermore, we recognize that the…
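The layer-wise structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' MPVG implementation, whose details are cut off above. It shows only the general idea of delivering PE to every layer, normalizing tokens and PE independently, and pooling with GAP at the end. The function names, the `mix_tokens` stand-in for the attention and MLP sub-layers, and the parameter-free layer normalization are all assumptions made for illustration. Plain Python lists are used so the example runs without dependencies.

```python
import math
import random

def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def mix_tokens(tokens):
    """Hypothetical stand-in for attention + MLP: blend each token with the sequence mean."""
    dim = len(tokens[0])
    seq_mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * (t[d] + seq_mean[d]) for d in range(dim)] for t in tokens]

def layer_wise_block(tokens, pos_emb):
    """One layer: normalize tokens and PE separately, add them, mix, apply the residual."""
    # In a trained model each layer would own two LayerNorms with separate
    # learned parameters; here the two calls are simply independent.
    normed = [
        [a + b for a, b in zip(layer_norm(t), layer_norm(p))]
        for t, p in zip(tokens, pos_emb)
    ]
    update = mix_tokens(normed)
    return [[x + u for x, u in zip(t, upd)] for t, upd in zip(tokens, update)]

def global_average_pool(tokens):
    """GAP: average over the token axis, replacing the class token as the image summary."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def forward(tokens, pos_emb, depth):
    """Deliver the same PE to every layer, then pool the final tokens."""
    for _ in range(depth):
        tokens = layer_wise_block(tokens, pos_emb)
    return global_average_pool(tokens)

random.seed(0)
num_tokens, dim, depth = 4, 8, 3  # toy sizes chosen for illustration
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
pos_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
pooled = forward(tokens, pos_emb, depth)
print(len(pooled))
```

In the original vision transformer, by contrast, PE is added once to the token embedding before the first layer, which is the limitation on expressiveness that the abstract refers to.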
Taxonomy
Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors · Robotics and Sensor-Based Localization
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Multi-Head Attention · Average Pooling · Vision Transformer · Global Average Pooling
