Maximizing the Position Embedding for Vision Transformers with Global Average Pooling

Wonjun Lee; Bumsub Ham; Suhyun Kim

arXiv:2502.02919·cs.CV·February 6, 2025·3 cites

Open Access

TL;DR

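This paper introduces MPVG, a method that maximizes the effectiveness of position embeddings in layer-wise vision transformers that use global average pooling (GAP) in place of the class token. It builds on the observation that position embeddings counterbalance token embedding values at each layer, and it outperforms existing approaches.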

Contribution

Proposes MPVG, a novel technique to maximize position embedding effectiveness in layer-wise vision transformers with GAP, addressing counterbalancing limitations.

Findings

01. MPVG improves vision transformer performance across multiple tasks.

02. Position embeddings serve a counterbalancing role in layer-wise structures.

03. Maximizing PE effectiveness leads to significant accuracy gains.

Abstract

In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, the expressiveness of PE is limited because the PE is simply added to the token embedding. A layer-wise method that delivers PE to each layer and applies independent Layer Normalizations to the token embedding and the PE has been adopted to overcome this limitation. In this paper, we identify a conflicting result that occurs in the layer-wise structure when global average pooling (GAP) is used instead of the class token. To overcome this problem, we propose MPVG, which maximizes the effectiveness of PE in a layer-wise structure with GAP. Specifically, we identify that PE counterbalances token embedding values at each layer in a layer-wise structure. Furthermore, we recognize that the…
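The layer-wise structure with GAP that the abstract refers to can be sketched roughly as follows. This is only an illustration of the baseline described above (PE delivered to every layer, separate normalizations for tokens and PE, mean pooling over tokens at the end). It is not the authors' MPVG method. The function names, the `placeholder_block` stand-in for attention and MLP, the exact placement of the residual connection, and the omission of learned scale/shift parameters are all assumptions made for the sake of a small runnable example.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance.
    Assumption: learned scale/shift are omitted; in a real model the token
    and PE normalizations would each carry their own learned parameters."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def placeholder_block(tokens):
    """Hypothetical stand-in for attention + MLP: mixes each token with the token mean."""
    dim = len(tokens[0])
    mean_tok = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * (t[d] + mean_tok[d]) for d in range(dim)] for t in tokens]

def layerwise_forward(tokens, pos_embed, num_layers=2):
    """Deliver the PE to every layer instead of adding it once at the input."""
    x = [list(t) for t in tokens]
    for _ in range(num_layers):
        # Separate normalization calls for the token embeddings and for the PE.
        normed_x = [layer_norm(t) for t in x]
        normed_pe = [layer_norm(p) for p in pos_embed]
        h = [[a + b for a, b in zip(t, p)] for t, p in zip(normed_x, normed_pe)]
        out = placeholder_block(h)
        # Residual connection around the block.
        x = [[a + b for a, b in zip(t, o)] for t, o in zip(x, out)]
    return x

def global_average_pool(tokens):
    """GAP: average over the token axis, used here in place of a class token."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

# Toy input: 3 tokens with 4 features each, plus a matching PE table.
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.5, -2.0, 1.0]]
pos_embed = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.0, 0.5, -0.5, 0.2]]
pooled = global_average_pool(layerwise_forward(tokens, pos_embed))
print(len(pooled))
```

The pooled vector has one value per feature and would feed a classification head. Because GAP averages every token's output, the PE added at each layer contributes directly to the pooled representation, which is the setting the abstract studies.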

Taxonomy

Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors · Robotics and Sensor-Based Localization

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Multi-Head Attention · Average Pooling · Vision Transformer · Global Average Pooling