Position Embedding Needs an Independent Layer Normalization
Runyi Yu, Zhennan Wang, Yinhuai Wang, Kehan Li, Yian Zhao, Jian Zhang, Guoli Song, Jie Chen

TL;DR
This paper introduces LaPE, a simple method that independently normalizes position embeddings and token embeddings in Vision Transformers, significantly improving performance and robustness with minimal extra cost.
Contribution
The paper proposes Layer-adaptive Position Embedding (LaPE), which uses independent layer normalization for position and token embeddings, enhancing expressiveness and performance of Vision Transformers.
Findings
LaPE improves accuracy across multiple Vision Transformer models.
LaPE enhances robustness to different position embedding types.
LaPE adds negligible computational overhead.
Abstract
The Position Embedding (PE) is critical for Vision Transformers (VTs) due to the permutation-invariance of the self-attention operation. By analyzing the input and output of each encoder layer in VTs using reparameterization and visualization, we find that the default PE joining method (simply adding the PE and patch embedding together) applies the same affine transformation to the token embedding and the PE, which limits the expressiveness of the PE and hence constrains the performance of VTs. To overcome this limitation, we propose a simple, effective, and robust method. Specifically, we provide two independent layer normalizations for the token embeddings and the PE in each layer, and add their outputs together as the input of each layer's Multi-Head Self-Attention module. Since the method allows the model to adaptively adjust the information of the PE for different layers, we name it Layer-adaptive Position…
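The contrast described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`layer_norm`, `default_join`, `lape_join`), the four-dimensional toy vectors, and the gain and bias values are all assumptions chosen for illustration, and it handles a single embedding vector rather than a full encoder layer. In the default joining, one layer normalization sees the sum of token embedding and PE, so both share one gain and bias. In the LaPE-style joining, each gets its own normalization, so the PE contribution can be rescaled without touching the token path.

```python
import math


def layer_norm(v, gain, bias, eps=1e-6):
    """Layer normalization of one embedding vector (a list of floats)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (a - mean) * inv_std + b for a, g, b in zip(v, gain, bias)]


def default_join(token, pos, gain, bias):
    """Default joining: add the PE to the token embedding, then apply one
    shared layer normalization (same gain and bias for both)."""
    return layer_norm([t + p for t, p in zip(token, pos)], gain, bias)


def lape_join(token, pos, tok_gain, tok_bias, pos_gain, pos_bias):
    """LaPE-style joining (sketch): normalize token embedding and PE with
    independent parameters, then sum the two normalized vectors."""
    tok = layer_norm(token, tok_gain, tok_bias)
    pe = layer_norm(pos, pos_gain, pos_bias)
    return [a + b for a, b in zip(tok, pe)]


# Toy values, assumed for illustration only.
token = [0.5, -1.0, 2.0, 0.0]
pos = [0.1, 0.2, -0.3, 0.4]
ones, zeros = [1.0] * 4, [0.0] * 4

shared = default_join(token, pos, ones, zeros)

# Doubling only the PE gain changes the PE contribution and leaves the
# token contribution untouched; the shared normalization cannot do this.
boosted = lape_join(token, pos, ones, zeros, [2.0] * 4, zeros)
```

In a real model each encoder layer would hold its own pair of learned gain and bias vectors, which is what lets the PE weighting differ from layer to layer.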
Taxonomy
Topics: Robotics and Sensor-Based Localization · Tactile and Sensory Interactions · Gaze Tracking and Assistive Technology
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Layer Normalization · Dropout · Byte Pair Encoding · Linear Layer · Dense Connections · Feedforward Network · Convolution
