Position Embedding Needs an Independent Layer Normalization

Runyi Yu; Zhennan Wang; Yinhuai Wang; Kehan Li; Yian Zhao; Jian Zhang; Guoli Song; Jie Chen

arXiv:2212.05262·cs.CV·December 23, 2022·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces LaPE, a simple method that independently normalizes position embeddings and token embeddings in Vision Transformers, significantly improving performance and robustness with minimal extra cost.

Contribution

The paper proposes Layer-adaptive Position Embedding (LaPE), which uses independent layer normalization for position and token embeddings, enhancing expressiveness and performance of Vision Transformers.

Findings

1. LaPE improves accuracy across multiple Vision Transformer models.

2. LaPE enhances robustness to different position embedding types.

3. LaPE adds negligible computational overhead.

Abstract

The Position Embedding (PE) is critical for Vision Transformers (VTs) due to the permutation-invariance of the self-attention operation. By analyzing the input and output of each encoder layer in VTs using reparameterization and visualization, we find that the default PE joining method (simply adding the PE and patch embedding together) applies the same affine transformation to the token embedding and the PE, which limits the expressiveness of the PE and hence constrains the performance of VTs. To overcome this limitation, we propose a simple, effective, and robust method. Specifically, we provide two independent layer normalizations for token embeddings and PE for each layer, and add them together as the input of each layer's Multi-Head Self-Attention module. Since the method allows the model to adaptively adjust the information of PE for different layers, we name it Layer-adaptive Position…
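The difference between the default joining method and the one described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the official PyTorch code is in the repository listed under Code & Models). The function names `layer_norm`, `default_join`, and `lape_join`, the parameter names, and the array shapes are all assumptions chosen for illustration.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each row of x over the feature axis, then apply an affine map."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def default_join(tokens, pos, gamma, beta):
    """Default joining (as we read the abstract): add the PE to the tokens,
    then normalize the sum, so one shared (gamma, beta) acts on both."""
    return layer_norm(tokens + pos, gamma, beta)


def lape_join(tokens, pos, gamma_x, beta_x, gamma_p, beta_p):
    """LaPE-style joining (illustrative): tokens and PE each get their own
    layer normalization, and the two results are summed before attention."""
    return layer_norm(tokens, gamma_x, beta_x) + layer_norm(pos, gamma_p, beta_p)


# Hypothetical sizes: 4 tokens, 8 feature dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
pos = rng.normal(size=(4, 8))
ones, zeros = np.ones(8), np.zeros(8)

# With separate affine parameters, a layer can rescale the PE on its own,
# here halving its contribution without touching the token branch.
attn_input = lape_join(tokens, pos, ones, zeros, 0.5 * ones, zeros)
```

In this sketch each encoder layer would hold its own `(gamma_p, beta_p)` pair, which is what lets the weight of the positional information vary from layer to layer. In `default_join` no such per-layer control over the PE exists, because the same affine parameters apply to the summed input.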

Code & Models

Repositories

ingrid725/lape (PyTorch, official)

Taxonomy

Topics: Robotics and Sensor-Based Localization · Tactile and Sensory Interactions · Gaze Tracking and Assistive Technology

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Layer Normalization · Dropout · Byte Pair Encoding · Linear Layer · Dense Connections · Feedforward Network · Convolution