On Separate Normalization in Self-supervised Transformers

Xiaohui Chen; Yinkai Wang; Yuanqi Du; Soha Hassoun; Li-Ping Liu

arXiv:2309.12931 · cs.CL · November 30, 2023 · 1 citation

TL;DR

This paper introduces a simple yet effective modification to self-supervised transformer models by using separate normalization layers for tokens and the [CLS] symbol, leading to improved performance across multiple domains.

Contribution

It proposes a novel approach of employing separate normalization layers for tokens and [CLS], which enhances the encoding of global information and improves downstream task results.

Findings

01. 2.7% average performance improvement across domains

02. Better encoding of global context in [CLS] embeddings

03. More uniform distribution of [CLS] embeddings

Abstract

Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the [CLS] symbol and the tokens. We propose in this paper a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance. Our method aims to alleviate the potential negative effects of using the same normalization statistics for both token types, which may not be optimally aligned with their individual roles. We empirically show that by utilizing a separate normalization layer, the [CLS] embeddings can better encode the global contextual information and are distributed more uniformly in their anisotropic space. When replacing…
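The core idea in the abstract, routing the [CLS] symbol and the ordinary tokens through different normalization layers, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the official code is in the PyTorch repository listed under Code & Models). It assumes layer normalization, assumes [CLS] sits at position 0 of the sequence, and uses hypothetical names (`layer_norm`, `shared_norm`, `separate_norm`) with fixed affine parameters that would be learned in a real model. Because layer normalization computes its statistics per token, separation in this sketch only changes the affine parameters; with a normalization that pools statistics across tokens, the statistics themselves would also be kept apart.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each vector over its feature axis, then apply an affine map."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def shared_norm(h, params):
    """Baseline: one set of affine parameters for [CLS] and tokens alike."""
    return layer_norm(h, params["gamma"], params["beta"])


def separate_norm(h, cls_params, tok_params):
    """Separate normalization sketch; position 0 is ASSUMED to hold [CLS]."""
    cls_out = layer_norm(h[:, :1, :], cls_params["gamma"], cls_params["beta"])
    tok_out = layer_norm(h[:, 1:, :], tok_params["gamma"], tok_params["beta"])
    # Re-assemble the sequence in its original order: [CLS] first, then tokens.
    return np.concatenate([cls_out, tok_out], axis=1)


rng = np.random.default_rng(0)
batch, seq_len, dim = 2, 5, 8
h = rng.normal(size=(batch, seq_len, dim))

# Illustrative, hand-picked parameters (learned in practice).
cls_params = {"gamma": np.full(dim, 2.0), "beta": np.zeros(dim)}
tok_params = {"gamma": np.ones(dim), "beta": np.zeros(dim)}

out = separate_norm(h, cls_params, tok_params)
```

In this sketch the token rows of `out` are identical to what the shared layer would produce with the token parameters, while the [CLS] row is scaled by its own `gamma`, which is the degree of freedom the separate layer adds.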

Code & Models

Repositories

tufts-ml/SepNorm (PyTorch, official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis · Neural Networks and Applications